Repository: mhahsler/dbscan
Branch: master
Commit: 111f9bc6a376
Files: 154
Total size: 962.1 KB

Directory structure:
gitextract_jkl9o70t/
├── .Rbuildignore
├── .github/
│   └── .gitignore
├── .gitignore
├── DESCRIPTION
├── LICENSE
├── NAMESPACE
├── NEWS.md
├── R/
│   ├── AAA_dbscan-package.R
│   ├── AAA_definitions.R
│   ├── DBCV_datasets.R
│   ├── DS3.R
│   ├── GLOSH.R
│   ├── LOF.R
│   ├── NN.R
│   ├── RcppExports.R
│   ├── broom-dbscan-tidiers.R
│   ├── comps.R
│   ├── dbcv.R
│   ├── dbscan.R
│   ├── dendrogram.R
│   ├── extractFOSC.R
│   ├── frNN.R
│   ├── hdbscan.R
│   ├── hullplot.R
│   ├── jpclust.R
│   ├── kNN.R
│   ├── kNNdist.R
│   ├── moons.R
│   ├── ncluster.R
│   ├── nobs.R
│   ├── optics.R
│   ├── pointdensity.R
│   ├── predict.R
│   ├── reachability.R
│   ├── sNN.R
│   ├── sNNclust.R
│   ├── utils.R
│   └── zzz.R
├── README.Rmd
├── README.md
├── data/
│   ├── DS3.rdata
│   ├── Dataset_1.rda
│   ├── Dataset_2.rda
│   ├── Dataset_3.rda
│   ├── Dataset_4.rda
│   └── moons.rdata
├── data_src/
│   ├── data_DBCV/
│   │   ├── dataset_1.txt
│   │   ├── dataset_2.txt
│   │   ├── dataset_3.txt
│   │   ├── dataset_4.txt
│   │   ├── read_data.R
│   │   └── test_DBCV.R
│   └── data_chameleon/
│       └── read.R
├── dbscan.Rproj
├── inst/
│   └── CITATION
├── man/
│   ├── DBCV_datasets.Rd
│   ├── DS3.Rd
│   ├── NN.Rd
│   ├── comps.Rd
│   ├── dbcv.Rd
│   ├── dbscan-package.Rd
│   ├── dbscan.Rd
│   ├── dbscan_tidiers.Rd
│   ├── dendrogram.Rd
│   ├── extractFOSC.Rd
│   ├── frNN.Rd
│   ├── glosh.Rd
│   ├── hdbscan.Rd
│   ├── hullplot.Rd
│   ├── jpclust.Rd
│   ├── kNN.Rd
│   ├── kNNdist.Rd
│   ├── lof.Rd
│   ├── moons.Rd
│   ├── ncluster.Rd
│   ├── optics.Rd
│   ├── pointdensity.Rd
│   ├── reachability.Rd
│   ├── sNN.Rd
│   └── sNNclust.Rd
├── src/
│   ├── ANN/
│   │   ├── ANN.cpp
│   │   ├── ANN.h
│   │   ├── ANNperf.h
│   │   ├── ANNx.h
│   │   ├── Copyright.txt
│   │   ├── License.txt
│   │   ├── ReadMe.txt
│   │   ├── bd_fix_rad_search.cpp
│   │   ├── bd_pr_search.cpp
│   │   ├── bd_search.cpp
│   │   ├── bd_tree.cpp
│   │   ├── bd_tree.h
│   │   ├── brute.cpp
│   │   ├── kd_dump.cpp
│   │   ├── kd_fix_rad_search.cpp
│   │   ├── kd_fix_rad_search.h
│   │   ├── kd_pr_search.cpp
│   │   ├── kd_pr_search.h
│   │   ├── kd_search.cpp
│   │   ├── kd_search.h
│   │   ├── kd_split.cpp
│   │   ├── kd_split.h
│   │   ├── kd_tree.cpp
│   │   ├── kd_tree.h
│   │   ├── kd_util.cpp
│   │   ├── kd_util.h
│   │   ├── perf.cpp
│   │   ├── pr_queue.h
│   │   └── pr_queue_k.h
│   ├── JP.cpp
│   ├── Makevars
│   ├── RcppExports.cpp
│   ├── UnionFind.cpp
│   ├── UnionFind.h
│   ├── cleanup.cpp
│   ├── connectedComps.cpp
│   ├── dbcv.cpp
│   ├── dbscan.cpp
│   ├── dendrogram.cpp
│   ├── density.cpp
│   ├── frNN.cpp
│   ├── hdbscan.cpp
│   ├── kNN.cpp
│   ├── kNN.h
│   ├── lof.cpp
│   ├── lt.h
│   ├── mrd.cpp
│   ├── mst.cpp
│   ├── mst.h
│   ├── optics.cpp
│   ├── regionQuery.cpp
│   ├── regionQuery.h
│   ├── utilities.cpp
│   └── utilities.h
├── tests/
│   ├── testthat/
│   │   ├── fixtures/
│   │   │   ├── elki_optics.rda
│   │   │   ├── elki_optics_xi.rda
│   │   │   └── test_data.rda
│   │   ├── test-dbcv.R
│   │   ├── test-dbscan.R
│   │   ├── test-fosc.R
│   │   ├── test-frNN.R
│   │   ├── test-hdbscan.R
│   │   ├── test-kNN.R
│   │   ├── test-kNNdist.R
│   │   ├── test-lof.R
│   │   ├── test-mst.R
│   │   ├── test-optics.R
│   │   ├── test-opticsXi.R
│   │   ├── test-predict.R
│   │   └── test-sNN.R
│   └── testthat.R
└── vignettes/
    ├── dbscan.Rnw
    ├── dbscan.bib
    └── hdbscan.Rmd

================================================
FILE CONTENTS
================================================

================================================
FILE: .Rbuildignore
================================================
proj$
^\.Rproj\.user$
^cran-comments\.md$
^appveyor\.yml$
^revdep$
^.*\.o$
^.*\.Rproj$
^LICENSE
README.Rmd
data_src
ignore
^\.github$

================================================
FILE: .github/.gitignore
================================================
*.html

================================================
FILE: .gitignore
================================================
# Generated files
*.o
*.so

# History files
.Rhistory
.Rapp.history
.RData
*.Rcheck

# Example code in package build process
*-Ex.R

# RStudio files
.Rproj.user/

# produced vignettes
vignettes/*.html
vignettes/*.pdf
.Rproj.user

# OS stuff
.DS*

# Personal work directories
Work
ignore
jss

================================================
FILE: DESCRIPTION
================================================
Package: dbscan
Title: Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
    and Related Algorithms
Version: 1.2.4
Date: 2025-12-18
Authors@R: c(
    person("Michael", "Hahsler", email = "mhahsler@lyle.smu.edu",
      role = c("aut", "cre", "cph"),
      comment = c(ORCID = "0000-0003-2716-1405")),
    person("Matthew", "Piekenbrock", role = c("aut", "cph")),
    person("Sunil", "Arya", role = c("ctb", "cph")),
    person("David", "Mount", role = c("ctb", "cph")),
    person("Claudia", "Malzer", role = "ctb")
  )
Description: A fast reimplementation of several density-based algorithms of
    the DBSCAN family. Includes the clustering algorithms DBSCAN
    (density-based spatial clustering of applications with noise) and
    HDBSCAN (hierarchical DBSCAN), the ordering algorithm OPTICS (ordering
    points to identify the clustering structure), shared nearest neighbor
    clustering, and the outlier detection algorithms LOF (local outlier
    factor) and GLOSH (global-local outlier score from hierarchies). The
    implementations use the kd-tree data structure (from library ANN) for
    faster k-nearest neighbor search. An R interface to fast kNN and
    fixed-radius NN search is also provided.
    Hahsler, Piekenbrock and Doran (2019).
License: GPL (>= 2)
URL: https://github.com/mhahsler/dbscan
BugReports: https://github.com/mhahsler/dbscan/issues
Depends: R (>= 3.2.0)
Imports: generics, graphics, Rcpp (>= 1.0.0), stats
Suggests: dendextend, fpc, igraph, knitr, microbenchmark, rmarkdown,
    testthat (>= 3.0.0), tibble
LinkingTo: Rcpp
VignetteBuilder: knitr
Config/testthat/edition: 3
Copyright: ANN library is copyright by University of Maryland, Sunil Arya
    and David Mount. All other code is copyright by Michael Hahsler and
    Matthew Piekenbrock.
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.3

================================================
FILE: LICENSE
================================================
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007

Copyright (C) 2007 Free Software Foundation, Inc.
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble

The GNU General Public License is a free, copyleft license for software and other kinds of works.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions. Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. The precise terms and conditions for copying, distribution and modification follow. TERMS AND CONDITIONS 0. Definitions. "This License" refers to version 3 of the GNU General Public License. "Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. "The Program" refers to any copyrightable work licensed under this License. 
Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations. To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work. A "covered work" means either the unmodified Program or a work based on the Program. To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work. 
A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. Basic Permissions. 
All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. 
When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. 
This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. 
b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. 
A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. "Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. 
But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 7. Additional Terms. "Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. 
Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or d) Limiting the use for publicity purposes of names of licensors or authors of the material; or e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. 
If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 9. Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. 
Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 10. Automatic Licensing of Downstream Recipients. Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 11. Patents. A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. 
The work thus licensed is called the contributor's "contributor version". A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. 
"Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. 
If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Use with the GNU Affero General Public License. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. 
If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. Limitation of Liability. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 17. Interpretation of Sections 15 and 16. 
If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. {one line to give the program's name and a brief idea of what it does.} Copyright (C) {year} {name of author} This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . Also add information on how to contact you by electronic and paper mail. If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode: {project} Copyright (C) {year} {fullname} This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 
This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an "about box". You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see . The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read . ================================================ FILE: NAMESPACE ================================================ # Generated by roxygen2: do not edit by hand S3method(adjacencylist,NN) S3method(adjacencylist,frNN) S3method(adjacencylist,kNN) S3method(as.dendrogram,default) S3method(as.dendrogram,hclust) S3method(as.dendrogram,hdbscan) S3method(as.dendrogram,optics) S3method(as.dendrogram,reachability) S3method(as.reachability,dendrogram) S3method(as.reachability,optics) S3method(augment,dbscan) S3method(augment,general_clustering) S3method(augment,hdbscan) S3method(comps,dist) S3method(comps,frNN) S3method(comps,kNN) S3method(comps,sNN) S3method(glance,dbscan) S3method(glance,general_clustering) S3method(glance,hdbscan) S3method(ncluster,default) S3method(nnoise,default) S3method(nobs,dbscan) S3method(nobs,general_clustering) S3method(nobs,hdbscan) S3method(plot,NN) S3method(plot,hdbscan) S3method(plot,optics) S3method(plot,reachability) S3method(predict,dbscan_fast) S3method(predict,hdbscan) S3method(predict,optics) S3method(print,dbscan_fast) S3method(print,frNN) 
S3method(print,general_clustering) S3method(print,hdbscan) S3method(print,kNN) S3method(print,optics) S3method(print,reachability) S3method(print,sNN) S3method(sort,NN) S3method(sort,frNN) S3method(sort,kNN) S3method(sort,sNN) S3method(tidy,dbscan) S3method(tidy,general_clustering) S3method(tidy,hdbscan) export(adjacencylist) export(as.dendrogram) export(as.reachability) export(augment) export(clplot) export(comps) export(coredist) export(dbcv) export(dbscan) export(extractDBSCAN) export(extractFOSC) export(extractXi) export(frNN) export(glance) export(glosh) export(hdbscan) export(hullplot) export(is.corepoint) export(jpclust) export(kNN) export(kNNdist) export(kNNdistplot) export(lof) export(mrdist) export(ncluster) export(nnoise) export(optics) export(pointdensity) export(sNN) export(sNNclust) export(tidy) import(Rcpp) importFrom(generics,augment) importFrom(generics,glance) importFrom(generics,tidy) importFrom(grDevices,adjustcolor) importFrom(grDevices,chull) importFrom(grDevices,palette) importFrom(graphics,abline) importFrom(graphics,lines) importFrom(graphics,matplot) importFrom(graphics,par) importFrom(graphics,plot) importFrom(graphics,points) importFrom(graphics,polygon) importFrom(graphics,segments) importFrom(graphics,text) importFrom(stats,as.dendrogram) importFrom(stats,dendrapply) importFrom(stats,dist) importFrom(stats,hclust) importFrom(stats,is.leaf) importFrom(stats,nobs) importFrom(stats,prcomp) importFrom(stats,predict) importFrom(utils,tail) useDynLib(dbscan, .registration=TRUE) ================================================ FILE: NEWS.md ================================================ # dbscan 1.2.4 (2025-12-18) ## Bugfixes * dbscan now checks for matrices with 0 rows or 0 columns (reported by maldridgeepa). * Fixed license information for the ANN library header files (reported by Charles Plessy). # dbscan 1.2.3 (2025-08-20) ## Bugfixes * plot.hdbscan gained parameters main, ylab, and leaflab (reported by nhward). 
## Changes * Fixed partial argument matches. # dbscan 1.2.2 (2025-01-24) ## Changes * Removed dependence on the /bits/stdc++.h header. # dbscan 1.2.1 (2025-01-23) ## Changes * Various refactoring by m-muecke. ## New Features * HDBSCAN gained parameter cluster_selection_epsilon to implement clusters selected from Malzer and Baum (2020). * Functions ncluster() and nnoise() were added. * hullplot() now marks noise as x. * Added clplot(). * pointdensity now also accepts a dist object as input and has the new type "gaussian" to calculate a Gaussian kernel estimate. * Added the DBCV index. ## Bugfixes * extractFOSC: Fixed total_score. * Rewrote minimal spanning tree code. # dbscan 1.2-0 (2024-06-28) ## New Features * dbscan now has tidymodels tidiers (glance, tidy, augment). * kNNdistplot can now plot a range of k/minPts values. * added stats::nobs methods for the clusterings. * kNN and frNN now contain the used distance metric. ## Changes * dbscan component dist was renamed to metric. * Removed redundant sort in kNNdistplot (reported by Natasza Szczypien). * Refactoring: use anyNA(x) instead of any(is.na(x)) and many more (by m-muecke). * Reorganized the C++ source code. * README now uses bibtex. * Tests now use testthat edition 3 (m-muecke). # dbscan 1.1-12 (2023-11-28) ## Bugfixes * pointdensity now checks for missing values (reported by soelderer). * Removed C++11 specification. * ANN.cpp: fixed Rprintf warning. # dbscan 1.1-11 (2022-10-26) ## New Features * kNNdistplot gained parameter minPts. * dbscan now retains information on distance method and border points. * HDBSCAN now supports long vectors to work with larger distance matrices. * conversion from dist to kNN and frNN is now more memory efficient. It no longer coerces the dist object into a matrix of double the size, but extracts the distances directly from the dist object. * Better description of how predict uses only Euclidean distances and more error checking. 
* The package now exports a new generic for as.dendrogram(). ## Bugfixes * is.corepoint() now uses the correct epsilon value (reported by Eng Aun). * functions now check for cluster::dissimilarity objects which have class dist but missing attributes. # dbscan 1.1-10 (2022-01-14) ## New Features * is.corepoint() for DBSCAN. * coredist() and mrdist() for HDBSCAN. * find connected components with comps(). ## Changes * reachability plot now shows all undefined distances as a dashed line. ## Bugfixes * memory leak in mrd calculation fixed. # dbscan 1.1-9 (2022-01-10) ## Changes * We now use roxygen2. ## New Features * Added predict for hdbscan (as suggested by moredatapls). # dbscan 1.1-8 (2021-04-26) ## Bugfixes * LOF: fixed numerical issues with k-nearest neighbor distance on Solaris. # dbscan 1.1-7 (2021-04-21) ## Bugfixes * Fixed description of k in kNNdistplot and added minPts argument. * Fixed bug for tied distances in lof (reported by sverchkov). ## Changes * lof: the density parameter was changed to minPts to be consistent with the original paper and dbscan. Note that minPts = k + 1. # dbscan 1.1-6 (2021-02-24) ## Improvements * Improved speed of LOF for large ks (following suggestions by eduardokapp). * kNN: results are now not sorted again for kd-tree queries which is much faster (by a factor of 10). * ANN library: annclose() is now only called once when the package is unloaded. This is in preparation to support persistent kd-trees using external pointers. * hdbscan lost parameter xdist. ## Bugfixes * removed dependence on methods. * fixed problem in hullplot for singleton clusters (reported by Fernando Archuby). * GLOSH now also accepts data.frames. * GLOSH now returns 0 instead of NaN if we have k duplicate points in the data. # dbscan 1.1-5 (2019-10-22) ## New Features * kNN and frNN gained parameter query to query neighbors for points not in the data. 
* sNN gained parameter jp to decide if the shared NN should be counted using the definition by Jarvis and Patrick. # dbscan 1.1-4 (2019-08-05) ## New Features * kNNdist gained parameter all to indicate if a matrix with the distance to all nearest neighbors up to k should be returned. ## Bugfixes * kNNdist now correctly returns the distances to the kth neighbor (reported by zschuster). * dbscan: check eps and minPts parameters to avoid undefined results (reported by ArthurPERE). # dbscan 1.1-3 (2018-11-12) ## Bugfixes * pointdensity was double counting the query point (reported by Marius Hofert). # dbscan 1.1-2 (2018-05-18) ## New Features * OPTICS now calculates eps if it is omitted. ## Bugfixes * Example now only uses igraph conditionally since it is unavailable on Solaris (reported by B. Ripley). # dbscan 1.1-1 (2017-03-19) ## Bugfixes * Fixed problem with constant name on Solaris in ANN code (reported by B. Ripley). # dbscan 1.1-0 (2017-03-18) ## New Features * HDBSCAN was added. * extractFOSC (optimal selection of clusters for HDBSCAN) was added. * GLOSH outlier score was added. * hullplot now uses filled polygons as the default. * hullplot now uses PCA if the data has more than 2 dimensions. * Added NN superclass for kNN and frNN with plot() and adjacencylist(). * Added shared nearest neighbor clustering as sNNclust() and sNN to calculate the number of shared nearest neighbors. * Added pointdensity function. * Unsorted kNN and frNN can now be sorted using sort(). * kNN and frNN now also accept kNN and frNN objects, respectively. This can be used to create a new kNN (frNN) with a reduced k or eps. * Datasets added: DS3 and moons. ## Interface Changes * Improved interface for dbscan() and optics(): ... is now passed on to frNN. * OPTICS clustering extraction methods are now called extractDBSCAN and extractXi. * kNN and frNN are now objects with a print function. * dbscan now also accepts a frNN object as input. 
* jpclust and sNNclust now return a list instead of just the cluster assignments. # dbscan 1.0-0 (2017-02-02) ## New Features * The package now has a vignette. * Jarvis-Patrick clustering is now available as jpclust(). * Improved interface for dbscan() and optics(): ... is now passed on to frNN. * OPTICS clustering extraction methods are now called extractDBSCAN and extractXi. * hullplot now uses filled polygons as the default. * hullplot now uses PCA if the data has more than 2 dimensions. * kNN and frNN are now objects with a print function. * dbscan now also accepts a frNN object as input. # dbscan 0.9-8 (2016-08-05) ## New Features * Added hullplot to plot a scatter plot with added convex cluster hulls. * OPTICS: added a predecessor correction step that is used by the ELKI implementation (Matt Piekenbrock). ## Bugfixes * Fixed a memory problem in frNN (reported by Yilei He). # dbscan 0.9-7 (2016-04-14) * OPTICSXi is now implemented (thanks to Matt Piekenbrock). * DBSCAN now also accepts MinPts (with a capital M) to be compatible with the fpc version. * DBSCAN objects are now also of class dbscan_fast to avoid clashes with fpc. * DBSCAN and OPTICS now have predict functions. * Added test for unhandled NAs. * Fixed LOF for more than k duplicate points (reported by Samneet Singh). # dbscan 0.9-6 (2015-12-14) * OPTICS: fixed second bug reported by Di Pang. * all methods now also accept dist objects and have a search method "dist" which precomputes distances. # dbscan 0.9-5 (2015-10-04) * OPTICS: fixed bug with first observation reported by Di Pang. * OPTICS: clusterings can now be extracted using optics_cut. # dbscan 0.9-4 (2015-09-17) * added tests (testthat). * input data is now checked if it can safely be coerced into a numeric matrix (storage.mode double). * fixed self matches in kNN and frNN (now returns the first NN correctly). # dbscan 0.9-3 (2015-09-02) * Added weights to DBSCAN. # dbscan 0.9-2 (2015-08-11) * Added kNN interface. 
* Added frNN (fixed radius NN) interface. * Added LOF. * Added OPTICS. * All algorithms check now for interrupt (CTRL-C/Esc). * DBSCAN now returns a list instead of a numeric vector. # dbscan 0.9-1 (2015-07-21) * DBSCAN: Improved speed by avoiding repeated sorting of point ids. * Added linear NN search option. * Added fast calculation for kNN distance. * fpc and microbenchmark are now used conditionally in the examples. # dbscan 0.9-0 (2015-07-15) * initial release ================================================ FILE: R/AAA_dbscan-package.R ================================================ #' @keywords internal #' #' @section Key functions: #' - Clustering: [dbscan()], [hdbscan()], [optics()], [jpclust()], [sNNclust()] #' - Outliers: [lof()], [glosh()], [pointdensity()] #' - Nearest Neighbors: [kNN()], [frNN()], [sNN()] #' #' @references #' Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software, 91(1), 1-30. \doi{10.18637/jss.v091.i01} #' #' @import Rcpp #' @importFrom graphics plot points lines text abline polygon par segments matplot #' @importFrom grDevices palette chull adjustcolor #' @importFrom stats dist hclust dendrapply as.dendrogram is.leaf prcomp #' @importFrom utils tail #' #' @useDynLib dbscan, .registration=TRUE "_PACKAGE" ================================================ FILE: R/AAA_definitions.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. 
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. .ANNsplitRule <- c("STD", "MIDPT", "FAIR", "SL_MIDPT", "SL_FAIR", "SUGGEST") .matrixlike <- function(x) { if (is.null(dim(x))) return(FALSE) # check that there is at least one row and one column! if (nrow(x) < 1L) stop("the provided data has 0 rows!") if (ncol(x) < 1L) stop("the provided data has 0 columns!") TRUE } ================================================ FILE: R/DBCV_datasets.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' DBCV Paper Datasets #' #' The four synthetic 2D datasets used in Moulavi et al (2014). 
#' #' @name DBCV_datasets #' @aliases Dataset_1 Dataset_2 Dataset_3 Dataset_4 #' @docType data #' @format Four data frames with the following 3 variables. #' \describe{ #' \item{x}{a numeric vector} #' \item{y}{a numeric vector} #' \item{class}{an integer vector indicating the class label. 0 means noise.} } #' @references Davoud Moulavi and Pablo A. Jaskowiak and #' Ricardo J. G. B. Campello and Arthur Zimek and Jörg Sander (2014). #' Density-Based Clustering Validation. In #' _Proceedings of the 2014 SIAM International Conference on Data Mining,_ #' pages 839-847 #' \doi{10.1137/1.9781611973440.96} #' @source https://github.com/pajaskowiak/dbcv #' @keywords datasets #' @examples #' data("Dataset_1") #' clplot(Dataset_1[, c("x", "y")], cl = Dataset_1$class) #' #' data("Dataset_2") #' clplot(Dataset_2[, c("x", "y")], cl = Dataset_2$class) #' #' data("Dataset_3") #' clplot(Dataset_3[, c("x", "y")], cl = Dataset_3$class) #' #' data("Dataset_4") #' clplot(Dataset_4[, c("x", "y")], cl = Dataset_4$class) NULL ================================================ FILE: R/DS3.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' DS3: Spatial data with arbitrary shapes #' #' Contains 8000 2-d points, with 6 "natural" looking shapes, all of which have #' a sinusoid-like shape that intersects with each cluster. #' The data set was originally used as a benchmark data set for the Chameleon clustering #' algorithm (Karypis, Han and Kumar, 1999) to #' illustrate a data set containing arbitrarily shaped #' spatial data surrounded by both noise and artifacts. #' #' @name DS3 #' @docType data #' @format A data.frame with 8000 observations on the following 2 columns: #' \describe{ #' \item{X}{a numeric vector} #' \item{Y}{a numeric vector} #' } #' #' @references Karypis, George, Eui-Hong Han, and Vipin Kumar (1999). #' Chameleon: Hierarchical clustering using dynamic modeling. _Computer_ #' 32(8): 68-75. #' @source Obtained from \url{http://cs.joensuu.fi/sipu/datasets/} #' @keywords datasets #' @examples #' data(DS3) #' plot(DS3, pch = 20, cex = 0.25) NULL ================================================ FILE: R/GLOSH.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler, Matthew Piekenbrock # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Global-Local Outlier Score from Hierarchies #' #' Calculate the Global-Local Outlier Score from Hierarchies (GLOSH) score for #' each data point using a kd-tree to speed up kNN search. #' #' GLOSH compares the density of a point to the densities of the points associated #' with its current and child clusters (if any). Points that have a substantially #' lower density than the density mode (cluster) they most associate with are #' considered outliers. GLOSH is computed from a hierarchy of clusters. #' #' Specifically, consider a point \emph{x} and a density or distance threshold #' \emph{lambda}. GLOSH is calculated by taking 1 minus the ratio of how long #' any of the child clusters of the cluster \emph{x} belongs to "survives" #' changes in \emph{lambda} to the highest \emph{lambda} threshold of x, above #' which x becomes a noise point. #' #' Scores close to 1 indicate outliers. For more details on the motivation for #' this calculation, see Campello et al (2015). #' #' @aliases glosh GLOSH #' @family Outlier Detection Functions #' #' @param x an [hclust] object, data matrix, or [dist] object. #' @param k size of the neighborhood. #' @param ... further arguments are passed on to [kNN()]. #' @return A numeric vector of length equal to the size of the original data #' set containing GLOSH values for all data points. #' @author Matt Piekenbrock #' #' @references Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Joerg #' Sander. Hierarchical density estimates for data clustering, visualization, #' and outlier detection. _ACM Transactions on Knowledge Discovery from Data #' (TKDD)_ 10, no. 1 (2015). 
#' \doi{10.1145/2733381} #' @keywords model #' @examples #' set.seed(665544) #' n <- 100 #' x <- cbind( #'   x=runif(10, 0, 5) + rnorm(n, sd = 0.4), #'   y=runif(10, 0, 5) + rnorm(n, sd = 0.4) #' ) #' #' ### calculate GLOSH score #' glosh <- glosh(x, k = 3) #' #' ### distribution of outlier scores #' summary(glosh) #' hist(glosh, breaks = 10) #' #' ### simple plot function; point size is proportional to the GLOSH score #' plot_glosh <- function(x, glosh){ #'   plot(x, pch = ".", main = "GLOSH (k = 3)") #'   points(x, cex = glosh*3, pch = 1, col = "red") #'   text(x[glosh > 0.80, ], labels = round(glosh, 3)[glosh > 0.80], pos = 3) #' } #' plot_glosh(x, glosh) #' #' ### GLOSH with any hierarchy #' x_dist <- dist(x) #' x_sl <- hclust(x_dist, method = "single") #' x_upgma <- hclust(x_dist, method = "average") #' x_ward <- hclust(x_dist, method = "ward.D2") #' #' ## Compare what different linkage criteria consider as outliers #' glosh_sl <- glosh(x_sl, k = 3) #' plot_glosh(x, glosh_sl) #' #' glosh_upgma <- glosh(x_upgma, k = 3) #' plot_glosh(x, glosh_upgma) #' #' glosh_ward <- glosh(x_ward, k = 3) #' plot_glosh(x, glosh_ward) #' #' ## GLOSH is automatically computed with HDBSCAN #' all(hdbscan(x, minPts = 3)$outlier_scores == glosh(x, k = 3)) #' @export glosh <- function(x, k = 4, ...) { if (inherits(x, "data.frame")) x <- as.matrix(x) # get n if (inherits(x, "dist") || inherits(x, "matrix")) { if (inherits(x, "dist")) n <- attr(x, "Size") else n <- nrow(x) # get k nearest neighbors + distances d <- kNN(x, k - 1, ...) x_dist <- if (inherits(x, "dist")) x else dist(x, method = "euclidean") # copy since mrd changes by reference! 
.check_dist(x_dist) mrd <- mrd(x_dist, d$dist[, k - 1]) # need to assemble hclust object manually mst <- mst(mrd, n) hc <- hclustMergeOrder(mst, order(mst[, 3])) } else if (inherits(x, "hclust")) { hc <- x n <- nrow(hc$merge) + 1 } else stop("x needs to be a matrix, dist, or hclust object!") if (k < 2 || k >= n) stop("k has to be larger than 1 and smaller than the number of points") res <- computeStability(hc, k, compute_glosh = TRUE) # return attr(res, "glosh") } ================================================ FILE: R/LOF.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Local Outlier Factor Score #' #' Calculate the Local Outlier Factor (LOF) score for each data point using a #' kd-tree to speed up kNN search. #' #' LOF compares the local reachability density (lrd) of a point to the lrd of #' its neighbors. A LOF score of approximately 1 indicates that the lrd around #' the point is comparable to the lrd of its neighbors and that the point is #' not an outlier. 
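The glosh() code above builds its single-linkage hierarchy on the mutual reachability distance (computed in C++ by mrd()). A plain-R sketch of that distance can make the idea concrete; mrd_matrix() and its exact core-distance convention (distance to the k-th nearest neighbor) are illustrative assumptions, not the package's API:

```r
# Mutual reachability distance, as used by glosh()/hdbscan():
#   mrd(a, b) = max(core_k(a), core_k(b), d(a, b))
# where core_k(a) is the distance from a to its k-th nearest neighbor.
# Plain-R illustration only; the package computes this in C++.
mrd_matrix <- function(x, k = 4) {
  d <- as.matrix(dist(x))
  # core distance: k-th nearest neighbor is at position k + 1 in the
  # sorted row because the row also contains the 0 self-distance
  core <- apply(d, 1, function(row) sort(row)[k + 1])
  # lift each pairwise distance to at least both core distances
  m <- pmax(d, outer(core, core, pmax))
  diag(m) <- 0
  m
}
```

Because every entry is at least both points' core distances, a single-linkage tree over this matrix separates density levels, which is what the stability computation in glosh() relies on.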
Points that have a substantially lower lrd than their #' neighbors are considered outliers and produce scores significantly larger #' than 1. #' #' If a data matrix is specified, then Euclidean distances and fast nearest #' neighbor search using a kd-tree are used. #' #' **Note on duplicate points:** If there are more than `minPts` #' duplicates of a point in the data, then the local reachability distance #' will be 0, resulting in an undefined LOF score of 0/0. We set LOF in this #' case to 1 since there is already enough density from the points in the same #' location to make them not outliers. The original paper by Breunig et al #' (2000) assumes that the points are real duplicates and suggests removing #' the duplicates before computing LOF. If duplicate points are removed first, #' then this LOF implementation in \pkg{dbscan} behaves like the one described #' by Breunig et al. #' #' @aliases lof LOF #' @family Outlier Detection Functions #' #' @param x a data matrix or a [dist] object. #' @param minPts number of nearest neighbors used in defining the local #' neighborhood of a point (includes the point itself). #' @param ... further arguments are passed on to [kNN()]. #' Note: `sort` cannot be specified here since `lof()` #' always uses `sort = TRUE`. #' #' @return A numeric vector of length `nrow(x)` containing LOF values for #' all data points. #' #' @author Michael Hahsler #' @references Breunig, M., Kriegel, H., Ng, R., and Sander, J. (2000). LOF: #' identifying density-based local outliers. In _ACM Int. Conf. on #' Management of Data,_ pages 93-104. 
#' \doi{10.1145/335191.335388} #' @keywords model #' @examples #' set.seed(665544) #' n <- 100 #' x <- cbind( #'   x=runif(10, 0, 5) + rnorm(n, sd = 0.4), #'   y=runif(10, 0, 5) + rnorm(n, sd = 0.4) #' ) #' #' ### calculate LOF score with a neighborhood of 3 points #' lof <- lof(x, minPts = 3) #' #' ### distribution of outlier factors #' summary(lof) #' hist(lof, breaks = 10, main = "LOF (minPts = 3)") #' #' ### plot sorted lof. Looks like outliers start around a LOF of 2. #' plot(sort(lof), type = "l", main = "LOF (minPts = 3)", #'   xlab = "Points sorted by LOF", ylab = "LOF") #' #' ### point size is proportional to LOF and mark points with a LOF > 2 #' plot(x, pch = ".", main = "LOF (minPts = 3)", asp = 1) #' points(x, cex = (lof - 1) * 2, pch = 1, col = "red") #' text(x[lof > 2,], labels = round(lof, 1)[lof > 2], pos = 3) #' @export lof <- function(x, minPts = 5, ...) { ### parse extra parameters extra <- list(...) # check for deprecated k if (!is.null(extra[["k"]])) { minPts <- extra[["k"]] + 1 extra[["k"]] <- NULL warning("lof: k is now deprecated. 
use minPts = ", minPts, " instead.") } args <- c("search", "bucketSize", "splitRule", "approx") m <- pmatch(names(extra), args) if (anyNA(m)) stop("Unknown parameter: ", toString(names(extra)[is.na(m)])) names(extra) <- args[m] search <- extra$search %||% "kdtree" search <- .parse_search(search) splitRule <- extra$splitRule %||% "suggest" splitRule <- .parse_splitRule(splitRule) bucketSize <- if (is.null(extra$bucketSize)) 10L else as.integer(extra$bucketSize) approx <- if (is.null(extra$approx)) 0 else as.double(extra$approx) ### precompute distance matrix for dist search if (search == 3 && !inherits(x, "dist")) { if (.matrixlike(x)) x <- dist(x) else stop("x needs to be a matrix to calculate distances") } # get and check n if (inherits(x, "dist")) n <- attr(x, "Size") else n <- nrow(x) if (is.null(n)) stop("x needs to be a matrix or a dist object!") if (minPts < 2 || minPts > n) stop("minPts has to be at least 2 and not larger than the number of points") ### get LOF from a dist object if (inherits(x, "dist")) { if (anyNA(x)) stop("NAs not allowed in dist for LOF!") # find k-NN distance, ids and distances x <- as.matrix(x) diag(x) <- Inf ### no self-matches o <- t(apply(x, 1, order, decreasing = FALSE)) k_dist <- x[cbind(o[, minPts - 1], seq_len(n))] ids <- lapply( seq_len(n), FUN = function(i) which(x[i,] <= k_dist[i]) ) dist <- lapply( seq_len(n), FUN = function(i) x[i, x[i,] <= k_dist[i]] ) ret <- list(k_dist = k_dist, ids = ids, dist = dist) } else { ### Use kd-tree if (anyNA(x)) stop("NAs not allowed for LOF using kdtree!") ret <- lof_kNN( as.matrix(x), as.integer(minPts), as.integer(search), as.integer(bucketSize), as.integer(splitRule), as.double(approx) ) } # calculate local reachability density (LRD) # reachability-distance_k(A,B) = max{k-distance(B), d(A,B)} # lrd_k(A) = 1/(sum_B \in N_k(A) reachability-distance_k(A, B) / |N_k(A)|) lrd <- numeric(n) for (A in seq_len(n)) { Bs <- ret$ids[[A]] lrd[A] <- 1 / (sum(pmax.int(ret$k_dist[Bs], ret$dist[[A]])) / 
length(Bs)) } # calculate local outlier factor (LOF) # LOF_k(A) = sum_B \in N_k(A) lrd_k(B)/(|N_k(A)| lrd_k(A)) lof <- numeric(n) for (A in seq_len(n)) { Bs <- ret$ids[[A]] lof[A] <- sum(lrd[Bs]) / length(Bs) / lrd[A] } # with more than k duplicates lrd can become infinity # we define them not to be outliers lof[is.nan(lof)] <- 1 lof } ================================================ FILE: R/NN.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' NN --- Nearest Neighbors Superclass #' #' NN is an abstract S3 superclass for the classes of the objects returned #' by [kNN()], [frNN()] and [sNN()]. Methods for sorting, plotting and getting an #' adjacency list are defined. #' #' @name NN #' @aliases NN #' @family NN functions #' #' @param x an `NN` object #' @param pch plotting character. #' @param col color used for the data points (nodes). #' @param linecol color used for edges. #' @param ... further parameters passed on to [plot()]. #' @param decreasing sort in decreasing order? 
#' @param data the data that was used to create `x` #' @param main title #' #' @section Subclasses: #' [kNN], [frNN] and [sNN] #' #' @author Michael Hahsler #' @keywords model #' @examples #' data(iris) #' x <- iris[, -5] #' #' # finding kNN directly in data (using a kd-tree) #' nn <- kNN(x, k=5) #' nn #' #' # plot the kNN where NN are shown as lines connecting points. #' plot(nn, x) #' #' # show the first few elements of the adjacency list #' head(adjacencylist(nn)) #' #' \dontrun{ #' # create a graph and find connected components (if igraph is installed) #' library("igraph") #' g <- graph_from_adj_list(adjacencylist(nn)) #' comp <- components(g) #' plot(x, col = comp$membership) #' #' # detect clusters (communities) with the label propagation algorithm #' cl <- membership(cluster_label_prop(g)) #' plot(x, col = cl) #' } NULL #' @rdname NN #' @export adjacencylist <- function (x, ...) UseMethod("adjacencylist", x) #' @rdname NN #' @export adjacencylist.NN <- function (x, ...) { stop("needs to be implemented by a subclass") } #' @rdname NN #' @export sort.NN <- function(x, decreasing = FALSE, ...) { stop("needs to be implemented by a subclass") } #' @rdname NN #' @export plot.NN <- function(x, data, main = NULL, pch = 16, col = NULL, linecol = "gray", ...) { if (is.null(main)) { if (inherits(x, "frNN")) main <- paste0("frNN graph (eps = ", x$eps, ")") if (inherits(x, "kNN")) main <- paste0(x$k, "-NN graph") if (inherits(x, "sNN")) main <- paste0("Shared NN graph (k=", x$k, ifelse(is.null(x$kt), "", paste0(", kt=", x$kt)), ")") } ## create an empty plot plot(data[, 1:2], main = main, type = "n", pch = pch, col = col, ...) id <- adjacencylist(x) ## use lines if it is from the same data ## FIXME: this test is not perfect, maybe we should have a parameter here or add the query points... if (length(id) == nrow(data)) { for (i in seq_along(id)) { for (j in seq_along(id[[i]])) lines(x = c(data[i, 1], data[id[[i]][j], 1]), y = c(data[i, 2], data[id[[i]][j], 2]), col = linecol, ...) 
} ## add vertices points(data[, 1:2], main = main, pch = pch, col = col, ...) } else { ## add vertices points(data[, 1:2], main = main, pch = pch, ...) ## use colors if it was from a query for (i in seq_along(id)) { points(data[id[[i]], ], pch = pch, col = i + 1L) } } } ================================================ FILE: R/RcppExports.R ================================================ # Generated by using Rcpp::compileAttributes() -> do not edit by hand # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 JP_int <- function(nn, kt) { .Call(`_dbscan_JP_int`, nn, kt) } SNN_sim_int <- function(nn, jp) { .Call(`_dbscan_SNN_sim_int`, nn, jp) } ANN_cleanup <- function() { invisible(.Call(`_dbscan_ANN_cleanup`)) } comps_kNN <- function(nn, mutual) { .Call(`_dbscan_comps_kNN`, nn, mutual) } comps_frNN <- function(nn, mutual) { .Call(`_dbscan_comps_frNN`, nn, mutual) } intToStr <- function(iv) { .Call(`_dbscan_intToStr`, iv) } dist_subset <- function(dist, idx) { .Call(`_dbscan_dist_subset`, dist, idx) } XOR <- function(lhs, rhs) { .Call(`_dbscan_XOR`, lhs, rhs) } dspc <- function(cl_idx, internal_nodes, all_cl_ids, mrd_dist) { .Call(`_dbscan_dspc`, cl_idx, internal_nodes, all_cl_ids, mrd_dist) } dbscan_int <- function(data, eps, minPts, weights, borderPoints, type, bucketSize, splitRule, approx, frNN) { .Call(`_dbscan_dbscan_int`, data, eps, minPts, weights, borderPoints, type, bucketSize, splitRule, approx, frNN) } reach_to_dendrogram <- function(reachability, pl_order) { .Call(`_dbscan_reach_to_dendrogram`, reachability, pl_order) } dendrogram_to_reach <- function(x) { .Call(`_dbscan_dendrogram_to_reach`, x) } mst_to_dendrogram <- function(mst) { .Call(`_dbscan_mst_to_dendrogram`, mst) } dbscan_density_int <- function(data, eps, type, bucketSize, splitRule, approx) { .Call(`_dbscan_dbscan_density_int`, data, eps, type, bucketSize, splitRule, approx) } frNN_int <- function(data, eps, type, bucketSize, splitRule, approx) { .Call(`_dbscan_frNN_int`, data, eps, type, 
bucketSize, splitRule, approx) } frNN_query_int <- function(data, query, eps, type, bucketSize, splitRule, approx) { .Call(`_dbscan_frNN_query_int`, data, query, eps, type, bucketSize, splitRule, approx) } distToAdjacency <- function(constraints, N) { .Call(`_dbscan_distToAdjacency`, constraints, N) } buildDendrogram <- function(hcl) { .Call(`_dbscan_buildDendrogram`, hcl) } all_children <- function(hier, key, leaves_only = FALSE) { .Call(`_dbscan_all_children`, hier, key, leaves_only) } node_xy <- function(cl_tree, cl_hierarchy, cid = 0L) { .Call(`_dbscan_node_xy`, cl_tree, cl_hierarchy, cid) } simplifiedTree <- function(cl_tree) { .Call(`_dbscan_simplifiedTree`, cl_tree) } computeStability <- function(hcl, minPts, compute_glosh = FALSE) { .Call(`_dbscan_computeStability`, hcl, minPts, compute_glosh) } validateConstraintList <- function(constraints, n) { .Call(`_dbscan_validateConstraintList`, constraints, n) } computeVirtualNode <- function(noise, constraints) { .Call(`_dbscan_computeVirtualNode`, noise, constraints) } fosc <- function(cl_tree, cid, sc, cl_hierarchy, prune_unstable_leaves = FALSE, cluster_selection_epsilon = 0.0, alpha = 0, useVirtual = FALSE, n_constraints = 0L, constraints = NULL) { .Call(`_dbscan_fosc`, cl_tree, cid, sc, cl_hierarchy, prune_unstable_leaves, cluster_selection_epsilon, alpha, useVirtual, n_constraints, constraints) } extractUnsupervised <- function(cl_tree, prune_unstable = FALSE, cluster_selection_epsilon = 0.0) { .Call(`_dbscan_extractUnsupervised`, cl_tree, prune_unstable, cluster_selection_epsilon) } extractSemiSupervised <- function(cl_tree, constraints, alpha = 0, prune_unstable_leaves = FALSE, cluster_selection_epsilon = 0.0) { .Call(`_dbscan_extractSemiSupervised`, cl_tree, constraints, alpha, prune_unstable_leaves, cluster_selection_epsilon) } kNN_query_int <- function(data, query, k, type, bucketSize, splitRule, approx) { .Call(`_dbscan_kNN_query_int`, data, query, k, type, bucketSize, splitRule, approx) } kNN_int <- 
function(data, k, type, bucketSize, splitRule, approx) { .Call(`_dbscan_kNN_int`, data, k, type, bucketSize, splitRule, approx) } lof_kNN <- function(data, minPts, type, bucketSize, splitRule, approx) { .Call(`_dbscan_lof_kNN`, data, minPts, type, bucketSize, splitRule, approx) } mrd <- function(dm, cd) { .Call(`_dbscan_mrd`, dm, cd) } mst <- function(x_dist, n) { .Call(`_dbscan_mst`, x_dist, n) } hclustMergeOrder <- function(mst, o) { .Call(`_dbscan_hclustMergeOrder`, mst, o) } optics_int <- function(data, eps, minPts, type, bucketSize, splitRule, approx, frNN) { .Call(`_dbscan_optics_int`, data, eps, minPts, type, bucketSize, splitRule, approx, frNN) } lowerTri <- function(m) { .Call(`_dbscan_lowerTri`, m) } ================================================ FILE: R/broom-dbscan-tidiers.R ================================================ #' Turn a dbscan clustering object into a tidy tibble #' #' Provides [tidy()][generics::tidy()], [augment()][generics::augment()], and #' [glance()][generics::glance()] verbs for clusterings created with algorithms #' in package `dbscan` to work with [tidymodels](https://www.tidymodels.org/). #' #' @param x A `dbscan` object returned from [dbscan::dbscan()]. #' @param data The data used to create the clustering. #' @param newdata New data to predict cluster labels for. #' @param ... further arguments are ignored without a warning. 
#' #' @name dbscan_tidiers #' @aliases dbscan_tidiers glance tidy augment #' @family tidiers #' #' @seealso [generics::tidy()], [generics::augment()], #' [generics::glance()], [dbscan()] #' #' @examplesIf requireNamespace("tibble", quietly = TRUE) && identical(Sys.getenv("NOT_CRAN"), "true") #' #' data(iris) #' x <- scale(iris[, 1:4]) #' #' ## dbscan #' db <- dbscan(x, eps = .9, minPts = 5) #' db #' #' # summarize model fit with tidiers #' tidy(db) #' glance(db) #' #' # augment for this model needs the original data #' augment(db, x) #' #' # to augment new data, the original data is also needed #' augment(db, x, newdata = x[1:5, ]) #' #' ## hdbscan #' hdb <- hdbscan(x, minPts = 5) #' #' # summarize model fit with tidiers #' tidy(hdb) #' glance(hdb) #' #' # augment for this model needs the original data #' augment(hdb, x) #' #' # to augment new data, the original data is also needed #' augment(hdb, x, newdata = x[1:5, ]) #' #' ## Jarvis-Patrick clustering #' cl <- jpclust(x, k = 20, kt = 15) #' #' # summarize model fit with tidiers #' tidy(cl) #' glance(cl) #' #' # augment for this model needs the original data #' augment(cl, x) #' #' ## Shared Nearest Neighbor clustering #' cl <- sNNclust(x, k = 20, eps = 0.8, minPts = 15) #' #' # summarize model fit with tidiers #' tidy(cl) #' glance(cl) #' #' # augment for this model needs the original data #' augment(cl, x) #' NULL #' @rdname dbscan_tidiers #' @importFrom generics tidy #' @export generics::tidy #' @rdname dbscan_tidiers #' @export tidy.dbscan <- function(x, ...) { n_cl <- max(x$cluster) size <- table(factor(x$cluster, levels = 0:n_cl)) tb <- tibble::tibble(cluster = as.factor(0:n_cl), size = as.integer(size)) tb$noise <- tb$cluster == 0L tb } #' @rdname dbscan_tidiers #' @export tidy.hdbscan <- function(x, ...) 
{ n_cl <- max(x$cluster) size <- table(factor(x$cluster, levels = 0:n_cl)) tb <- tibble::tibble(cluster = as.factor(0:n_cl), size = as.integer(size)) tb$cluster_score <- as.numeric(x$cluster_scores[as.character(tb$cluster)]) tb$noise <- tb$cluster == 0L tb } #' @rdname dbscan_tidiers #' @export tidy.general_clustering <- function(x, ...) { n_cl <- max(x$cluster) size <- table(factor(x$cluster, levels = 0:n_cl)) tb <- tibble::tibble(cluster = as.factor(0:n_cl), size = as.integer(size)) tb$noise <- tb$cluster == 0L tb } ## augment #' @importFrom generics augment #' @rdname dbscan_tidiers #' @export generics::augment #' @rdname dbscan_tidiers #' @export augment.dbscan <- function(x, data = NULL, newdata = NULL, ...) { n_cl <- max(x$cluster) if (is.null(data) && is.null(newdata)) stop("Must specify either `data` or `newdata` argument.") if (is.null(data) || nrow(data) != length(x$cluster)) { stop("The original data needs to be passed as data.") } if (is.null(newdata)) { tb <- tibble::as_tibble(data) tb$.cluster <- factor(x$cluster, levels = 0:n_cl) } else { tb <- tibble::as_tibble(newdata) tb$.cluster <- factor(predict(x, newdata = newdata, data = data), levels = 0:n_cl) } tb$noise <- tb$.cluster == 0L tb } #' @rdname dbscan_tidiers #' @export augment.hdbscan <- function(x, data = NULL, newdata = NULL, ...) 
{ n_cl <- max(x$cluster) if (is.null(data) || nrow(data) != length(x$cluster)) { stop("The original data needs to be passed as data.") } if (is.null(newdata)) { tb <- tibble::as_tibble(data) tb$.cluster <- factor(x$cluster, levels = 0:n_cl) tb$.coredist <- x$coredist tb$.membership_prob <- x$membership_prob tb$.outlier_scores <- x$outlier_scores } else { tb <- tibble::as_tibble(newdata) tb$.cluster <- factor( predict(x, newdata = newdata, data = data), levels = 0:n_cl) tb$.coredist <- NA_real_ tb$.membership_prob <- NA_real_ tb$.outlier_scores <- NA_real_ } tb } #' @rdname dbscan_tidiers #' @export augment.general_clustering <- function(x, data = NULL, newdata = NULL, ...) { n_cl <- max(x$cluster) if (is.null(data) || nrow(data) != length(x$cluster)) { stop("The original data needs to be passed as data.") } if (is.null(newdata)) { tb <- tibble::as_tibble(data) tb$.cluster <- factor(x$cluster, levels = 0:n_cl) } else { stop("augmenting new data is not supported.") } tb } ## glance #' @importFrom generics glance #' @rdname dbscan_tidiers #' @export generics::glance #' @rdname dbscan_tidiers #' @export glance.dbscan <- function(x, ...) { tibble::tibble( nobs = length(x$cluster), n.clusters = length(table(x$cluster[x$cluster != 0L])), nexcluded = sum(x$cluster == 0L) ) } #' @rdname dbscan_tidiers #' @export glance.hdbscan <- function(x, ...) { tibble::tibble( nobs = length(x$cluster), n.clusters = length(table(x$cluster[x$cluster != 0L])), nexcluded = sum(x$cluster == 0L) ) } #' @rdname dbscan_tidiers #' @export glance.general_clustering <- function(x, ...) 
{ tibble::tibble( nobs = length(x$cluster), n.clusters = length(table(x$cluster[x$cluster != 0L])), nexcluded = sum(x$cluster == 0L) ) } ================================================ FILE: R/comps.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2017 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Find Connected Components in a Nearest-neighbor Graph #' #' Generic function and methods to find connected components in nearest neighbor graphs. #' #' Note that for kNN graphs, one point may be in the kNN of the other but not vice versa. #' `mutual = TRUE` requires that both points are in each other's kNN. #' #' @family NN functions #' @aliases components #' #' @param x the [NN] object representing the graph or a [dist] object #' @param eps threshold on the distance #' @param mutual for a pair of points, do both have to be in each other's neighborhood? #' @param ... further arguments are currently unused. #' #' @return an integer vector with component assignments. 
#' #' @author Michael Hahsler #' @keywords model #' @examples #' set.seed(665544) #' n <- 100 #' x <- cbind( #' x=runif(10, 0, 5) + rnorm(n, sd = 0.4), #' y=runif(10, 0, 5) + rnorm(n, sd = 0.4) #' ) #' plot(x, pch = 16) #' #' # Connected components on a graph where each pair of points #' # with a distance less or equal to eps are connected #' d <- dist(x) #' components <- comps(d, eps = .8) #' plot(x, col = components, pch = 16) #' #' # Connected components in a fixed radius nearest neighbor graph #' # Gives the same result as the threshold on the distances above #' frnn <- frNN(x, eps = .8) #' components <- comps(frnn) #' plot(frnn, data = x, col = components) #' #' # Connected components on a k nearest neighbors graph #' knn <- kNN(x, 3) #' components <- comps(knn, mutual = FALSE) #' plot(knn, data = x, col = components) #' #' components <- comps(knn, mutual = TRUE) #' plot(knn, data = x, col = components) #' #' # Connected components in a shared nearest neighbor graph #' snn <- sNN(x, k = 10, kt = 5) #' components <- comps(snn) #' plot(snn, data = x, col = components) #' @export comps <- function(x, ...) UseMethod("comps", x) #' @rdname comps #' @export comps.dist <- function(x, eps, ...) stats::cutree(stats::hclust(x, method = "single"), h = eps) #' @rdname comps #' @export comps.kNN <- function(x, mutual = FALSE, ...) as.integer(factor(comps_kNN(x$id, as.logical(mutual)))) # sNN and frNN are symmetric so no need for mutual #' @rdname comps #' @export comps.sNN <- function(x, ...) comps.kNN(x, mutual = FALSE) #' @rdname comps #' @export comps.frNN <- function(x, ...) 
comps_frNN(x$id, mutual = FALSE) ================================================ FILE: R/dbcv.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2024 Michael Hahsler, Matt Piekenbrock # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Density-Based Clustering Validation Index (DBCV) #' #' Calculate the Density-Based Clustering Validation Index (DBCV) for a #' clustering. #' #' DBCV (Moulavi et al, 2014) computes a score based on the density sparseness of each cluster #' and the density separation of each pair of clusters. #' #' The density sparseness of a cluster (DSC) is defined as the maximum edge weight of #' a minimal spanning tree for the internal points of the cluster using the mutual #' reachability distance based on the all-points-core-distance. Internal points #' are connected to more than one other point in the cluster. Since clusters of #' a size less than 3 cannot have internal points, they are ignored (considered #' noise) in this implementation. #' #' The density separation of a pair of clusters (DSPC) #' is defined as the minimum reachability distance between the internal nodes of #' the spanning trees of the two clusters. 
#' #' The validity index for a cluster is calculated using these measures and aggregated #' to a validity index for the whole clustering using a weighted average. #' #' The index is in the range \eqn{[-1,1]}. If the cluster density compactness is better #' than the density separation, a positive value is returned. The actual value depends #' on the separability of the data. In general, greater values #' of the measure indicate a better density-based clustering solution. #' #' Noise points are included in the calculation only in the weighted average, #' therefore a clustering with more noise points will get a lower index. #' #' **Performance note:** This implementation calculates a distance matrix and thus #' can only be used for small or sampled datasets. #' #' @aliases dbcv DBCV #' @family Evaluation Functions #' #' @param x a data matrix or a dist object. #' @param cl a clustering (e.g., an integer vector) #' @param d dimensionality of the original data if a dist object is provided. #' @param metric distance metric used. The available metrics are the methods #' implemented by `dist()` plus `"sqeuclidean"` for the squared #' Euclidean distance used in the original DBCV implementation. #' @param sample sample size used for large datasets. #' #' @return A list with the DBCV `score` for the clustering, #' the density sparseness of cluster (`dsc`) values, #' the density separation of pairs of clusters (`dspc`) distances, #' and the validity indices of clusters (`v_c`). #' #' @author Matt Piekenbrock and Michael Hahsler #' @references Davoud Moulavi and Pablo A. Jaskowiak and #' Ricardo J. G. B. Campello and Arthur Zimek and Jörg Sander (2014). #' Density-Based Clustering Validation. In #' _Proceedings of the 2014 SIAM International Conference on Data Mining,_ #' pages 839-847 #' \doi{10.1137/1.9781611973440.96} #' #' Pablo A. Jaskowiak (2022). MATLAB implementation of DBCV. 
#' \url{https://github.com/pajaskowiak/dbcv} #' @examples #' # Load a test dataset #' data(Dataset_1) #' x <- Dataset_1[, c("x", "y")] #' class <- Dataset_1$class #' #' clplot(x, class) #' #' # We use MinPts 3 and use the knee at eps = .1 for dbscan #' kNNdistplot(x, minPts = 3) #' #' cl <- dbscan(x, eps = .1, minPts = 3) #' clplot(x, cl) #' #' dbcv(x, cl) #' #' # compare to the DBCV index on the original class labels and #' # with a random partitioning #' dbcv(x, class) #' dbcv(x, sample(1:4, replace = TRUE, size = nrow(x))) #' #' # find the best eps using dbcv #' eps_grid <- seq(.05,.2, by = .01) #' cls <- lapply(eps_grid, FUN = function(e) dbscan(x, eps = e, minPts = 3)) #' dbcvs <- sapply(cls, FUN = function(cl) dbcv(x, cl)$score) #' #' plot(eps_grid, dbcvs, type = "l") #' #' eps_opt <- eps_grid[which.max(dbcvs)] #' eps_opt #' #' cl <- dbscan(x, eps = eps_opt, minPts = 3) #' clplot(x, cl) #' @export dbcv <- function(x, cl, d, metric = "euclidean", sample = NULL) { # a clustering with a cluster element if (is.list(cl)) { cl <- cl$cluster } if (inherits(x, "dist")) { xdist <- x if (missing(d)) stop("d needs to be specified if a distance matrix is supplied!") } else if (.matrixlike(x)) { if (!is.null(sample)) { take <- sample(nrow(x), size = sample) x <- x[take, ] cl <- cl[take] } x <- as.matrix(x) if (!missing(d) && d != ncol(x)) stop("d does not match the number of columns in x!") d <- ncol(x) if (pmatch(metric, "sqeuclidean", nomatch = 0)) xdist <- dist(x, method = "euclidean")^2 else xdist <- dist(x, method = metric) } else stop("'dbcv' expects x needs to be a matrix to calculate distances.") .check_dist(xdist) n <- attr(xdist, "Size") # in case we get a factor cl <- as.integer(cl) if (length(cl) != n) stop("cl does not match the number of rows in x!") ## calculate everything for all non-noise points ordered by cluster ## getClusterIdList removes noise points and singleton clusters ## and returns indices reorder by cluster cl_idx_list <- getClusterIdList(cl) 
n_cl <- length(cl_idx_list) ## reordered distances w/o noise all_dist <- dist_subset(xdist, unlist(cl_idx_list)) new_cl_idx_list <- list() i <- 1L start <- 1 for(l in lengths(cl_idx_list)) { end <- start + l - 1 new_cl_idx_list[[i]] <- seq(start, end) start <- end + 1 i <- i + 1L } cl_idx_list <- new_cl_idx_list all_idx <- unlist(cl_idx_list) ## 1. Calculate all-points-core-distance ## Calculate the all-points-core-distance for each point, within each cluster ## Note: this needs the dimensionality of the data d all_pts_core_dist <- unlist(lapply( cl_idx_list, FUN = function(ids) { dists <- (rowSums(as.matrix(( 1 / dist_subset(all_dist, ids) )^d)) / (length(ids) - 1))^(-1 / d) } )) ## 2. Create for each cluster a mutual reachability MSTs all_mrd <- structure(mrd(all_dist, all_pts_core_dist), class = "dist", Size = length(all_idx)) ## Noise points are removed, but the index is affected by dividing by the ## total number of objects including the noise points (n)! ## mst is a matrix with columns: from to and weight mrd_graphs <- lapply(cl_idx_list, function(idx) { mst(x_dist = dist_subset(all_mrd, idx), n = length(idx)) }) ## 3. Density Sparseness of a Cluster (DSC): ## The maximum edge weight of the internal edges in the cluster's ## mutual reachability MST. ## find internal nodes for DSC and DSPC. Internal nodes have a degree > 1 internal_nodes <- lapply(mrd_graphs, function(mst) { node_deg <- table(c(mst[, 1], mst[, 2])) idx <- as.integer(names(node_deg)[node_deg > 1]) idx }) dsc <- mapply(function(mst, int_idx) { # find internal edges int_edge_idx <- which((mst[, 1L] %in% int_idx) & (mst[, 2L] %in% int_idx)) if (length(int_edge_idx) == 0L) { return(max(mst[, 3L])) } max(mst[int_edge_idx, 3L]) }, mrd_graphs, internal_nodes) ## 4. 
Density Separation of a Pair of Clusters (DSPC): ## The minimum reachability distance between the internal nodes of the ## MST_MRDs of a pair of clusters Ci and Cj dspc_dist <- dspc(cl_idx_list, internal_nodes, all_idx, all_mrd) # returns a matrix with Ci, Cj, dist # make it into a full distance matrix dspc_dist <- dspc_dist[, 3L] class(dspc_dist) <- "dist" attr(dspc_dist, "Size") <- n_cl attr(dspc_dist, "Diag") <- FALSE attr(dspc_dist, "Upper") <- FALSE dspc_mm <- as.matrix(dspc_dist) diag(dspc_mm) <- NA ## 5. Validity index of a cluster: min_separation <- apply(dspc_mm, MARGIN = 1, min, na.rm = TRUE) v_c <- (min_separation - dsc) / pmax(min_separation, dsc) ## 6. Validity index for the whole clustering res <- sum(lengths(cl_idx_list) / n * v_c) return(list( score = res, n = n, n_c = lengths(cl_idx_list), d = d, dsc = dsc, dspc = dspc_dist, v_c = v_c )) } getClusterIdList <- function(cl) { ## In DBCV, singletons are ambiguously defined. However, they cannot be ## considered valid clusters, for reasons listed in section 4 of the ## original paper. ## Clusters with less than 3 points cannot have internal nodes, so we need to ## ignore them as well. ## To ensure coverage, they are assigned to the noise category. 
cl_freq <- table(cl) cl[cl %in% as.integer(names(which(cl_freq < 3)))] <- 0L if (all(cl == 0)) { return(0) } cl_ids <- unique(cl) # all cluster ids cl_valid <- cl_ids[cl_ids != 0] # valid cluster indices (non-noise) n_cl <- length(cl_valid) # number of clusters ## 1 or 0 clusters results in worst score + a warning if (n_cl <= 1) { warning("DBCV is undefined for less than 2 non-noise clusters with more than 2 member points.") return(-1L) } ## Indexes cl_ids_idx <- lapply(cl_valid, function(id) sort(which(cl == id))) ## the sort is important for indexing purposes return(cl_ids_idx) } ================================================ FILE: R/dbscan.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Density-based Spatial Clustering of Applications with Noise (DBSCAN) #' #' Fast reimplementation of the DBSCAN (Density-based spatial clustering of #' applications with noise) clustering algorithm using a kd-tree. #' #' The #' implementation is significantly faster and can work with larger data sets #' than [fpc::dbscan()] in \pkg{fpc}. 
Use `dbscan::dbscan()` (specifying the package) to #' call this implementation when you also load package \pkg{fpc}. #' #' **The algorithm** #' #' This implementation of DBSCAN follows the original #' algorithm as described by Ester et al (1996). DBSCAN performs the following steps: #' #' 1. Estimate the density #' around each data point by counting the number of points in a user-specified #' eps-neighborhood and applying a user-specified minPts threshold to identify #' - core points (points with more than minPts points in their neighborhood), #' - border points (non-core points with a core point in their neighborhood) and #' - noise points (all other points). #' 2. Core points form the backbone of clusters by joining them into #' a cluster if they are density-reachable from each other (i.e., there is a chain of core #' points where one falls inside the eps-neighborhood of the next). #' 3. Border points are assigned to clusters. The algorithm needs parameters #' `eps` (the radius of the epsilon neighborhood) and `minPts` (the #' density threshold). #' #' Border points are arbitrarily assigned to clusters in the original #' algorithm. DBSCAN* (see Campello et al 2013) treats all border points as #' noise points. This is implemented with `borderPoints = FALSE`. #' #' **Specifying the data** #' #' If `x` is a matrix or a data.frame, then fast fixed-radius nearest #' neighbor computation using a kd-tree is performed using Euclidean distance. #' See [frNN()] for more information on the parameters related to #' nearest neighbor search. **Note** that only numerical values are allowed in `x`. #' #' Any precomputed distance matrix (dist object) can be specified as `x`. #' You may run into memory issues since distance matrices are large. #' #' A precomputed frNN object can be supplied as `x`. In this case #' `eps` does not need to be specified. This option is useful for large #' data sets, where a sparse distance matrix is available. 
See #' [frNN()] for how to create frNN objects. #' #' **Setting parameters for DBSCAN** #' #' The parameters `minPts` and `eps` define the minimum density required #' in the area around core points which form the backbone of clusters. #' `minPts` is the number of points #' required in the neighborhood around the point defined by the parameter `eps` #' (i.e., the radius around the point). Both parameters #' depend on each other and changing one typically requires changing #' the other one as well. The parameters also depend on the size of the data set, with #' larger datasets requiring a larger `minPts` or a smaller `eps`. #' #' * `minPts:` The original #' DBSCAN paper (Ester et al, 1996) suggests starting by setting \eqn{\text{minPts} \ge d + 1}, #' the data dimensionality plus one or higher with a minimum of 3. Larger values #' are preferable since increasing the parameter suppresses more noise in the data #' by requiring more points to form clusters. #' Sander et al (1998) use two times the data dimensionality in their examples. #' Note that setting \eqn{\text{minPts} \le 2} is equivalent to hierarchical clustering #' with the single link metric and the dendrogram cut at height `eps`. #' #' * `eps:` A suitable neighborhood size #' parameter `eps` given a fixed value for `minPts` can be found #' visually by inspecting the [kNNdistplot()] of the data using #' \eqn{k = \text{minPts} - 1} (`minPts` includes the point itself, while the #' k-nearest neighbors distance does not). The k-nearest neighbor distance plot #' sorts all data points by their k-nearest neighbor distance. A sudden #' increase of the kNN distance (a knee) indicates that the points to the right #' are most likely outliers. Choose `eps` for DBSCAN where the knee is. #' #' **Predict cluster memberships** #' #' [predict()] can be used to predict cluster memberships for new data #' points. A point is considered a member of a cluster if it is within the eps #' neighborhood of a core point of the cluster.
Points #' which cannot be assigned to a cluster will be reported as #' noise points (i.e., cluster ID 0). #' **Important note:** `predict()` currently can only use Euclidean distance to determine #' the neighborhood of core points. If `dbscan()` was called using distances other than Euclidean, #' then the neighborhood calculation will not be correct and will only be approximated by Euclidean #' distances. If the data contain factor columns (e.g., using Gower's distance), then #' the factors in `data` and `query` first need to be converted to numeric to use the #' Euclidean approximation. #' #' #' @aliases dbscan DBSCAN print.dbscan_fast #' @family clustering functions #' #' @param x a data matrix, a data.frame, a [dist] object or a [frNN] object with #' fixed-radius nearest neighbors. #' @param eps size (radius) of the epsilon neighborhood. Can be omitted if #' `x` is a frNN object. #' @param minPts minimum number of points required in the eps neighborhood for #' core points (including the point itself). #' @param weights numeric; weights for the data points. Only needed to perform #' weighted clustering. #' @param borderPoints logical; should border points be assigned to clusters? #' The default is `TRUE` for regular DBSCAN. If `FALSE` then border #' points are considered noise (see DBSCAN* in Campello et al, 2013). #' @param ... additional arguments are passed on to the fixed-radius nearest #' neighbor search algorithm. See [frNN()] for details on how to #' control the search strategy. #' #' @return `dbscan()` returns an object of class `dbscan_fast` with the following components: #' #' \item{eps }{ value of the `eps` parameter.} #' \item{minPts }{ value of the `minPts` parameter.} #' \item{metric }{ used distance metric.} #' \item{cluster }{An integer vector with cluster assignments. Zero indicates noise points.} #' #' `is.corepoint()` returns a logical vector indicating for each data point if it is a #' core point.
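As a small sketch of the return values described above (assuming the dbscan package is attached; the slight perturbation of the query points is only for illustration):

```r
library(dbscan)

x <- as.matrix(iris[, 1:4])
res <- dbscan(x, eps = 0.7, minPts = 5)

## core points have at least minPts points (incl. themselves) within eps
core <- is.corepoint(x, eps = 0.7, minPts = 5)
table(core)

## new points get the cluster of a core point within eps, or 0 for noise
pred <- predict(res, x[1:3, ] + 0.01, data = x)
pred
```

Predicted labels are always either 0 (noise) or one of the cluster IDs found during clustering.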
#' #' @author Michael Hahsler #' @references Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast #' Density-Based Clustering with R. _Journal of Statistical Software,_ #' 91(1), 1-30. #' \doi{10.18637/jss.v091.i01} #' #' Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A #' Density-Based Algorithm for Discovering Clusters in Large Spatial Databases #' with Noise. Institute for Computer Science, University of Munich. #' _Proceedings of 2nd International Conference on Knowledge Discovery and #' Data Mining (KDD-96),_ 226-231. #' \url{https://dl.acm.org/doi/10.5555/3001460.3001507} #' #' Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). Density-Based #' Clustering Based on Hierarchical Density Estimates. Proceedings of the #' 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD #' 2013, _Lecture Notes in Computer Science_ 7819, p. 160. #' \doi{10.1007/978-3-642-37456-2_14} #' #' Sander, J., Ester, M., Kriegel, HP. et al. (1998). Density-Based #' Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. #' _Data Mining and Knowledge Discovery_ 2, 169-194. #' \doi{10.1023/A:1009745219419} #' #' @keywords model clustering #' @examples #' ## Example 1: use dbscan on the iris data set #' data(iris) #' iris <- as.matrix(iris[, 1:4]) #' #' ## Find suitable DBSCAN parameters: #' ## 1. We use minPts = dim + 1 = 5 for iris. A larger value can also be used. #' ## 2. 
We inspect the k-NN distance plot for k = minPts - 1 = 4 #' kNNdistplot(iris, minPts = 5) #' #' ## Noise seems to start around a 4-NN distance of .7 #' abline(h=.7, col = "red", lty = 2) #' #' ## Cluster with the chosen parameters #' res <- dbscan(iris, eps = .7, minPts = 5) #' res #' #' pairs(iris, col = res$cluster + 1L) #' clplot(iris, res) #' #' ## Use a precomputed frNN object #' fr <- frNN(iris, eps = .7) #' dbscan(fr, minPts = 5) #' #' ## Example 2: use data from fpc #' set.seed(665544) #' n <- 100 #' x <- cbind( #' x = runif(10, 0, 10) + rnorm(n, sd = 0.2), #' y = runif(10, 0, 10) + rnorm(n, sd = 0.2) #' ) #' #' res <- dbscan(x, eps = .3, minPts = 3) #' res #' #' ## plot clusters and add noise (cluster 0) as crosses. #' plot(x, col = res$cluster) #' points(x[res$cluster == 0, ], pch = 3, col = "grey") #' #' clplot(x, res) #' hullplot(x, res) #' #' ## Predict cluster membership for new data points #' ## (Note: 0 means it is predicted as noise) #' newdata <- x[1:5,] + rnorm(10, 0, .3) #' hullplot(x, res) #' points(newdata, pch = 3 , col = "red", lwd = 3) #' text(newdata, pos = 1) #' #' pred_label <- predict(res, newdata, data = x) #' pred_label #' points(newdata, col = pred_label + 1L, cex = 2, lwd = 2) #' #' ## Compare speed against fpc version (if microbenchmark is installed) #' ## Note: we use dbscan::dbscan to make sure that we do not run the #' ## implementation in fpc.
#' \dontrun{ #' if (requireNamespace("fpc", quietly = TRUE) && #' requireNamespace("microbenchmark", quietly = TRUE)) { #' t_dbscan <- microbenchmark::microbenchmark( #' dbscan::dbscan(x, .3, 3), times = 10, unit = "ms") #' t_dbscan_linear <- microbenchmark::microbenchmark( #' dbscan::dbscan(x, .3, 3, search = "linear"), times = 10, unit = "ms") #' t_dbscan_dist <- microbenchmark::microbenchmark( #' dbscan::dbscan(x, .3, 3, search = "dist"), times = 10, unit = "ms") #' t_fpc <- microbenchmark::microbenchmark( #' fpc::dbscan(x, .3, 3), times = 10, unit = "ms") #' #' r <- rbind(t_fpc, t_dbscan_dist, t_dbscan_linear, t_dbscan) #' r #' #' boxplot(r, #' names = c('fpc', 'dbscan (dist)', 'dbscan (linear)', 'dbscan (kdtree)'), #' main = "Runtime comparison in ms") #' #' ## speedup of the kd-tree-based version compared to the fpc implementation #' median(t_fpc$time) / median(t_dbscan$time) #' }} #' #' ## Example 3: manually create a frNN object for dbscan (dbscan only needs ids and eps) #' nn <- structure(list(id = list(c(2,3), c(1,3), c(1,2,3), c(3,5), c(4,5)), eps = 1), #' class = c("NN", "frNN")) #' nn #' dbscan(nn, minPts = 2) #' #' @export dbscan <- function(x, eps, minPts = 5, weights = NULL, borderPoints = TRUE, ...) { if (inherits(x, "frNN") && missing(eps)) { eps <- x$eps dist_method <- x$metric } if (inherits(x, "dist")) { .check_dist(x) dist_method <- attr(x, "method") } else dist_method <- "euclidean" dist_method <- dist_method %||% "unknown" ### extra contains settings for frNN ### search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0 ### also check for MinPts for fpc compatibility (does not work for ### search method dist) extra <- list(...) 
args <- c("MinPts", "search", "bucketSize", "splitRule", "approx") m <- pmatch(names(extra), args) if (anyNA(m)) stop("Unknown parameter: ", toString(names(extra)[is.na(m)])) names(extra) <- args[m] # fpc compatibility if (!is.null(extra$MinPts)) { warning("converting argument MinPts (fpc) to minPts (dbscan)!") minPts <- extra$MinPts extra$MinPts <- NULL } search <- .parse_search(extra$search %||% "kdtree") splitRule <- .parse_splitRule(extra$splitRule %||% "suggest") bucketSize <- as.integer(extra$bucketSize %||% 10L) approx <- as.integer(extra$approx %||% 0L) ### do dist search if (search == 3L && !inherits(x, "dist")) { if (.matrixlike(x)) x <- dist(x) else stop("x needs to be a matrix to calculate distances") } ## for dist we provide the R code with a frNN list and no x frNN <- list() if (inherits(x, "dist")) { frNN <- frNN(x, eps, ...)$id x <- matrix(0.0, nrow = 0, ncol = 0) } else if (inherits(x, "frNN")) { if (x$eps != eps) { eps <- x$eps warning("Using the eps of ", eps, " provided in the fixed-radius NN object.") } frNN <- x$id x <- matrix(0.0, nrow = 0, ncol = 0) } else { if (!.matrixlike(x)) stop("x needs to be a matrix or data.frame.") ## make sure x is numeric x <- as.matrix(x) if (storage.mode(x) == "integer") storage.mode(x) <- "double" if (storage.mode(x) != "double") stop("all data in x has to be numeric.") } if (length(frNN) == 0 && anyNA(x)) stop("data/distances cannot contain NAs for dbscan (with kd-tree)!") ## add self match and use C numbering if frNN is used if (length(frNN) > 0L) frNN <- lapply( seq_along(frNN), FUN = function(i) c(i - 1L, frNN[[i]] - 1L) ) if (length(minPts) != 1L || !is.finite(minPts) || minPts < 0) stop("minPts needs to be a single integer >= 0.") if (is.null(eps) || is.na(eps) || eps < 0) stop("eps needs to be >= 0.") ret <- dbscan_int( x, as.double(eps), as.integer(minPts), as.double(weights), as.integer(borderPoints), as.integer(search), as.integer(bucketSize), as.integer(splitRule), as.double(approx), frNN ) structure(
list( cluster = ret, eps = eps, minPts = minPts, metric = dist_method, borderPoints = borderPoints ), class = c("dbscan_fast", "dbscan") ) } #' @export print.dbscan_fast <- function(x, ...) { writeLines(c( paste0("DBSCAN clustering for ", nobs(x), " objects."), paste0("Parameters: eps = ", x$eps, ", minPts = ", x$minPts), paste0( "Using ", x$metric, " distances and borderpoints = ", x$borderPoints ), paste0( "The clustering contains ", ncluster(x), " cluster(s) and ", nnoise(x), " noise points." ) )) print(table(x$cluster)) cat("\n") writeLines(strwrap(paste0( "Available fields: ", toString(names(x)) ), exdent = 18)) } #' @rdname dbscan #' @export is.corepoint <- function(x, eps, minPts = 5, ...) lengths(frNN(x, eps = eps, ...)$id) >= (minPts - 1) ================================================ FILE: R/dendrogram.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler, Matt Piekenbrock # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Coercions to Dendrogram #' #' Provides a new generic function to coerce objects to dendrograms with #' [stats::as.dendrogram()] as the default.
Additional methods for #' [hclust], [hdbscan] and [reachability] objects are provided. #' #' Coercion methods for #' [hclust], [hdbscan] and [reachability] objects to [dendrogram] are provided. #' #' The coercion from `hclust` is a faster C++ reimplementation of the coercion in #' package `stats`. The original implementation can be called #' using [stats::as.dendrogram()]. #' #' The coercion from [hdbscan] builds the non-simplified HDBSCAN hierarchy as a #' dendrogram object. #' #' @name dendrogram #' @aliases dendrogram #' #' @param object the object #' @param ... further arguments NULL #' @rdname dendrogram #' @export as.dendrogram <- function (object, ...) { UseMethod("as.dendrogram", object) } #' @rdname dendrogram #' @export as.dendrogram.default <- function (object, ...) stats::as.dendrogram(object, ...) ## this is a replacement for stats::as.dendrogram for hclust #' @rdname dendrogram #' @export as.dendrogram.hclust <- function(object, ...) { return(buildDendrogram(object)) } #' @rdname dendrogram #' @export as.dendrogram.hdbscan <- function(object, ...) { return(buildDendrogram(object$hc)) } #' @rdname dendrogram #' @export as.dendrogram.reachability <- function(object, ...) { if (sum(is.infinite(object$reachdist)) > 1) stop( "Multiple Infinite reachability distances found. Reachability plots can only be converted if they contain enough information to fully represent the dendrogram structure. If using OPTICS, a larger eps value (such as Inf) may be needed in the parameterization."
) #dup_x <- object c_order <- order(object$reachdist) - 1 # dup_x$order <- dup_x$order - 1 #q_order <- sapply(c_order, function(i) which(dup_x$order == i)) res <- reach_to_dendrogram(object, c_order) # res <- dendrapply(res, function(leaf) { new_leaf <- leaf[[1]]; attributes(new_leaf) <- attributes(leaf); new_leaf }) # add mid points for plotting res <- .midcache.dendrogram(res) res } # calculate midpoints for dendrogram # from stats, but not exported # see stats:::midcache.dendrogram .midcache.dendrogram <- function(x, type = "hclust", quiet = FALSE) { type <- match.arg(type) stopifnot(inherits(x, "dendrogram")) verbose <- getOption("verbose", 0) >= 2 setmid <- function(d, type) { depth <- 0L kk <- integer() jj <- integer() dd <- list() repeat { if (!is.leaf(d)) { k <- length(d) if (k < 1) stop("dendrogram node with non-positive #{branches}") depth <- depth + 1L if (verbose) cat(sprintf(" depth(+)=%4d, k=%d\n", depth, k)) kk[depth] <- k if (storage.mode(jj) != storage.mode(kk)) storage.mode(jj) <- storage.mode(kk) dd[[depth]] <- d d <- d[[jj[depth] <- 1L]] next } while (depth) { k <- kk[depth] j <- jj[depth] r <- dd[[depth]] r[[j]] <- unclass(d) if (j < k) break depth <- depth - 1L if (verbose) cat(sprintf(" depth(-)=%4d, k=%d\n", depth, k)) midS <- sum(vapply(r, .midDend, 0)) if (!quiet && type == "hclust" && k != 2) warning("midcache() of non-binary dendrograms only partly implemented") attr(r, "midpoint") <- (.memberDend(r[[1L]]) + midS) / 2 d <- r } if (!depth) break dd[[depth]] <- r d <- r[[jj[depth] <- j + 1L]] } d } setmid(x, type = type) } .midDend <- function(x) { attr(x, "midpoint") %||% 0 } .memberDend <- function(x) { attr(x, "x.member") %||% attr(x, "members") %||% 1 } ================================================ FILE: R/extractFOSC.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # 
Copyright (C) 2015 Michael Hahsler, Matt Piekenbrock # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Framework for the Optimal Extraction of Clusters from Hierarchies #' #' Generic reimplementation of the _Framework for Optimal Selection of Clusters_ #' (FOSC; Campello et al, 2013) to extract clusterings from hierarchical clustering (i.e., #' [hclust] objects). #' Can be parameterized to perform unsupervised #' cluster extraction through a stability-based measure, or semisupervised #' cluster extraction through either a constraint-based extraction (with a #' stability-based tiebreaker) or a mixed (weighted) constraint and #' stability-based objective extraction. #' #' Campello et al (2013) suggested a _Framework for Optimal Selection of #' Clusters_ (FOSC) as a framework to make local (non-horizontal) cuts to any #' cluster tree hierarchy. This function implements the original extraction #' algorithms as described by the framework for hclust objects. Traditional #' cluster extraction methods from hierarchical representations (such as #' [hclust] objects) generally rely on global parameters or cutting values #' which are used to partition a cluster hierarchy into a set of disjoint, flat #' clusters. This is implemented in R in function [stats::cutree()]. 
#' Although such methods are widespread, using global parameter #' settings is inherently limited in that they cannot capture patterns within #' the cluster hierarchy at varying _local_ levels of granularity. #' #' Rather than partitioning a hierarchy based on the number of clusters one #' expects to find (\eqn{k}) or based on some linkage distance threshold #' (\eqn{H}), the FOSC proposes that the optimal clusters may exist at varying #' distance thresholds in the hierarchy. To enable this idea, FOSC requires one #' parameter (minPts) that represents _the minimum number of points that #' constitute a valid cluster._ The first step of the FOSC algorithm is to #' traverse the given cluster hierarchy divisively, recording new clusters at #' each split if both branches contain at least minPts points. Branches #' where one or both sides contain fewer than minPts points inherit the #' parent cluster's identity. Note that using FOSC, due to the constraint that #' minPts must be greater than or equal to 2, it is possible that the optimal #' cluster solution chosen makes local cuts that render parent branches of #' sizes less than minPts as noise, which are denoted as 0 in the final #' solution. #' #' Traversing the original cluster tree using minPts creates a new, simplified #' cluster tree that is then post-processed recursively to extract clusters #' that maximize for each cluster \eqn{C_i}{Ci} the cost function #' #' \deqn{\max_{\delta_2, \dots, \delta_k} J = \sum\limits_{i=2}^{k} \delta_i #' S(C_i)}{ J = \sum \delta S(Ci) for all i clusters, } where #' \eqn{S(C_i)}{S(Ci)} is the stability-based measure as \deqn{ S(C_i) = #' \sum_{x_j \in C_i}(\frac{1}{h_{min} (x_j, C_i)} - \frac{1}{h_{max} (C_i)}) #' }{ S(Ci) = \sum (1/Hmin(Xj, Ci) - 1/Hmax(Ci)) for all Xj in Ci.} #' #' \eqn{\delta_i}{\delta} represents an indicator function, which constrains #' the solution space such that clusters must be disjoint (cannot assign more #' than 1 label to each cluster).
The measure \eqn{S(C_i)}{S(Ci)} used by FOSC #' is an unsupervised validation measure based on the assumption that, if you #' vary the linkage/distance threshold across all possible values, more #' prominent clusters that survive over many threshold variations should be #' considered as stronger candidates of the optimal solution. For this reason, #' using this measure to detect clusters is referred to as an unsupervised, #' _stability-based_ extraction approach. In some cases it may be useful #' to enact _instance-level_ constraints that ensure the solution space #' conforms to linkage expectations known _a priori_. This general idea of #' using preliminary expectations to augment the clustering solution will be #' referred to as _semisupervised clustering_. If constraints are given in #' the call to `extractFOSC()`, the following alternative objective function #' is maximized: #' #' \deqn{J = \frac{1}{2n_c}\sum\limits_{j=1}^n \gamma (x_j)}{J = 1/(2 * nc) #' \sum \gamma(Xj)} #' #' \eqn{n_c}{nc} is the total number of constraints given and #' \eqn{\gamma(x_j)}{\gamma(Xj)} represents the number of constraints involving #' object \eqn{x_j}{Xj} that are satisfied. In the case of ties (such as #' solutions where no constraints were given), the unsupervised solution is #' used as a tiebreaker. See Campello et al (2013) for more details. 
#' #' As a third option, if one wishes to prioritize the degree at which the #' unsupervised and semisupervised solutions contribute to the overall optimal #' solution, the parameter \eqn{\alpha} can be set to enable the extraction of #' clusters that maximize the `mixed` objective function #' #' \deqn{J = \alpha S(C_i) + (1 - \alpha) \gamma(C_i)}{J = \alpha S(Ci) + (1 - #' \alpha) \gamma(Ci).} #' #' FOSC expects the pairwise constraints to be passed as either 1) an #' \eqn{n(n-1)/2} vector of integers representing the constraints, where 1 #' represents should-link, -1 represents should-not-link, and 0 represents no #' preference using the unsupervised solution (see below for examples). #' Alternatively, if only a few constraints are needed, a named list #' representing the (symmetric) adjacency list can be used, where the names #' correspond to indices of the points in the original data, and the values #' correspond to integer vectors of constraints (positive indices for #' should-link, negative indices for should-not-link). Again, see the examples #' section for a demonstration of this. #' #' The parameters to the input function correspond to the concepts discussed #' above. The `minPts` parameter represents the minimum cluster size to #' extract. The optional `constraints` parameter contains the pairwise, #' instance-level constraints of the data. The optional `alpha` parameter #' controls whether the mixed objective function is used (if `alpha` is #' greater than 0). If the `validate_constraints` parameter is set to #' `TRUE`, the constraints are checked (and fixed) for symmetry (if point A has a #' should-link constraint with point B, point B should also have the same #' constraint). Asymmetric constraints are not supported.
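The two constraint encodings described above can be sketched as follows (a minimal sketch drawn from the package's documented examples, assuming the dbscan package and its bundled moons data are available):

```r
library(dbscan)
data("moons")

cl <- hdbscan(moons, minPts = 5)

## adjacency-list form: point 12 should link with point 49 (positive index)
## and should not link with point 47 (negative index)
res_list <- extractFOSC(cl$hc, minPts = 5,
  constraints = list("12" = c(49, -47)))

## equivalent n(n-1)/2 vector form derived from distance thresholds:
## 1 = should-link, -1 = should-not-link, 0 = no preference
d <- dist(moons)
res_vec <- extractFOSC(cl$hc, minPts = 5,
  constraints = ifelse(d < 0.1, 1L, ifelse(d > 1, -1L, 0L)))
```

Both calls return a list with the flat `cluster` assignment (0 for noise) and the augmented `hc` object.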
#' #' Unstable branch pruning was not discussed by Campello et al (2013), however, #' in some data sets the scores of specific subbranches may be #' significantly greater than those of sibling and parent branches, and thus sibling #' branches should be considered as noise if their scores are cumulatively #' lower than the parent's. This can happen in extremely nonhomogeneous data #' sets, where there exist locally very stable branches surrounded by unstable #' branches that contain more than `minPts` points. #' `prune_unstable = TRUE` will remove the unstable branches. #' #' @family clustering functions #' #' @param x a valid [hclust] object created via [hclust()] or [hdbscan()]. #' @param constraints Either a list or matrix of pairwise constraints. If #' missing, an unsupervised measure of stability is used to make local cuts and #' extract the optimal clusters. See details. #' @param alpha numeric; weight between \eqn{[0, 1]} for mixed-objective #' semi-supervised extraction. Defaults to 0. #' @param minPts numeric; Defaults to 2. Only needed if class-less noise is a #' valid label in the model. #' @param prune_unstable logical; should significantly unstable subtrees be #' pruned? The default is `FALSE` for the original optimal extraction #' framework (see Campello et al, 2013). See details for what `TRUE` #' implies. #' @param validate_constraints logical; should constraints be checked for #' validity? See details for what are considered valid constraints. #' #' @returns A list with the elements: #' #' \item{cluster }{An integer vector with cluster assignments.
Zero #' indicates noise points (if any).} #' \item{hc }{The original [hclust] object with additional list elements #' `"stability"`, `"constraint"`, and `"total"` #' for the \eqn{n - 1} cluster-wide objective scores from the extraction.} #' #' @author Matt Piekenbrock #' @seealso [hclust()], [hdbscan()], [stats::cutree()] #' @references Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Joerg #' Sander (2013). A framework for semi-supervised and unsupervised optimal #' extraction of clusters from hierarchies. _Data Mining and Knowledge #' Discovery_ 27(3): 344-371. #' \doi{10.1007/s10618-013-0311-4} #' @keywords model clustering #' @examples #' data("moons") #' #' ## Regular HDBSCAN using stability-based extraction (unsupervised) #' cl <- hdbscan(moons, minPts = 5) #' cl$cluster #' #' ## Constraint-based extraction from the HDBSCAN hierarchy #' ## (w/ stability-based tiebreaker (semisupervised)) #' cl_con <- extractFOSC(cl$hc, minPts = 5, #' constraints = list("12" = c(49, -47))) #' cl_con$cluster #' #' ## Alternative formulation: Constraint-based extraction from the HDBSCAN hierarchy #' ## (w/ stability-based tiebreaker (semisupervised)) using distance thresholds #' dist_moons <- dist(moons) #' cl_con2 <- extractFOSC(cl$hc, minPts = 5, #' constraints = ifelse(dist_moons < 0.1, 1L, #' ifelse(dist_moons > 1, -1L, 0L))) #' #' cl_con2$cluster # same as the second example #' @export extractFOSC <- function(x, constraints, alpha = 0, minPts = 2L, prune_unstable = FALSE, validate_constraints = FALSE) { if (!inherits(x, "hclust")) stop("extractFOSC expects 'x' to be a valid hclust object.") # if constraints are given then they need to be a list, a matrix or a vector if (!( missing(constraints) || is.list(constraints) || is.matrix(constraints) || is.numeric(constraints) )) stop("extractFOSC expects constraints to be either an adjacency list or adjacency matrix.") if (!minPts >= 2) stop("minPts must be at least 2.") if (alpha < 0 || alpha > 1) stop("alpha can only 
take values in [0, 1].") n <- nrow(x$merge) + 1L ## First step for both unsupervised and semisupervised - compute stability scores cl_tree <- computeStability(x, minPts) ## Unsupervised Extraction if (missing(constraints)) { cl_tree <- extractUnsupervised(cl_tree, prune_unstable) } ## Semi-supervised Extraction else { ## If given as adjacency-list form if (is.list(constraints)) { ## Checks for proper indexing, symmetry of constraints, etc. if (validate_constraints) { is_valid <- max(as.integer(names(constraints))) < n is_valid <- is_valid && all(vapply(constraints, function(ilc) all(ilc <= n), logical(1L))) if (!is_valid) { stop("Detected constraint indices not in the interval [1, n]") } constraints <- validateConstraintList(constraints, n) } cl_tree <- extractSemiSupervised(cl_tree, constraints, alpha, prune_unstable) } ## Adjacency matrix given (probably from dist object), retrieve adjacency list form else if (is.vector(constraints)) { if (!all(constraints %in% c(-1, 0, 1))) { stop( "'extractFOSC' only accepts instance-level constraints. See ?extractFOSC for more details." ) } ## Checks for proper integer labels, symmetry of constraints, length of vector, etc. if (validate_constraints) { is_valid <- length(constraints) == choose(n, 2) constraints_list <- validateConstraintList(distToAdjacency(constraints, n), n) } else { constraints_list <- distToAdjacency(constraints, n) } cl_tree <- extractSemiSupervised(cl_tree, constraints_list, alpha, prune_unstable) } ## Full nxn adjacency-matrix given, give warning and retrieve adjacency list form else if (is.matrix(constraints)) { if (!all(constraints %in% c(-1, 0, 1))) { stop( "'extractFOSC' only accepts instance-level constraints. See ?extractFOSC for more details." ) } if (!all(dim(constraints) == c(n, n))) { stop("Given matrix is not square.") } warning( "Full nxn matrix given; extractFOSC does not support asymmetric relational constraints. Using lower triangular."
) constraints <- constraints[lower.tri(constraints)] ## Checks for proper integer labels, symmetry of constraints, length of vector, etc. if (validate_constraints) { is_valid <- length(constraints) == choose(n, 2) constraints_list <- validateConstraintList(distToAdjacency(constraints, n), n) } else { constraints_list <- distToAdjacency(constraints, n) } cl_tree <- extractSemiSupervised(cl_tree, constraints_list, alpha, prune_unstable) } else { stop( "'extractFOSC' doesn't know how to handle constraints of type ", class(constraints) ) } } cl_track <- attr(cl_tree, "cl_tracker") stability_score <- vapply(cl_track, function(cid) cl_tree[[as.character(cid)]]$stability, numeric(1L)) constraint_score <- vapply(cl_track, function(cid) cl_tree[[as.character(cid)]]$vscore %||% 0, numeric(1L)) total_score <- vapply(cl_track, function(cid) cl_tree[[as.character(cid)]]$score %||% 0, numeric(1L)) out <- append( x, list( cluster = cl_track, stability = stability_score, constraint = constraint_score, total = total_score ) ) extraction_type <- if (missing(constraints)) { "(w/ stability-based extraction)" } else if (alpha == 0) { "(w/ constraint-based extraction)" } else { "(w/ mixed-objective extraction)" } substrs <- strsplit(x$method, split = " \\(w\\/")[[1L]] out[["method"]] <- if (length(substrs) > 1) paste(substrs[[1]], extraction_type) else paste(out[["method"]], extraction_type) class(out) <- "hclust" return(list(cluster = attr(cl_tree, "cluster"), hc = out)) } ================================================ FILE: R/frNN.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # 
any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Find the Fixed Radius Nearest Neighbors #' #' This function uses a kd-tree to find the fixed radius nearest neighbors #' (including distances) fast. #' #' If `x` is specified as a data matrix, then Euclidean distances and fast #' nearest neighbor lookup using a kd-tree are used. #' #' To create a frNN object from scratch, you need to supply at least the #' elements `id` with a list of integer vectors with the nearest neighbor #' ids for each point and `eps` (see below). #' #' **Self-matches:** Self-matches are not returned! #' #' @aliases frNN frnn print.frnn #' @family NN functions #' #' @param x a data matrix, a dist object or a frNN object. #' @param eps neighborhood radius. #' @param query a data matrix with the points to query. If query is not #' specified, the NN for all the points in `x` is returned. If query is #' specified then `x` needs to be a data matrix. #' @param sort sort the neighbors by distance? This is expensive and can be #' done later using `sort()`. #' @param search nearest neighbor search strategy (one of `"kdtree"`, `"linear"` or #' `"dist"`). #' @param bucketSize max size of the kd-tree leaves. #' @param splitRule rule to split the kd-tree. One of `"STD"`, `"MIDPT"`, `"FAIR"`, #' `"SL_MIDPT"`, `"SL_FAIR"` or `"SUGGEST"` (SL stands for sliding). `"SUGGEST"` uses #' ANN's best guess. #' @param approx use approximate nearest neighbors. All NN up to a distance of #' a factor of `1 + approx` eps may be used.
Some actual NN may be omitted #' leading to spurious clusters and noise points. However, the algorithm will #' enjoy a significant speedup. #' @param decreasing sort in decreasing order? #' @param ... further arguments #' #' @returns #' #' `frNN()` returns an object of class [frNN] (subclass of #' [NN]) containing a list with the following components: #' \item{id }{a list of #' integer vectors. Each vector contains the ids (row numbers) of the fixed radius nearest #' neighbors. } #' \item{dist }{a list with distances (same structure as #' `id`). } #' \item{eps }{ neighborhood radius `eps` that was used. } #' \item{metric }{ used distance metric. } #' #' `adjacencylist()` returns a list with one entry per data point in `x`. Each entry #' contains the id of the nearest neighbors. #' #' @author Michael Hahsler #' #' @references David M. Mount and Sunil Arya (2010). ANN: A Library for #' Approximate Nearest Neighbor Searching, #' \url{http://www.cs.umd.edu/~mount/ANN/}. #' @keywords model #' @examples #' data(iris) #' x <- iris[, -5] #' #' # Example 1: Find fixed radius nearest neighbors for each point #' nn <- frNN(x, eps = .5) #' nn #' #' # Number of neighbors #' hist(lengths(adjacencylist(nn)), #' xlab = "k", main="Number of Neighbors", #' sub = paste("Neighborhood size eps =", nn$eps)) #' #' # Explore neighbors of point i = 10 #' i <- 10 #' nn$id[[i]] #' nn$dist[[i]] #' plot(x, col = ifelse(seq_len(nrow(iris)) %in% nn$id[[i]], "red", "black")) #' #' # get an adjacency list #' head(adjacencylist(nn)) #' #' # plot the fixed radius neighbors (and then reduced to a radius of .3) #' plot(nn, x) #' plot(frNN(nn, eps = .3), x) #' #' ## Example 2: find fixed-radius NN for query points #' q <- x[c(1,100),] #' nn <- frNN(x, eps = .5, query = q) #' #' plot(nn, x, col = "grey") #' points(q, pch = 3, lwd = 2) #' @export frNN frNN <- function(x, eps, query = NULL, sort = TRUE, search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0) { if (is.null(eps) || is.na(eps) 
|| eps < 0) stop("eps needs to be >=0.") if (inherits(x, "frNN")) { if (x$eps < eps) stop("frNN in x does not have a sufficient eps radius.") for (i in seq_along(x$dist)) { take <- x$dist[[i]] <= eps x$dist[[i]] <- x$dist[[i]][take] x$id[[i]] <- x$id[[i]][take] } x$eps <- eps return(x) } search <- .parse_search(search) splitRule <- .parse_splitRule(splitRule) ### dist search if (search == 3 && !inherits(x, "dist")) { if (.matrixlike(x)) x <- dist(x) else stop("x needs to be a matrix to calculate distances") } ### get frNN from a dist object in R if (inherits(x, "dist")) { if (!is.null(query)) stop("query can only be used if x contains the data.") if (anyNA(x)) stop("data/distances cannot contain NAs for frNN (with kd-tree)!") return(dist_to_frNN(x, eps = eps, sort = sort)) } ## make sure x is numeric if (!.matrixlike(x)) stop("x needs to be a matrix or a data.frame.") x <- as.matrix(x) if (storage.mode(x) == "integer") storage.mode(x) <- "double" if (storage.mode(x) != "double") stop("all data in x has to be numeric.") if (!is.null(query)) { if (!.matrixlike(query)) stop("query needs to be a matrix or a data.frame.") query <- as.matrix(query) if (storage.mode(query) == "integer") storage.mode(query) <- "double" if (storage.mode(query) != "double") stop("query has to be NULL or a numeric matrix or data.frame.") if (ncol(x) != ncol(query)) stop("x and query need to have the same number of columns!") } if (anyNA(x)) stop("data/distances cannot contain NAs for frNN (with kd-tree)!") ## returns NO self matches if (!is.null(query)) { ret <- frNN_query_int( as.matrix(x), as.matrix(query), as.double(eps), as.integer(search), as.integer(bucketSize), as.integer(splitRule), as.double(approx) ) names(ret$dist) <- rownames(query) names(ret$id) <- rownames(query) ret$metric <- "euclidean" } else { ret <- frNN_int( as.matrix(x), as.double(eps), as.integer(search), as.integer(bucketSize), as.integer(splitRule), as.double(approx) ) names(ret$dist) <- rownames(x) names(ret$id) <-
rownames(x) ret$metric <- "euclidean" } ret$eps <- eps ret$sort <- FALSE class(ret) <- c("frNN", "NN") if (sort) ret <- sort.frNN(ret) ret } # extract a row from a distance matrix without doubling space requirements dist_row <- function(x, i, self_val = 0) { n <- attr(x, "Size") i <- rep(i, times = n) j <- seq_len(n) swap_idx <- i > j tmp <- i[swap_idx] i[swap_idx] <- j[swap_idx] j[swap_idx] <- tmp diag_idx <- i == j idx <- n * (i - 1) - i * (i - 1) / 2 + j - i idx[diag_idx] <- NA val <- x[idx] val[diag_idx] <- self_val val } dist_to_frNN <- function(x, eps, sort = FALSE) { .check_dist(x) n <- attr(x, "Size") id <- list() d <- list() for (i in seq_len(n)) { ### Inf -> no self-matches y <- dist_row(x, i, self_val = Inf) o <- which(y <= eps) id[[i]] <- o d[[i]] <- y[o] } names(id) <- labels(x) names(d) <- labels(x) ret <- structure(list( dist = d, id = id, eps = eps, metric = attr(x, "method"), sort = FALSE ), class = c("frNN", "NN")) if (sort) ret <- sort.frNN(ret) return(ret) } #' @rdname frNN #' @export sort.frNN <- function(x, decreasing = FALSE, ...) { if (isTRUE(x$sort)) return(x) if (is.null(x$dist)) stop("Unable to sort. Distances are missing.") ## FIXME: This is slow do this in C++ n <- names(x$id) o <- lapply( seq_along(x$dist), FUN = function(i) order(x$dist[[i]], x$id[[i]], decreasing = decreasing) ) x$dist <- lapply( seq_along(o), FUN = function(p) x$dist[[p]][o[[p]]] ) x$id <- lapply( seq_along(o), FUN = function(p) x$id[[p]][o[[p]]] ) names(x$dist) <- n names(x$id) <- n x$sort <- TRUE x } #' @rdname frNN #' @export adjacencylist.frNN <- function(x, ...) x$id #' @rdname frNN #' @export print.frNN <- function(x, ...) 
{ cat( "fixed radius nearest neighbors for ", length(x$id), " objects (eps=", x$eps, ").", "\n", sep = "" ) cat("Distance metric:", x$metric, "\n") cat("\nAvailable fields: ", toString(names(x)), "\n", sep = "") } ================================================ FILE: R/hdbscan.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler, Matt Piekenbrock # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Hierarchical DBSCAN (HDBSCAN) #' #' Fast C++ implementation of HDBSCAN (Hierarchical DBSCAN) and related #' algorithms. #' #' This fast implementation of HDBSCAN (Campello et al., 2013) computes the #' hierarchical cluster tree representing density estimates along with the #' stability-based flat cluster extraction. HDBSCAN essentially computes the #' hierarchy of all DBSCAN* clusterings, and #' then uses a stability-based extraction method to find optimal cuts in the #' hierarchy, thus producing a flat solution. #' #' HDBSCAN performs the following steps: #' #' 1. Compute mutual reachability distance mrd between points #' (based on distances and core distances). #' 2. Use mrd as a distance measure to construct a minimum spanning tree. #' 3.
Prune the tree using stability. #' 4. Extract the clusters. #' #' Additionally, related algorithms are available: the "Global-Local Outlier Score #' from Hierarchies" (GLOSH; see section 6 of Campello et al., 2015) #' is implemented in function [glosh()], #' and clustering based on instance-level constraints (see #' section 5.3 of Campello et al. 2015) is supported. These algorithms only need #' the parameter `minPts`. #' #' Note that `minPts` not only acts as a minimum cluster size to detect, #' but also as a "smoothing" factor of the density estimates implicitly #' computed from HDBSCAN. #' #' When using the optional parameter `cluster_selection_epsilon`, #' a combination between DBSCAN* and HDBSCAN* can be achieved #' (see Malzer & Baum 2020). This means that part of the #' tree is affected by `cluster_selection_epsilon` as if #' running DBSCAN* with `eps` = `cluster_selection_epsilon`. #' The remaining part (on levels above the threshold) is still #' processed by HDBSCAN*'s stability-based selection algorithm #' and can therefore return clusters of variable densities. #' Note that there is not always a remaining part, especially if #' the parameter value is chosen too large, or if there aren't #' enough clusters of variable densities. In this case, the result #' will be equal to DBSCAN*. #' `cluster_selection_epsilon` is especially useful for cases #' where HDBSCAN* produces too many small clusters that #' need to be merged, while still being able to extract clusters #' of variable densities at higher levels. #' #' `coredist()`: The core distance is defined for each point as #' the distance to its `minPts - 1`-th nearest neighbor. #' It is a density estimate equivalent to `kNNdist()` with `k = minPts - 1`. #' #' `mrdist()`: The mutual reachability distance is defined between two points as #' `mrd(a, b) = max(coredist(a), coredist(b), dist(a, b))`. This distance metric is used by #' HDBSCAN. It has the effect of increasing distances in low density areas.
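The `coredist()`/`mrdist()` definitions above can be checked with a few lines of base R. This is only an illustrative sketch (the toy matrix and `minPts = 3` are made up here); `hdbscan()` itself uses the package's C++ implementation.

```r
# Toy data: three close points on a line plus one far-away point.
x <- matrix(c(0, 0,
              0, 1,
              0, 2,
              10, 0), ncol = 2, byrow = TRUE)
d <- as.matrix(dist(x))
minPts <- 3

# Core distance: distance to the (minPts - 1)-th nearest neighbor.
# Each sorted row starts with the self-distance 0, so index minPts skips it.
core <- apply(d, 1, function(row) sort(row)[minPts])

# Mutual reachability: mrd(a, b) = max(core(a), core(b), d(a, b)).
mrd <- pmax(outer(core, core, pmax), d)
```

Note how the mutual reachability distance between the two closest points is lifted from their Euclidean distance 1 to the larger core distance 2, which is exactly the "increasing distances in low density areas" effect described above.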
#' #' `predict()` assigns each new data point to the same cluster as the nearest point #' if it is not more than that point's core distance away. Otherwise the new point #' is classified as a noise point (i.e., cluster ID 0). #' @aliases hdbscan HDBSCAN print.hdbscan #' #' @family HDBSCAN functions #' @family clustering functions #' #' @param x a data matrix (Euclidean distances are used) or a [dist] object #' calculated with an arbitrary distance metric. #' @param minPts integer; Minimum size of clusters. See details. #' @param cluster_selection_epsilon double; a distance threshold below which #' no clusters should be selected (see Malzer & Baum 2020). #' @param gen_hdbscan_tree logical; should the robust single linkage tree be #' explicitly computed (see cluster tree in Chaudhuri et al, 2010). #' @param gen_simplified_tree logical; should the simplified hierarchy be #' explicitly computed (see Campello et al, 2013). #' @param verbose report progress. #' @param ... additional arguments are passed on. #' @param scale integer; used to scale condensed tree based on the graphics #' device. Lower scale results in wider colored tree lines. #' The default `'suggest'` sets scale to the number of clusters. #' @param gradient character vector; the colors to build the condensed tree #' coloring with. #' @param show_flat logical; whether to draw boxes indicating the most stable #' clusters. #' @param coredist numeric vector with precomputed core distances (optional). #' #' @return `hdbscan()` returns an object of class `hdbscan` with the following components: #' \item{cluster }{An integer vector with cluster assignments. Zero indicates #' noise points.} #' \item{minPts }{ value of the `minPts` parameter.} #' \item{cluster_scores }{The sum of the stability scores for each salient #' (flat) cluster. Corresponds to cluster IDs given in the `"cluster"` element. #' } #' \item{membership_prob }{The probability or individual stability of a #' point within its clusters.
Between 0 and 1.} #' \item{outlier_scores }{The GLOSH outlier score of each point. } #' \item{hc }{An [hclust] object of the HDBSCAN hierarchy. } #' #' `coredist()` returns a vector with the core distance for each data point. #' #' `mrdist()` returns a [dist] object containing pairwise mutual reachability distances. #' #' @author Matt Piekenbrock #' @author Claudia Malzer (added cluster_selection_epsilon) #' #' @references #' Campello RJGB, Moulavi D, Sander J (2013). Density-Based Clustering Based on #' Hierarchical Density Estimates. Proceedings of the 17th Pacific-Asia #' Conference on Knowledge Discovery in Databases, PAKDD 2013, _Lecture Notes #' in Computer Science_ 7819, p. 160. #' \doi{10.1007/978-3-642-37456-2_14} #' #' Campello RJGB, Moulavi D, Zimek A, Sander J (2015). Hierarchical density #' estimates for data clustering, visualization, and outlier detection. #' _ACM Transactions on Knowledge Discovery from Data (TKDD),_ 10(5):1-51. #' \doi{10.1145/2733381} #' #' Malzer, C., & Baum, M. (2020). A Hybrid Approach To Hierarchical #' Density-based Cluster Selection. #' In 2020 IEEE International Conference on Multisensor Fusion #' and Integration for Intelligent Systems (MFI), pp. 223-228. #' \doi{10.1109/MFI49285.2020.9235263} #' @keywords model clustering hierarchical #' @examples #' ## cluster the moons data set with HDBSCAN #' data(moons) #' #' res <- hdbscan(moons, minPts = 5) #' res #' #' plot(res) #' clplot(moons, res) #' #' ## cluster the moons data set with HDBSCAN using Manhattan distances #' res <- hdbscan(dist(moons, method = "manhattan"), minPts = 5) #' plot(res) #' clplot(moons, res) #' #' ## Example for HDBSCAN(e) using cluster_selection_epsilon #' # data with clusters of various densities. 
#' X <- data.frame( #' x = c( #' 0.08, 0.46, 0.46, 2.95, 3.50, 1.49, 6.89, 6.87, 0.21, 0.15, #' 0.15, 0.39, 0.80, 0.80, 0.37, 3.63, 0.35, 0.30, 0.64, 0.59, 1.20, 1.22, #' 1.42, 0.95, 2.70, 6.36, 6.36, 6.36, 6.60, 0.04, 0.71, 0.57, 0.24, 0.24, #' 0.04, 0.04, 1.35, 0.82, 1.04, 0.62, 0.26, 5.98, 1.67, 1.67, 0.48, 0.15, #' 6.67, 6.67, 1.20, 0.21, 3.99, 0.12, 0.19, 0.15, 6.96, 0.26, 0.08, 0.30, #' 1.04, 1.04, 1.04, 0.62, 0.04, 0.04, 0.04, 0.82, 0.82, 1.29, 1.35, 0.46, #' 0.46, 0.04, 0.04, 5.98, 5.98, 6.87, 0.37, 6.47, 6.47, 6.47, 6.67, 0.30, #' 1.49, 3.21, 3.21, 0.75, 0.75, 0.46, 0.46, 0.46, 0.46, 3.63, 0.39, 3.65, #' 4.09, 4.01, 3.36, 1.43, 3.28, 5.94, 6.35, 6.87, 5.60, 5.99, 0.12, 0.00, #' 0.32, 0.39, 0.00, 1.63, 1.36, 5.67, 5.60, 5.79, 1.10, 2.99, 0.39, 0.18 #' ), #' y = c( #' 7.41, 8.01, 8.01, 5.44, 7.11, 7.13, 1.83, 1.83, 8.22, 8.08, #' 8.08, 7.20, 7.83, 7.83, 8.29, 5.99, 8.32, 8.22, 7.38, 7.69, 8.22, 7.31, #' 8.25, 8.39, 6.34, 0.16, 0.16, 0.16, 1.66, 7.55, 7.90, 8.18, 8.32, 8.32, #' 7.97, 7.97, 8.15, 8.43, 7.83, 8.32, 8.29, 1.03, 7.27, 7.27, 8.08, 7.27, #' 0.79, 0.79, 8.22, 7.73, 6.62, 7.62, 8.39, 8.36, 1.73, 8.29, 8.04, 8.22, #' 7.83, 7.83, 7.83, 8.32, 8.11, 7.69, 7.55, 7.20, 7.20, 8.01, 8.15, 7.55, #' 7.55, 7.97, 7.97, 1.03, 1.03, 1.24, 7.20, 0.47, 0.47, 0.47, 0.79, 8.22, #' 7.13, 6.48, 6.48, 7.10, 7.10, 8.01, 8.01, 8.01, 8.01, 5.99, 8.04, 5.22, #' 5.82, 5.14, 4.81, 7.62, 5.73, 0.55, 1.31, 0.05, 0.95, 1.59, 7.99, 7.48, #' 8.38, 7.12, 2.01, 1.40, 0.00, 9.69, 9.47, 9.25, 2.63, 6.89, 0.56, 3.11 #' ) #' ) #' #' ## HDBSCAN splits one cluster #' hdb <- hdbscan(X, minPts = 3) #' plot(hdb, show_flat = TRUE) #' hullplot(X, hdb, main = "HDBSCAN") #' #' ## DBSCAN* marks the least dense cluster as outliers #' db <- dbscan(X, eps = 1, minPts = 3, borderPoints = FALSE) #' hullplot(X, db, main = "DBSCAN*") #' #' ## HDBSCAN(e) mixes HDBSCAN AND DBSCAN* to find all clusters #' hdbe <- hdbscan(X, minPts = 3, cluster_selection_epsilon = 1) #' plot(hdbe, show_flat = TRUE) #' 
hullplot(X, hdbe, main = "HDBSCAN(e)") #' @export hdbscan <- function(x, minPts, cluster_selection_epsilon = 0.0, gen_hdbscan_tree = FALSE, gen_simplified_tree = FALSE, verbose = FALSE) { if (!inherits(x, "dist") && !.matrixlike(x)) { stop("hdbscan expects a numeric matrix or a dist object.") } ## 1. Calculate the mutual reachability between points if (verbose) { cat("Calculating core distances...\n") } coredist <- coredist(x, minPts) if (verbose) { cat("Calculating the mutual reachability matrix distances...\n") } mrd <- mrdist(x, minPts, coredist = coredist) n <- attr(mrd, "Size") ## 2. Construct a minimum spanning tree and convert to RSL representation if (verbose) { cat("Constructing the minimum spanning tree...\n") } mst <- mst(mrd, n) hc <- hclustMergeOrder(mst, order(mst[, 3])) hc$call <- match.call() ## 3. Prune the tree ## Process the hierarchy to retrieve all the necessary info needed by HDBSCAN if (verbose) { cat("Tree pruning...\n") } res <- computeStability(hc, minPts, compute_glosh = TRUE) res <- extractUnsupervised(res, cluster_selection_epsilon = cluster_selection_epsilon) cl <- attr(res, "cluster") ## 4. 
Extract the clusters if (verbose) { cat("Extract clusters...\n") } sl <- attr(res, "salient_clusters") ## Generate membership 'probabilities' using core distance as the measure of density prob <- rep(0, length(cl)) for (cid in sl) { max_f <- max(coredist[which(cl == cid)]) pr <- (max_f - coredist[which(cl == cid)]) / max_f prob[cl == cid] <- pr } ## Match cluster assignments to be incremental, with 0 representing noise if (any(cl == 0)) { cluster <- match(cl, c(0, sl)) - 1 } else { cluster <- match(cl, sl) } cl_map <- structure(sl, names = unique(cluster[hc$order][cluster[hc$order] != 0])) ## Stability scores ## NOTE: These scores represent the stability scores -before- the hierarchy traversal cluster_scores <- vapply(sl, function(sl_cid) { res[[as.character(sl_cid)]]$stability }, numeric(1L)) names(cluster_scores) <- names(cl_map) ## Return everything HDBSCAN does attr(res, "cl_map") <- cl_map # Mapping of hierarchical IDS to 'normalized' incremental ids out <- structure( list( cluster = cluster, minPts = minPts, coredist = coredist, cluster_scores = cluster_scores, # (Cluster-wide cumulative) Stability Scores membership_prob = prob, # Individual point membership probabilities outlier_scores = attr(res, "glosh"), # Outlier Scores hc = hc # Hclust object of MST (can be cut for quick assignments) ), class = "hdbscan", hdbscan = res ) # hdbscan attributes contains actual HDBSCAN hierarchy ## The trees don't need to be explicitly computed, but they may be useful if the user wants them if (gen_hdbscan_tree) { out$hdbscan_tree <- buildDendrogram(hc) } if (gen_simplified_tree) { out$simplified_tree <- simplifiedTree(res) } return(out) } #' @rdname hdbscan #' @export print.hdbscan <- function(x, ...) { writeLines(c( paste0("HDBSCAN clustering for ", nobs(x), " objects."), paste0("Parameters: minPts = ", x$minPts), paste0( "The clustering contains ", ncluster(x), " cluster(s) and ", nnoise(x), " noise points." 
) )) print(table(x$cluster)) cat("\n") writeLines(strwrap(paste0("Available fields: ", toString(names( x ))), exdent = 18)) } #' @rdname hdbscan #' @param leaflab a string specifying how leaves are labeled (see [stats::plot.dendrogram()]). #' @param ylab the label for the y axis. #' @param main Title of the plot. #' @export plot.hdbscan <- function(x, scale = "suggest", gradient = c("yellow", "red"), show_flat = FALSE, main = "HDBSCAN*", ylab = "eps value", leaflab = "none", ...) { ## Logic checks if (!(scale == "suggest" || scale > 0)) { stop("scale parameter must be greater than 0.") } ## Main information needed hd_info <- attr(x, "hdbscan") dend <- x$simplified_tree %||% simplifiedTree(hd_info) coords <- node_xy(hd_info, cl_hierarchy = attr(hd_info, "cl_hierarchy")) ## Variables to help setup the scaling of the plotting nclusters <- length(hd_info) npoints <- length(x$cluster) nleaves <- length(all_children( attr(hd_info, "cl_hierarchy"), key = 0, leaves_only = TRUE )) scale <- ifelse(scale == "suggest", nclusters, nclusters / scale) ## Color variables col_breaks <- seq(0, length(x$cluster) + nclusters, by = nclusters) gcolors <- grDevices::colorRampPalette(gradient)(length(col_breaks)) ## Depth-first search to recursively plot rectangles eps_dfs <- function(dend, index, parent_height, scale) { coord <- coords[index, ] cl_key <- as.character(attr(dend, "label")) ## widths == number of points in the cluster at each eps it was alive widths <- vapply(sort(hd_info[[cl_key]]$eps, decreasing = TRUE), function(eps) { sum(hd_info[[cl_key]]$eps <= eps) }, numeric(1L)) if (length(widths) > 0) { widths <- c(widths + hd_info[[cl_key]]$n_children, rep(hd_info[[cl_key]]$n_children, hd_info[[cl_key]]$n_children)) } else { widths <- rep(hd_info[[cl_key]]$n_children, hd_info[[cl_key]]$n_children) } ## Normalize and scale widths to length of x-axis normalize <- function(x) { (nleaves) * (x - 1) / (npoints - 1) } xleft <- coord[[1]] - normalize(widths) / scale xright <- coord[[1]] 
+ normalize(widths) / scale ## Top is always parent height, bottom is when the points died ## Minor adjustment made if at the root equivalent to plot.dendrogram(edge.root=T) if (cl_key == "0") { ytop <- rep(hd_info[[cl_key]]$eps_birth + 0.0625 * hd_info[[cl_key]]$eps_birth, length(widths)) ybottom <- rep(hd_info[[cl_key]]$eps_death, length(widths)) } else { ytop <- rep(parent_height, length(widths)) ybottom <- c( sort(hd_info[[cl_key]]$eps, decreasing = TRUE), rep(hd_info[[cl_key]]$eps_death, hd_info[[cl_key]]$n_children) ) } ## Draw the rectangles rect_color <- gcolors[.bincode(length(widths), breaks = col_breaks)] graphics::rect( xleft = xleft, xright = xright, ybottom = ybottom, ytop = ytop, col = rect_color, border = NA, lwd = 0 ) ## Highlight the most 'stable' clusters returned by the default flat cluster extraction if (show_flat) { salient_cl <- attr(hd_info, "salient_clusters") if (as.integer(attr(dend, "label")) %in% salient_cl) { x_adjust <- (max(xright) - min(xleft)) * 0.10 # 10% left/right border y_adjust <- (max(ytop) - min(ybottom)) * 0.025 # 2.5% above/below border graphics::rect( xleft = min(xleft) - x_adjust, xright = max(xright) + x_adjust, ybottom = min(ybottom) - y_adjust, ytop = max(ytop) + y_adjust, border = "red", lwd = 1 ) n_label <- names(which(attr(hd_info, "cl_map") == attr(dend, "label"))) text( x = coord[[1]], y = min(ybottom), pos = 1, labels = n_label ) } } ## Recurse in depth-first-manner if (is.leaf(dend)) { return(index) } else { left <- eps_dfs( dend[[1]], index = index + 1, parent_height = attr(dend, "height"), scale = scale ) right <- eps_dfs( dend[[2]], index = left + 1, parent_height = attr(dend, "height"), scale = scale ) return(right) } } ## Run the recursive plotting plot( dend, edge.root = TRUE, main = main, ylab = ylab, leaflab = leaflab, ... 
) eps_dfs(dend, index = 1, parent_height = 0, scale = scale) return(invisible(x)) } #' @rdname hdbscan #' @export coredist <- function(x, minPts) kNNdist(x, k = minPts - 1) #' @rdname hdbscan #' @export mrdist <- function(x, minPts, coredist = NULL) { if (inherits(x, "dist")) { .check_dist(x) x_dist <- x } else { x_dist <- dist(x, method = "euclidean", diag = FALSE, upper = FALSE) } if (is.null(coredist)) { coredist <- coredist(x, minPts) } # mr_dist <- as.vector(pmax(as.dist(outer(coredist, coredist, pmax)), x_dist)) # much faster in C++ mr_dist <- mrd(x_dist, coredist) class(mr_dist) <- "dist" attr(mr_dist, "Size") <- attr(x_dist, "Size") attr(mr_dist, "Diag") <- FALSE attr(mr_dist, "Upper") <- FALSE attr(mr_dist, "method") <- paste0("mutual reachability (", attr(x_dist, "method"), ")") mr_dist } ================================================ FILE: R/hullplot.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Plot Clusters #' #' This function produces a two-dimensional scatter plot of data points #' and colors the data points according to a supplied clustering. 
Noise points #' are marked as `x`. `hullplot()` also adds convex hulls to clusters. #' #' @name hullplot #' @aliases hullplot clplot #' #' @param x a data matrix. If more than 2 columns are provided, then the data #' is plotted using the first two principal components. #' @param cl a clustering. Either a numeric cluster assignment vector or a #' clustering object (a list with an element named `cluster`). #' @param col colors used for clusters. Defaults to the standard palette. The #' first color (default is black) is used for noise/unassigned points (cluster #' id 0). #' @param pch a vector of plotting characters. By default `o` is used for #' points and `x` for noise points. #' @param cex expansion factor for symbols. #' @param hull_lwd,hull_lty line width and line type used for the convex hull. #' @param main main title. #' @param solid,alpha draw filled polygons instead of just lines for the convex #' hulls? alpha controls the level of alpha shading. #' @param ... additional arguments passed on to plot. 
#' @author Michael Hahsler #' @keywords plot clustering #' @examples #' set.seed(2) #' n <- 400 #' #' x <- cbind( #' x = runif(4, 0, 1) + rnorm(n, sd = 0.1), #' y = runif(4, 0, 1) + rnorm(n, sd = 0.1) #' ) #' cl <- rep(1:4, times = 100) #' #' #' ### original data with true clustering #' clplot(x, cl, main = "True clusters") #' hullplot(x, cl, main = "True clusters") #' ### use different symbols #' hullplot(x, cl, main = "True clusters", pch = cl) #' ### just the hulls #' hullplot(x, cl, main = "True clusters", pch = NA) #' ### a version suitable for b/w printing #' hullplot(x, cl, main = "True clusters", solid = FALSE, #' col = c("grey", "black"), pch = cl) #' #' #' ### run some clustering algorithms and plot the results #' db <- dbscan(x, eps = .07, minPts = 10) #' clplot(x, db, main = "DBSCAN") #' hullplot(x, db, main = "DBSCAN") #' #' op <- optics(x, eps = 10, minPts = 10) #' opDBSCAN <- extractDBSCAN(op, eps_cl = .07) #' hullplot(x, opDBSCAN, main = "OPTICS") #' #' opXi <- extractXi(op, xi = 0.05) #' hullplot(x, opXi, main = "OPTICSXi") #' #' # Extract minimal 'flat' clusters only #' opXi <- extractXi(op, xi = 0.05, minimum = TRUE) #' hullplot(x, opXi, main = "OPTICSXi") #' #' km <- kmeans(x, centers = 4) #' hullplot(x, km, main = "k-means") #' #' hc <- cutree(hclust(dist(x)), k = 4) #' hullplot(x, hc, main = "Hierarchical Clustering") #' @export hullplot <- function(x, cl, col = NULL, pch = NULL, cex = 0.5, hull_lwd = 1, hull_lty = 1, solid = TRUE, alpha = .2, main = "Convex Cluster Hulls", ...)
{ ### handle d>2 by using PCA if (ncol(x) > 2) x <- prcomp(x)$x ### extract clustering (keep hierarchical OPTICSXi structure) if (inherits(cl, "optics") || "clusters_xi" %in% names(cl)) { clusters_xi <- cl$clusters_xi cl_order <- cl$order } else clusters_xi <- NULL if (is.list(cl)) cl <- cl$cluster if (!is.numeric(cl)) stop("Could not get cluster assignment vector from cl.") #if(is.null(col)) col <- c("#000000FF", rainbow(n=max(cl))) if (is.null(col)) col <- palette() # Note: We use the first color for noise points if (length(col) == 1L) col <- c(col, col) col_noise <- col[1] col <- col[-1] if (max(cl) > length(col)) { warning("Not enough colors. Some colors will be reused.") col <- rep(col, length.out = max(cl)) } # mark noise points pch <- pch %||% ifelse(cl == 0L, 4L, 1L) plot(x[, 1:2], col = c(col_noise, col)[cl + 1L], pch = pch, cex = cex, main = main, ...) col_poly <- adjustcolor(col, alpha.f = alpha) border <- col ## no border? if (is.null(hull_lwd) || is.na(hull_lwd) || hull_lwd == 0) { hull_lwd <- 1 border <- NA } if (!is.null(clusters_xi)) { ## This is necessary for larger datasets: Ensure largest is plotted first clusters_xi <- clusters_xi[order(-(clusters_xi$end - clusters_xi$start)), ] # Order by size (descending) ci_order <- clusters_xi$cluster_id } else { ci_order <- 1:max(cl) } for (i in seq_along(ci_order)) { ### use all the points for OPTICSXi's hierarchical structure if (is.null(clusters_xi)) { d <- x[cl == i, , drop = FALSE] } else { d <- x[cl_order[clusters_xi$start[i]:clusters_xi$end[i]], , drop = FALSE] } ch <- chull(d) ch <- c(ch, ch[1]) if (!solid) { lines(d[ch, ], col = border[ci_order[i]], lwd = hull_lwd, lty = hull_lty) } else { polygon( d[ch, ], col = col_poly[ci_order[i]], lwd = hull_lwd, lty = hull_lty, border = border[ci_order[i]] ) } } } #' @rdname hullplot #' @export clplot <- function(x, cl, col = NULL, pch = NULL, cex = 0.5, main = "Cluster Plot", ...)
hullplot(x, cl = cl, col = col, pch = pch, cex = cex, main = main, solid = FALSE, hull_lwd = NA) ================================================ FILE: R/jpclust.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2017 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Jarvis-Patrick Clustering #' #' Fast C++ implementation of Jarvis-Patrick clustering, which first builds #' a shared nearest neighbor graph (k nearest neighbor sparsification) and then #' places two points in the same cluster if they are in each other's nearest #' neighbor list and they share at least kt nearest neighbors. #' #' Following the original paper, the shared nearest neighbor list is #' constructed as the k neighbors plus the point itself (as neighbor zero). #' Therefore, the threshold `kt` needs to be in the range \eqn{[1, k]}. #' #' Fast nearest neighbor search with [kNN()] is only used if `x` is #' a matrix. In this case Euclidean distance is used. #' #' @aliases jpclust print.general_clustering #' @family clustering functions #' #' @param x a data matrix/data.frame (Euclidean distance is used), a #' precomputed [dist] object or a kNN object created with [kNN()].
#' @param k Neighborhood size for nearest neighbor sparsification. If `x` #' is a kNN object then `k` may be missing. #' @param kt threshold on the number of shared nearest neighbors (including the #' points themselves) to form clusters. Range: \eqn{[1, k]} #' @param ... additional arguments are passed on to the k nearest neighbor #' search algorithm. See [kNN()] for details on how to control the #' search strategy. #' #' @return An object of class `general_clustering` with the following #' components: #' \item{cluster }{An integer vector with cluster assignments. Zero #' indicates noise points.} #' \item{type }{ name of the clustering algorithm used.} #' \item{metric }{ the distance metric used for clustering.} #' \item{param }{ list of the clustering parameters used. } #' #' @author Michael Hahsler #' @references R. A. Jarvis and E. A. Patrick. 1973. Clustering Using a #' Similarity Measure Based on Shared Near Neighbors. _IEEE Trans. Comput. #' 22,_ 11 (November 1973), 1025-1034. #' \doi{10.1109/T-C.1973.223640} #' @keywords model clustering #' @examples #' data("DS3") #' #' # use a shared neighborhood of 20 points and require 12 shared neighbors #' cl <- jpclust(DS3, k = 20, kt = 12) #' cl #' #' clplot(DS3, cl) #' # Note: JP clustering does not consider noise and thus, #' # the sine wave points chain clusters together. #' #' # use a precomputed kNN object instead of the original data. #' nn <- kNN(DS3, k = 30) #' nn #' #' cl <- jpclust(nn, k = 20, kt = 12) #' cl #' #' # cluster with noise removed (use low pointdensity to identify noise) #' d <- pointdensity(DS3, eps = 25) #' hist(d, breaks = 20) #' DS3_noiseless <- DS3[d > 110,] #' #' cl <- jpclust(DS3_noiseless, k = 20, kt = 10) #' cl #' #' clplot(DS3_noiseless, cl) #' @export jpclust <- function(x, k, kt, ...) { # Create NN graph if (missing(k) && inherits(x, "kNN")) k <- x$k if (length(kt) != 1 || kt < 1 || kt > k) stop("kt needs to be a threshold in range [1, k].") nn <- kNN(x, k, sort = FALSE, ...)
# Perform clustering cl <- JP_int(nn$id, kt = as.integer(kt)) structure( list( cluster = as.integer(factor(cl)), type = "Jarvis-Patrick clustering", metric = nn$metric, param = list(k = k, kt = kt) ), class = c("general_clustering") ) } #' @export print.general_clustering <- function(x, ...) { cl <- unique(x$cluster) cl <- length(cl[cl != 0L]) writeLines(c( paste0(x$type, " for ", length(x$cluster), " objects."), paste0("Parameters: ", paste( names(x$param), unlist(x$param, use.names = FALSE), sep = " = ", collapse = ", " )), paste0( "The clustering contains ", cl, " cluster(s) and ", sum(x$cluster == 0L), " noise points." ) )) print(table(x$cluster)) cat("\n") writeLines(strwrap(paste0( "Available fields: ", toString(names(x)) ), exdent = 18)) } ================================================ FILE: R/kNN.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' Find the k Nearest Neighbors #' #' This function uses a kd-tree to find all k nearest neighbors in a data #' matrix (including distances) fast. 
#'
#' **Ties:** If the kth and the (k+1)th nearest neighbor are tied, then the
#' neighbor found first is returned and the other one is ignored.
#'
#' **Self-matches:** If no query is specified, then self-matches are
#' removed.
#'
#' Details on the search parameters:
#'
#' * `search` controls whether a kd-tree or a linear search (both implemented
#' in the ANN library; see Mount and Arya, 2010) is used. Note that these
#' implementations cannot handle NAs. `search = "dist"` precomputes Euclidean
#' distances first using R. NAs in the data are handled, but the resulting
#' distance matrix cannot contain NAs. To use other distance measures, a
#' precomputed distance matrix can be provided as `x` (`search` is ignored).
#'
#' * `bucketSize` and `splitRule` influence how the kd-tree is
#' built. `approx` uses the approximate nearest neighbor search
#' implemented in ANN. All nearest neighbors up to a distance of
#' `eps / (1 + approx)` will be considered and all with a distance
#' greater than `eps` will not be considered. The other points might be
#' considered. Note that this results in some actual nearest neighbors being
#' omitted, leading to spurious clusters and noise points. However, the
#' algorithm will enjoy a significant speedup. For more details see Mount and
#' Arya (2010).
#'
#' @aliases kNN knn
#' @family NN functions
#'
#' @param x a data matrix, a [dist] object or a [kNN] object.
#' @param k number of neighbors to find.
#' @param query a data matrix with the points to query. If query is not
#' specified, the NN for all the points in `x` is returned. If query is
#' specified then `x` needs to be a data matrix.
#' @param search nearest neighbor search strategy (one of `"kdtree"`, `"linear"` or
#' `"dist"`).
#' @param sort sort the neighbors by distance? Note that some search methods
#' already sort the results. Sorting is expensive and `sort = FALSE` may
#' be much faster for some search methods. kNN objects can be sorted using
#' `sort()`.
#' @param bucketSize max size of the kd-tree leaves.
#' @param splitRule rule to split the kd-tree. One of `"STD"`, `"MIDPT"`, `"FAIR"`,
#' `"SL_MIDPT"`, `"SL_FAIR"` or `"SUGGEST"` (SL stands for sliding). `"SUGGEST"` uses
#' ANN's best guess.
#' @param approx use approximate nearest neighbors. All NN up to a distance of
#' a factor of `(1 + approx)` times `eps` may be used. Some actual NN may be
#' omitted, leading to spurious clusters and noise points. However, the
#' algorithm will enjoy a significant speedup.
#' @param decreasing sort in decreasing order?
#' @param ... further arguments
#'
#' @return An object of class `kNN` (subclass of [NN]) containing a
#' list with the following components:
#' \item{dist }{a matrix with distances. }
#' \item{id }{a matrix with `ids`. }
#' \item{k }{number `k` used. }
#' \item{metric }{ used distance metric. }
#'
#' @author Michael Hahsler
#' @references David M. Mount and Sunil Arya (2010). ANN: A Library for
#' Approximate Nearest Neighbor Searching,
#' \url{http://www.cs.umd.edu/~mount/ANN/}.
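The accuracy/speed trade-off of `approx` can be checked empirically. A small sketch (random data; the parameter values are chosen arbitrarily for illustration):

```r
library(dbscan)

set.seed(1)
x <- matrix(runif(2000), ncol = 2)      # 1000 random 2-d points

nn_exact  <- kNN(x, k = 5)              # exact kd-tree search
nn_approx <- kNN(x, k = 5, approx = 2)  # approximate search, faster on large data

# fraction of neighbor ids agreeing with the exact search;
# with approx > 0 some true neighbors may be missed
mean(nn_approx$id == nn_exact$id)
```

The agreement fraction depends on the data and on `approx`; larger `approx` values trade more accuracy for speed.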
#' @keywords model
#' @examples
#' data(iris)
#' x <- iris[, -5]
#'
#' # Example 1: finding kNN for all points in a data matrix (using a kd-tree)
#' nn <- kNN(x, k = 5)
#' nn
#'
#' # explore neighborhood of point 10
#' i <- 10
#' nn$id[i,]
#' plot(x, col = ifelse(seq_len(nrow(iris)) %in% nn$id[i,], "red", "black"))
#'
#' # visualize the 5 nearest neighbors
#' plot(nn, x)
#'
#' # visualize a reduced 2-NN graph
#' plot(kNN(nn, k = 2), x)
#'
#' # Example 2: find kNN for query points
#' q <- x[c(1,100),]
#' nn <- kNN(x, k = 10, query = q)
#'
#' plot(nn, x, col = "grey")
#' points(q, pch = 3, lwd = 2)
#'
#' # Example 3: find kNN using distances
#' d <- dist(x, method = "manhattan")
#' nn <- kNN(d, k = 1)
#' plot(nn, x)
#' @export
kNN <- function(x,
  k,
  query = NULL,
  sort = TRUE,
  search = "kdtree",
  bucketSize = 10,
  splitRule = "suggest",
  approx = 0) {
  if (inherits(x, "kNN")) {
    if (x$k < k)
      stop("kNN in x does not have enough nearest neighbors.")
    if (!x$sort)
      x <- sort(x)

    x$id <- x$id[, 1:k]
    if (!is.null(x$dist))
      x$dist <- x$dist[, 1:k]
    if (!is.null(x$shared))
      x$shared <- x$shared[, 1:k]
    x$k <- k
    return(x)
  }

  search <- .parse_search(search)
  splitRule <- .parse_splitRule(splitRule)

  k <- as.integer(k)
  if (k < 1)
    stop("Illegal k: needs to be k>=1!")

  ### dist search
  if (search == 3 && !inherits(x, "dist")) {
    if (.matrixlike(x))
      x <- dist(x)
    else
      stop("x needs to be a matrix to calculate distances")
  }

  ### get kNN from a dist object
  if (inherits(x, "dist")) {
    if (!is.null(query))
      stop("query can only be used if x contains a data matrix.")
    if (anyNA(x))
      stop("distances cannot be NAs for kNN!")
    return(dist_to_kNN(x, k = k))
  }

  ## make sure x is numeric
  if (!.matrixlike(x))
    stop("x needs to be a matrix to calculate distances")
  x <- as.matrix(x)
  if (storage.mode(x) == "integer")
    storage.mode(x) <- "double"
  if (storage.mode(x) != "double")
    stop("x has to be a numeric matrix.")

  if (!is.null(query)) {
    query <- as.matrix(query)
    if (storage.mode(query) == "integer")
      storage.mode(query) <- "double"
if (storage.mode(query) != "double") stop("query has to be NULL or a numeric matrix.") if (ncol(x) != ncol(query)) stop("x and query need to have the same number of columns!") } if (k >= nrow(x)) stop("Not enough neighbors in data set!") if (anyNA(x)) stop("data/distances cannot contain NAs for kNN (with kd-tree)!") ## returns NO self matches if (!is.null(query)) { ret <- kNN_query_int( as.matrix(x), as.matrix(query), as.integer(k), as.integer(search), as.integer(bucketSize), as.integer(splitRule), as.double(approx) ) dimnames(ret$dist) <- list(rownames(query), 1:k) dimnames(ret$id) <- list(rownames(query), 1:k) } else { ret <- kNN_int( as.matrix(x), as.integer(k), as.integer(search), as.integer(bucketSize), as.integer(splitRule), as.double(approx) ) dimnames(ret$dist) <- list(rownames(x), 1:k) dimnames(ret$id) <- list(rownames(x), 1:k) } class(ret) <- c("kNN", "NN") ### ANN already returns them sorted (by dist but not by ID) if (sort) ret <- sort(ret) ret$metric <- "euclidean" ret } # make sure we have a lower-triangle representation w/o diagonal .check_dist <- function(x) { if (!inherits(x, "dist")) stop("x needs to be a dist object") # cluster::dissimilarity does not have Diag or Upper attributes, but is a lower triangle # representation if (inherits(x, "dissimilarity")) return(TRUE) # check that dist objects have diag = FALSE, upper = FALSE if (attr(x, "Diag") || attr(x, "Upper")) stop("x needs to be a dist object with attributes Diag and Upper set to FALSE. 
Use as.dist(x, diag = FALSE, upper = FALSE) first.")
}

dist_to_kNN <- function(x, k) {
  .check_dist(x)

  n <- attr(x, "Size")
  id <- structure(integer(n * k), dim = c(n, k))
  d <- matrix(NA_real_, nrow = n, ncol = k)

  for (i in seq_len(n)) {
    ### Inf -> no self-matches
    y <- dist_row(x, i, self_val = Inf)
    o <- order(y, decreasing = FALSE)
    o <- o[seq_len(k)]
    id[i, ] <- o
    d[i, ] <- y[o]
  }

  dimnames(id) <- list(labels(x), seq_len(k))
  dimnames(d) <- list(labels(x), seq_len(k))

  ret <- structure(list(
    dist = d,
    id = id,
    k = k,
    sort = TRUE,
    metric = attr(x, "method")
  ),
  class = c("kNN", "NN"))
  return(ret)
}

#' @rdname kNN
#' @export
sort.kNN <- function(x, decreasing = FALSE, ...) {
  if (isTRUE(x$sort))
    return(x)
  if (is.null(x$dist))
    stop("Unable to sort. Distances are missing.")
  if (ncol(x$id) < 2) {
    x$sort <- TRUE
    return(x)
  }

  ## sort first by dist and break ties using id
  o <- vapply(
    seq_len(nrow(x$dist)),
    function(i)
      order(x$dist[i, ], x$id[i, ], decreasing = decreasing),
    integer(ncol(x$id))
  )
  for (i in seq_len(ncol(o))) {
    x$dist[i, ] <- x$dist[i, ][o[, i]]
    x$id[i, ] <- x$id[i, ][o[, i]]
  }

  x$sort <- TRUE
  x
}

#' @rdname kNN
#' @export
adjacencylist.kNN <- function(x, ...)
  lapply(
    seq_len(nrow(x$id)),
    FUN = function(i) {
      ## filter NAs
      tmp <- x$id[i, ]
      tmp[!is.na(tmp)]
    }
  )

#' @rdname kNN
#' @export
print.kNN <- function(x, ...)
{
  cat("k-nearest neighbors for ", nrow(x$id), " objects (k=", x$k, ").", "\n", sep = "")
  cat("Distance metric:", x$metric, "\n")
  cat("\nAvailable fields: ", toString(names(x)), "\n", sep = "")
}

# Convert names to integers for C++
.parse_search <- function(search) {
  search <- pmatch(toupper(search), c("KDTREE", "LINEAR", "DIST"))
  if (is.na(search))
    stop("Unknown NN search type!")
  search
}

.parse_splitRule <- function(splitRule) {
  splitRule <- pmatch(toupper(splitRule), .ANNsplitRule) - 1L
  if (is.na(splitRule))
    stop("Unknown splitRule!")
  splitRule
}
================================================ FILE: R/kNNdist.R ================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
# and Related Algorithms
# Copyright (C) 2015 Michael Hahsler
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Calculate and Plot k-Nearest Neighbor Distances
#'
#' Fast calculation of the k-nearest neighbor distances for a dataset
#' represented as a matrix of points. The kNN distance is defined as the
#' distance from a point to its kth nearest neighbor. The kNN distance plot
#' displays the kNN distance of all points sorted from smallest to largest. The
#' plot can be used to help find suitable parameter values for [dbscan()].
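For small data sets, this definition can be verified against a full distance matrix. A brute-force cross-check sketch (`kNNdist()` itself scales much better than this):

```r
library(dbscan)

x <- as.matrix(iris[, 1:4])
d <- as.matrix(dist(x))
diag(d) <- Inf  # exclude self-matches

# 4th smallest remaining distance in each row = 4-NN distance
manual <- apply(d, 1, function(row) sort(row)[4])

all.equal(unname(kNNdist(x, k = 4)), unname(manual))
```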
#'
#' @family Outlier Detection Functions
#' @family NN functions
#'
#' @param x the data set as a matrix of points (Euclidean distance is used) or
#' a precalculated [dist] object.
#' @param k number of nearest neighbors used for the distance calculation. For
#' `kNNdistplot()` also a range of values for `k` or `minPts` can be specified.
#' @param minPts to use a k-NN plot to determine a suitable `eps` value for [dbscan()],
#' `minPts` used in dbscan can be specified and will set `k = minPts - 1`.
#' @param all should a matrix with the distances to all k nearest neighbors be
#' returned?
#' @param ... further arguments (e.g., kd-tree related parameters) are passed
#' on to [kNN()].
#'
#' @return `kNNdist()` returns a numeric vector with the distance to its kth
#' nearest neighbor. If `all = TRUE` then a matrix with k columns
#' containing the distances to all 1st, 2nd, ..., kth nearest neighbors is
#' returned instead.
#'
#' @author Michael Hahsler
#' @keywords model plot
#' @examples
#' data(iris)
#' iris <- as.matrix(iris[, 1:4])
#'
#' ## Find the 4-NN distance for each observation (see ?kNN
#' ## for different search strategies)
#' kNNdist(iris, k = 4)
#'
#' ## Get a matrix with distances to the 1st, 2nd, ..., 4th NN.
#' kNNdist(iris, k = 4, all = TRUE)
#'
#' ## Produce a k-NN distance plot to determine a suitable eps for
#' ## DBSCAN with MinPts = 5. Use k = 4 (= MinPts - 1).
#' ## The knee is visible around a distance of .7
#' kNNdistplot(iris, k = 4)
#'
#' ## Look at all k-NN distance plots for a k of 1 to 20
#' ## Note that k-NN distances are increasing in k
#' kNNdistplot(iris, k = 1:20)
#'
#' cl <- dbscan(iris, eps = .7, minPts = 5)
#' pairs(iris, col = cl$cluster + 1L)
#' ## Note: black points are noise points
#' @export
kNNdist <- function(x, k, all = FALSE, ...) {
  kNNd <- kNN(x, k, sort = TRUE, ...)$dist
  if (!all)
    kNNd <- kNNd[, k]
  kNNd
}

#' @rdname kNNdist
#' @export
kNNdistplot <- function(x, k, minPts, ...)
{
  if (missing(k) && missing(minPts))
    stop("k or minPts need to be specified.")
  if (missing(k))
    k <- minPts - 1

  if (length(k) == 1) {
    kNNdist <- sort(kNNdist(x, k, ...))
    plot(
      kNNdist,
      type = "l",
      ylab = paste0(k, "-NN distance"),
      xlab = "Points sorted by distance"
    )
  } else {
    knnds <- vapply(k, function(i) sort(kNNdist(x, i, ...)), numeric(nrow(x)))
    matplot(knnds,
      type = "l",
      lty = 1,
      ylab = paste0("k-NN distance"),
      xlab = "Points sorted by distance")
  }
}
================================================ FILE: R/moons.R ================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
# and Related Algorithms
# Copyright (C) 2015 Michael Hahsler
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Moons Data
#'
#' Contains 100 2-d points, half of which are contained in two moons or
#' "blobs" (25 points each blob), and the other half in asymmetric facing
#' crescent shapes. The three shapes are all linearly separable.
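The shipped data set was generated in Python with scikit-learn (see the commands below); a rough base-R approximation of the two-crescents part, for illustration only (the helper and its offsets are hypothetical, not how the data was actually made):

```r
# Rough R analogue of scikit-learn's make_moons (hypothetical helper).
make_moons_r <- function(n = 50, noise = 0.05) {
  t <- seq(0, pi, length.out = n %/% 2)
  upper <- cbind(cos(t), sin(t))            # upper crescent
  lower <- cbind(1 - cos(t), 0.5 - sin(t))  # lower, shifted crescent
  pts <- rbind(upper, lower)
  pts + matrix(rnorm(length(pts), sd = noise), ncol = 2)
}

plot(make_moons_r(100), pch = 20)
```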
#' #' This data was generated with the following Python commands using the #' SciKit-Learn library: #' #' `> import sklearn.datasets as data` #' #' `> moons = data.make_moons(n_samples=50, noise=0.05)` #' #' `> blobs = data.make_blobs(n_samples=50, centers=[(-0.75,2.25), (1.0, 2.0)], cluster_std=0.25)` #' #' `> test_data = np.vstack([moons, blobs])` #' #' @name moons #' @docType data #' @format A data frame with 100 observations on the following 2 variables. #' \describe{ #' \item{X}{a numeric vector} #' \item{Y}{a numeric vector} } #' @references Pedregosa, Fabian, Gael Varoquaux, Alexandre Gramfort, #' Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel et al. #' Scikit-learn: Machine learning in Python. _Journal of Machine Learning #' Research_ 12, no. Oct (2011): 2825-2830. #' @source See the HDBSCAN notebook from github documentation: #' \url{http://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html} #' @keywords datasets #' @examples #' data(moons) #' plot(moons, pch=20) NULL ================================================ FILE: R/ncluster.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2015 Michael Hahsler, Matt Piekenbrock # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Number of Clusters, Noise Points, and Observations
#'
#' Extract the number of clusters or the number of noise points for
#' a clustering. This function works with any clustering result that
#' contains a list element named `cluster` with a clustering vector. In
#' addition, `nobs` (see [stats::nobs()]) is also available to retrieve
#' the number of clustered points.
#'
#' @name ncluster
#' @aliases ncluster nnoise nobs
#' @family clustering functions
#'
#' @param object a clustering result object containing a `cluster` element.
#' @param ... additional arguments are unused.
#'
#' @return returns the number of clusters or noise points.
#' @examples
#' data(iris)
#' iris <- as.matrix(iris[, 1:4])
#'
#' res <- dbscan(iris, eps = .7, minPts = 5)
#' res
#'
#' ncluster(res)
#' nnoise(res)
#' nobs(res)
#'
#' # the functions also work with kmeans and other clustering algorithms.
#' cl <- kmeans(iris, centers = 3)
#' ncluster(cl)
#' nnoise(cl)
#' nobs(cl)
#' @export
ncluster <- function(object, ...) {
  UseMethod("ncluster")
}

#' @export
ncluster.default <- function(object, ...) {
  if (!is.list(object) || !is.numeric(object$cluster))
    stop("ncluster() requires a clustering object with a cluster component containing the cluster labels.")
  length(setdiff(unique(object$cluster), 0L))
}

#' @rdname ncluster
#' @export
nnoise <- function(object, ...) {
  UseMethod("nnoise")
}

#' @export
nnoise.default <- function(object, ...)
{
  if (!is.list(object) || !is.numeric(object$cluster))
    stop("nnoise() requires a clustering object with a cluster component containing the cluster labels.")
  sum(object$cluster == 0L)
}
================================================ FILE: R/nobs.R ================================================
#' @importFrom stats nobs
#' @export
nobs.dbscan <- function(object, ...) length(object$cluster)

#' @export
nobs.hdbscan <- function(object, ...) length(object$cluster)

#' @export
nobs.general_clustering <- function(object, ...) length(object$cluster)
================================================ FILE: R/optics.R ================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
# and Related Algorithms
# Copyright (C) 2015 Michael Hahsler
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Ordering Points to Identify the Clustering Structure (OPTICS)
#'
#' Implementation of the OPTICS (Ordering points to identify the clustering
#' structure) point ordering algorithm using a kd-tree.
#'
#' **The algorithm**
#'
#' This implementation of OPTICS implements the original
#' algorithm as described by Ankerst et al (1999). OPTICS is an ordering
#' algorithm with methods to extract a clustering from the ordering.
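The two distances OPTICS is built on can be written down directly. A definitional sketch (these helpers are illustrative, not the package's C++ implementation; that the point itself counts towards `minPts`, hence `k = minPts - 1`, is an assumption made here):

```r
library(dbscan)

# Core distance: distance to the minPts-nearest neighbor
# (assuming the point itself counts as one of the minPts points).
core_dist <- function(x, minPts) kNNdist(x, k = minPts - 1)

# Reachability distance of point p from point o:
# at least the core distance of o, otherwise the actual distance.
reach_dist <- function(x, o, p, minPts) {
  d_op <- sqrt(sum((x[o, ] - x[p, ])^2))  # Euclidean distance o -> p
  max(core_dist(x, minPts)[o], d_op)
}
```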
#' While using similar concepts as DBSCAN, for OPTICS `eps`
#' is only an upper limit for the neighborhood size used to reduce
#' computational complexity. Note that `minPts` in OPTICS has a different
#' effect than in DBSCAN. It is used to define dense neighborhoods, but since
#' `eps` is typically set rather high, this does not affect the ordering
#' much. However, it is also used to calculate the reachability distance, and
#' larger values will make the reachability distance plot smoother.
#'
#' OPTICS linearly orders the data points such that points which are spatially
#' closest become neighbors in the ordering. The closest analog to this
#' ordering is the dendrogram in single-link hierarchical clustering. The algorithm
#' also calculates the reachability distance for each point.
#' `plot()` (see [reachability_plot])
#' produces a reachability plot which shows each point's reachability distance,
#' with the points sorted in OPTICS order. Valleys represent clusters (the
#' deeper the valley, the more dense the cluster) and high points indicate
#' points between clusters.
#'
#' **Specifying the data**
#'
#' If `x` is specified as a data matrix, then Euclidean distances and fast
#' nearest neighbor lookup using a kd-tree are used. See [kNN()] for
#' details on the parameters for the kd-tree.
#'
#' **Extracting a clustering**
#'
#' Several methods to extract a clustering from the order returned by OPTICS are
#' implemented:
#'
#' * `extractDBSCAN()` extracts a clustering from an OPTICS ordering that is
#' similar to what DBSCAN would produce with an eps set to `eps_cl` (see
#' Ankerst et al, 1999). The only difference to a DBSCAN clustering is that
#' OPTICS is not able to assign some border points and reports them instead as
#' noise.
#'
#' * `extractXi()` extracts clusters hierarchically, as specified in Ankerst et al
#' (1999), based on the steepness of the reachability plot.
One interpretation #' of the `xi` parameter is that it classifies clusters by change in #' relative cluster density. The used algorithm was originally contributed by #' the ELKI framework and is explained in Schubert et al (2018), but contains a #' set of fixes. #' #' **Predict cluster memberships** #' #' `predict()` requires an extracted DBSCAN clustering with `extractDBSCAN()` and then #' uses predict for `dbscan()`. #' #' @aliases optics OPTICS #' @family clustering functions #' #' @param x a data matrix or a [dist] object. #' @param eps upper limit of the size of the epsilon neighborhood. Limiting the #' neighborhood size improves performance and has no or very little impact on #' the ordering as long as it is not set too low. If not specified, the largest #' minPts-distance in the data set is used which gives the same result as #' infinity. #' @param minPts the parameter is used to identify dense neighborhoods and the #' reachability distance is calculated as the distance to the minPts nearest #' neighbor. Controls the smoothness of the reachability distribution. Default #' is 5 points. #' @param eps_cl Threshold to identify clusters (`eps_cl <= eps`). #' @param xi Steepness threshold to identify clusters hierarchically using the #' Xi method. #' @param object an object of class `optics`. #' @param minimum logical, representing whether or not to extract the minimal #' (non-overlapping) clusters in the Xi clustering algorithm. #' @param correctPredecessors logical, correct a common artifact by pruning #' the steep up area for points that have predecessors not in the #' cluster--found by the ELKI framework, see details below. #' @param ... additional arguments are passed on to fixed-radius nearest #' neighbor search algorithm. See [frNN()] for details on how to #' control the search strategy. #' @param cluster,predecessor plot clusters and predecessors. #' #' @return An object of class `optics` with components: #' \item{eps }{ value of `eps` parameter. 
}
#' \item{minPts }{ value of `minPts` parameter. }
#' \item{order }{ optics order for the data points in `x`. }
#' \item{reachdist }{ [reachability] distance for each data point in `x`. }
#' \item{coredist }{ core distance for each data point in `x`. }
#'
#' For `extractDBSCAN()`, in addition the following
#' components are available:
#' \item{eps_cl }{ the value of the `eps_cl` parameter. }
#' \item{cluster }{ assigned cluster labels in the order of the data points in `x`. }
#'
#' For `extractXi()`, in addition the following components
#' are available:
#' \item{xi}{ Steepness threshold `xi`. }
#' \item{cluster }{ assigned cluster labels in the order of the data points in `x`.}
#' \item{clusters_xi }{ data.frame containing the start and end of each cluster
#' found in the OPTICS ordering. }
#'
#' @author Michael Hahsler and Matthew Piekenbrock
#' @seealso Density [reachability].
#'
#' @references Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg
#' Sander (1999). OPTICS: Ordering Points To Identify the Clustering Structure.
#' _ACM SIGMOD international conference on Management of data._ ACM Press.
#' \doi{10.1145/304181.304187}
#'
#' Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based
#' Clustering with R. _Journal of Statistical Software_, 91(1), 1-30.
#' \doi{10.18637/jss.v091.i01}
#'
#' Erich Schubert, Michael Gertz (2018). Improving the Cluster Structure
#' Extracted from OPTICS Plots. In _Lernen, Wissen, Daten, Analysen (LWDA 2018),_
#' pp. 318-329.
#' @keywords model clustering #' @examples #' set.seed(2) #' n <- 400 #' #' x <- cbind( #' x = runif(4, 0, 1) + rnorm(n, sd = 0.1), #' y = runif(4, 0, 1) + rnorm(n, sd = 0.1) #' ) #' #' plot(x, col=rep(1:4, times = 100)) #' #' ### run OPTICS (Note: we use the default eps calculation) #' res <- optics(x, minPts = 10) #' res #' #' ### get order #' res$order #' #' ### plot produces a reachability plot #' plot(res) #' #' ### plot the order of points in the reachability plot #' plot(x, col = "grey") #' polygon(x[res$order, ]) #' #' ### extract a DBSCAN clustering by cutting the reachability plot at eps_cl #' res <- extractDBSCAN(res, eps_cl = .065) #' res #' #' plot(res) ## black is noise #' hullplot(x, res) #' #' ### re-cut at a higher eps threshold #' res <- extractDBSCAN(res, eps_cl = .07) #' res #' plot(res) #' hullplot(x, res) #' #' ### extract hierarchical clustering of varying density using the Xi method #' res <- extractXi(res, xi = 0.01) #' res #' #' plot(res) #' hullplot(x, res) #' #' # Xi cluster structure #' res$clusters_xi #' #' ### use OPTICS on a precomputed distance matrix #' d <- dist(x) #' res <- optics(d, minPts = 10) #' plot(res) #' @export optics <- function(x, eps = NULL, minPts = 5, ...) { ### find eps from minPts eps <- eps %||% max(kNNdist(x, k = minPts)) ### extra contains settings for frNN ### search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0 extra <- list(...) 
args <- c("search", "bucketSize", "splitRule", "approx") m <- pmatch(names(extra), args) if (anyNA(m)) stop("Unknown parameter: ", toString(names(extra)[is.na(m)])) names(extra) <- args[m] search <- .parse_search(extra$search %||% "kdtree") splitRule <- .parse_splitRule(extra$splitRule %||% "suggest") bucketSize <- as.integer(extra$bucketSize %||% 10L) approx <- as.integer(extra$approx %||% 0L) ### dist search if (search == 3L && !inherits(x, "dist")) { if (.matrixlike(x)) x <- dist(x) else stop("x needs to be a matrix to calculate distances") } ## for dist we provide the R code with a frNN list and no x frNN <- list() if (inherits(x, "dist")) { frNN <- frNN(x, eps, ...) ## add self match and use C numbering frNN$id <- lapply( seq_along(frNN$id), FUN = function(i) c(i - 1L, frNN$id[[i]] - 1L) ) frNN$dist <- lapply( seq_along(frNN$dist), FUN = function(i) c(0, frNN$dist[[i]]) ^ 2 ) x <- matrix() storage.mode(x) <- "double" } else{ if (!.matrixlike(x)) stop("x needs to be a matrix") ## make sure x is numeric x <- as.matrix(x) if (storage.mode(x) == "integer") storage.mode(x) <- "double" if (storage.mode(x) != "double") stop("x has to be a numeric matrix.") } if (length(frNN) == 0 && anyNA(x)) stop("data/distances cannot contain NAs for optics (with kd-tree)!") ret <- optics_int( as.matrix(x), as.double(eps), as.integer(minPts), as.integer(search), as.integer(bucketSize), as.integer(splitRule), as.double(approx), frNN ) ret$minPts <- minPts ret$eps <- eps ret$eps_cl <- NA_real_ ret$xi <- NA_real_ class(ret) <- "optics" ret } #' @rdname optics #' @export print.optics <- function(x, ...) { writeLines(c( paste0( "OPTICS ordering/clustering for ", length(x$order), " objects." ), paste0( "Parameters: ", "minPts = ", x$minPts, ", eps = ", x$eps, ", eps_cl = ", x$eps_cl, ", xi = ", x$xi ) )) if (!is.null(x$cluster)) { if (is.na(x$xi)) { writeLines(paste0( "The clustering contains ", ncluster(x), " cluster(s) and ", nnoise(x), " noise points." 
)) print(table(x$cluster)) } else { writeLines( paste0( "The clustering contains ", nrow(x$clusters_xi), " cluster(s) and ", nnoise(x), " noise points." ) ) } cat("\n") } writeLines(strwrap(paste0( "Available fields: ", toString(names(x)) ), exdent = 18)) } #' @rdname optics #' @export plot.optics <- function(x, cluster = TRUE, predecessor = FALSE, ...) { # OPTICS cluster extraction methods if (inherits(x$cluster, "xics") || all(c("start", "end", "cluster_id") %in% names(x$clusters_xi))) { # Sort clusters by size hclusters <- x$clusters_xi[order(x$clusters_xi$end - x$clusters_xi$start), ] # .1 means to leave 15% for the cluster lines def.par <- par(no.readonly = TRUE) par(mar = c(2, 4, 4, 2) + 0.1, omd = c(0, 1, .15, 1)) # Need to know how to spread out lines y_max <- max(x$reachdist[!is.infinite(x$reachdist)]) y_increments <- (y_max / 0.85 * .15) / (nrow(hclusters) + 1L) # Get top level cluster labels # top_level <- extractClusterLabels(x$clusters_xi, x$order) plot( as.reachability(x), col = x$cluster[x$order] + 1L, xlab = NA, xaxt = 'n', yaxs = "i", ylim = c(0, y_max), ... ) # Lines beneath plotting region indicating Xi clusters i <- seq_len(nrow(hclusters)) segments( x0 = hclusters$start[i], y0 = -(y_increments * i), x1 = hclusters$end[i], col = hclusters$cluster_id[i] + 1L, lwd = 2, xpd = NA ) ## Restore previous settings par(def.par) } else if (is.numeric(x$cluster) && !is.null(x$eps_cl)) { # Works for integers too ## extractDBSCAN clustering plot(as.reachability(x), col = x$cluster[x$order] + 1L, ...) lines( x = c(0, length(x$cluster)), y = c(x$eps_cl, x$eps_cl), col = "black", lty = 2 ) } else { # Regular reachability plot plot(as.reachability(x), ...) } } # Simple conversion between OPTICS objects and reachability objects #' @rdname optics #' @export as.reachability.optics <- function(object, ...) 
{
  structure(list(reachdist = object$reachdist, order = object$order),
    class = "reachability")
}

# Conversion between OPTICS objects and dendrograms
#' @rdname optics
#' @export
as.dendrogram.optics <- function(object, ...) {
  if (object$minPts > length(object$order)) {
    stop("'minPts' should be less than or equal to the number of points in the dataset.")
  }
  if (sum(is.infinite(object$reachdist)) > 1)
    stop(
      "Eps value is not large enough to capture the complete hierarchical structure of the dataset. Please use a larger eps value (such as Inf)."
    )
  as.dendrogram(as.reachability(object))
}

#' @rdname optics
#' @export
extractDBSCAN <- function(object, eps_cl) {
  if (!inherits(object, "optics"))
    stop("extractDBSCAN only accepts objects resulting from dbscan::optics!")

  reachdist <- object$reachdist[object$order]
  coredist <- object$coredist[object$order]
  n <- length(object$order)

  cluster <- integer(n)
  clusterid <- 0L ### 0 is noise

  for (i in 1:n) {
    if (reachdist[i] > eps_cl) {
      if (coredist[i] <= eps_cl) {
        clusterid <- clusterid + 1L
        cluster[i] <- clusterid
      } else {
        cluster[i] <- 0L ### noise
      }
    } else {
      cluster[i] <- clusterid
    }
  }

  object$eps_cl <- eps_cl
  object$xi <- NA_real_

  ### fix the order so cluster is in the same order as the rows in x
  cluster[object$order] <- cluster
  object$cluster <- cluster

  object
}

#' @rdname optics
#' @export
extractXi <- function(object, xi, minimum = FALSE, correctPredecessors = TRUE) {
  if (!inherits(object, "optics"))
    stop("extractXi only accepts objects resulting from dbscan::optics!")
  if (xi >= 1.0 || xi <= 0.0)
    stop("The Xi parameter must be in (0, 1)")

  # Initial variables
  object$ord_rd <- object$reachdist[object$order]
  object$ixi <- (1 - xi)
  SetOfSteepDownAreas <- list()
  SetOfClusters <- list()
  index <- 1
  mib <- 0
  sdaset <- list()

  while (index <= length(object$order)) {
    mib <- max(mib, object$ord_rd[index])
    if (!valid(index + 1, object))
      break

    # Test if this is a steep down area
    if (steepDown(index, object)) {
      # Update mib values with current mib and filter
      sdaset <-
updateFilterSDASet(mib, sdaset, object$ixi) startval <- object$ord_rd[index] mib <- 0 startsteep <- index endsteep <- index + 1 while (!is.na(object$order[index + 1])) { index <- index + 1 if (steepDown(index, object)) { endsteep <- index + 1 next } if (!steepDown(index, object, ixi = 1.0) || index - endsteep > object$minPts) break } sda <- list( s = startsteep, e = endsteep, maximum = startval, mib = 0 ) # print(paste("New steep down area:", toString(sda))) sdaset <- append(sdaset, list(sda)) next } if (steepUp(index, object)) { sdaset <- updateFilterSDASet(mib, sdaset, object$ixi) { startsteep <- index endsteep <- index + 1 mib <- object$ord_rd[index] esuccr <- if (!valid(index + 1, object)) Inf else object$ord_rd[index + 1] if (!is.infinite(esuccr)) { while (!is.na(object$order[index + 1])) { index <- index + 1 if (steepUp(index, object)) { endsteep <- index + 1 mib <- object$ord_rd[index] esuccr <- if (!valid(index + 1, object)) Inf else object$ord_rd[index + 1] if (is.infinite(esuccr)) { endsteep <- endsteep - 1 break } next } if (!steepUp(index, object, ixi = 1.0) || index - endsteep > object$minPts) break } } else { endsteep <- endsteep - 1 index <- index + 1 } sua <- list(s = startsteep, e = endsteep, maximum = esuccr) # print(paste("New steep up area:", toString(sua))) } for (sda in rev(sdaset)) { # Condition 3B if (mib * object$ixi < sda$mib) next # Default values cstart <- sda$s cend <- sua$e # Credit to ELKI if (correctPredecessors) { while (cend > cstart && is.infinite(object$ord_rd[cend])) { cend <- cend - 1 } } # Condition 4 { # Case b if (sda$maximum * object$ixi >= sua$maximum) { while (cstart < cend && object$ord_rd[cstart + 1] > sua$maximum) cstart <- cstart + 1 } # Case c else if (sua$maximum * object$ixi >= sda$maximum) { while (cend > cstart && object$ord_rd[cend - 1] > sda$maximum) cend <- cend - 1 } } # This NOT in the original article - credit to ELKI for finding this. # Ensure that the predecessor is in the current cluster. 
This filter
# removes common artifacts from the Xi method
if (correctPredecessors) {
while (cend > cstart) {
tmp2 <- object$predecessor[object$order[cend]]
if (!is.na(tmp2) &&
any(object$order[cstart:(cend - 1)] == tmp2, na.rm = TRUE))
break
# Not found.
cend <- cend - 1
}
}
# Ensure the last steep up point is not included if it's xi significant
if (steepUp(index - 1, object)) {
cend <- cend - 1
}
# obey minPts
if (cend - cstart + 1 < object$minPts)
next
SetOfClusters <- append(SetOfClusters, list(list(
start = cstart, end = cend
)))
next
}
} else {
index <- index + 1
}
}
# Remove aliases
object$ord_rd <- NULL
object$ixi <- NULL
# Keep xi parameter, disable any previous flat clustering parameter
object$xi <- xi
object$eps_cl <- NA_real_
# Zero-out clusters (only noise) if none found
if (length(SetOfClusters) == 0) {
warning(paste("No clusters were found with threshold:", xi))
object$clusters_xi <- NULL
object$cluster <- rep(0, length(object$cluster))
return(invisible(object))
}
# Cluster data exists; organize it by starting and ending index, give arbitrary id
object$clusters_xi <- do.call(rbind, SetOfClusters)
object$clusters_xi <- data.frame(
start = unlist(object$clusters_xi[, 1], use.names = FALSE),
end = unlist(object$clusters_xi[, 2], use.names = FALSE),
check.names = FALSE
)
object$clusters_xi <- object$clusters_xi[order(object$clusters_xi$start, object$clusters_xi$end), ]
object$clusters_xi <- cbind(object$clusters_xi,
list(cluster_id = seq_len(nrow(object$clusters_xi))))
row.names(object$clusters_xi) <- NULL
## Populate cluster vector with either:
## 1. 'top-level' cluster labels to aid in plotting
## 2.
'local' or non-overlapping cluster labels if minimum == TRUE object$cluster <- extractClusterLabels(object$clusters_xi, object$order, minimum = minimum) # Remove non-local clusters if minimum was specified if (minimum) { object$clusters_xi <- object$clusters_xi[sort(unique(object$cluster))[-1], ] } class(object$cluster) <- unique(append(class(object$cluster), "xics")) class(object$clusters_xi) <- unique(append(class(object$clusters_xi), "xics")) object } # Removes obsolete steep areas updateFilterSDASet <- function(mib, sdaset, ixi) { sdaset <- Filter(function(sda) sda$maximum * ixi > mib, sdaset) lapply(sdaset, function(sda) { if (mib > sda$mib) sda$mib <- mib sda }) } # Determines if the reachability distance at the current index 'i' is # (xi) significantly lower than the next index steepUp <- function(i, object, ixi = object$ixi) { if (is.infinite(object$ord_rd[i])) return(FALSE) if (!valid(i + 1, object)) return(TRUE) return(object$ord_rd[i] <= object$ord_rd[i + 1] * ixi) } # Determines if the reachability distance at the current index 'i' is # (xi) significantly higher than the next index steepDown <- function(i, object, ixi = object$ixi) { if (!valid(i + 1, object)) return(FALSE) if (is.infinite(object$ord_rd[i + 1])) return(FALSE) return(object$ord_rd[i] * ixi >= object$ord_rd[i + 1]) } # Determines if the reachability distance at the current index 'i' is a valid distance valid <- function(index, object) { return(!is.na(object$ord_rd[index])) } ### Extract clusters (minimum == T extracts clusters that do not contain other clusters) from a given ordering of points extractClusterLabels <- function(cl, order, minimum = FALSE) { ## Add cluster_id to clusters if (!all(c("start", "end") %in% names(cl))) stop("extractClusterLabels expects start and end references") if (!"cluster_id" %in% names(cl)) cl <- cbind(cl, cluster_id = seq_len(nrow(cl))) ## Sort cl based on minimum parameter / cluster size if (!"cluster_size" %in% names(cl)) cl <- cbind(cl, 
list(cluster_size = (cl$end - cl$start)))
cl <- if (minimum) {
cl[order(cl$cluster_size), ]
} else {
cl[order(-cl$cluster_size), ]
}
## Fill in the [cluster] vector with cluster IDs
clusters <- rep(0, length(order))
for (cid in cl$cluster_id) {
cluster <- cl[cl$cluster_id == cid, ]
if (minimum) {
if (all(clusters[cluster$start:cluster$end] == 0)) {
clusters[cluster$start:cluster$end] <- cid
}
} else
clusters[cluster$start:cluster$end] <- cid
}
# Fix the ordering
clusters[order] <- clusters
return(clusters)
}

================================================
FILE: R/pointdensity.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
# and Related Algorithms
# Copyright (C) 2017 Michael Hahsler
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Calculate Local Density at Each Data Point
#'
#' Calculate the local density at each data point either as the number of
#' points in the eps-neighborhood (as used in `dbscan()`) or via kernel density
#' estimation (KDE) with a uniform kernel. The function uses a kd-tree for fast
#' fixed-radius nearest neighbor search.
#'
#' `dbscan()` estimates the density around a point as the number of points in the
#' eps-neighborhood of the point (including the query point itself).
#' Kernel density estimation (KDE) with a uniform kernel is just this point
#' count in the eps-neighborhood divided by \eqn{(2\,eps\,n)}{(2 eps n)}, where
#' \eqn{n} is the number of points in `x`.
#'
#' Alternatively, `type = "gaussian"` calculates a Gaussian kernel estimate where
#' `eps` is used as the standard deviation. To speed up computation, a
#' kd-tree is used to find all points within 3 times the standard deviation and
#' these points are used for the estimate.
#'
#' Points with low local density often indicate noise (see e.g., Wishart (1969)
#' and Hartigan (1975)).
#'
#' @aliases pointdensity density
#' @family Outlier Detection Functions
#'
#' @param x a data matrix or a dist object.
#' @param eps radius of the eps-neighborhood, i.e., bandwidth of the uniform
#' kernel. For the Gaussian KDE, this parameter specifies the standard deviation of
#' the kernel.
#' @param type `"frequency"`, `"density"`, or `"gaussian"`; should the raw count of
#' points inside the eps-neighborhood, the eps-neighborhood density estimate,
#' or a Gaussian density estimate be returned?
#' @param search,bucketSize,splitRule,approx algorithmic parameters for
#' [frNN()].
#'
#' @return A vector of the same length as the number of data points (rows) in `x` with
#' the count or density values for each data point.
#'
#' @author Michael Hahsler
#' @seealso [frNN()], [stats::density()].
#' @references Wishart, D. (1969), Mode Analysis: A Generalization of Nearest
#' Neighbor which Reduces Chaining Effects, in _Numerical Taxonomy,_ Ed., A.J.
#' Cole, Academic Press, 282-311.
#'
#' John A. Hartigan (1975), _Clustering Algorithms,_ John Wiley & Sons, Inc.,
#' New York, NY, USA.
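The uniform-kernel estimate described in the details above (eps-neighborhood count, including the query point, divided by 2 * eps * n) can also be sketched brute force. The following standalone Python snippet is illustrative only: `pointdensity_uniform` is a hypothetical helper, not part of this package, and it skips the kd-tree search entirely.

```python
import numpy as np

def pointdensity_uniform(x, eps, type="frequency"):
    """Brute-force sketch of the uniform-kernel estimate: count the points
    within distance eps (the query point counts itself, since its
    self-distance is 0) and, for type = "density", divide by (2 * eps * n)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    n = len(x)
    # pairwise Euclidean distances (O(n^2); the package uses a kd-tree instead)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    counts = (d <= eps).sum(axis=1)
    return counts if type == "frequency" else counts / (2 * eps * n)
```

For example, with `eps = 0.5` and three points at (0, 0), (0, 0.1) and (10, 10), the frequency counts are 2, 2 and 1, and the density values are those counts divided by 2 * 0.5 * 3.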
#' @keywords model
#' @examples
#' set.seed(665544)
#' n <- 100
#' x <- cbind(
#'   x=runif(10, 0, 5) + rnorm(n, sd = 0.4),
#'   y=runif(10, 0, 5) + rnorm(n, sd = 0.4)
#' )
#' plot(x)
#'
#' ### calculate density around points
#' d <- pointdensity(x, eps = .5, type = "density")
#'
#' ### density distribution
#' summary(d)
#' hist(d, breaks = 10)
#'
#' ### plot with point size proportional to density
#' plot(x, pch = 19, main = "Density (eps = .5)", cex = d*5)
#'
#' ### Wishart (1969) single link clustering after removing low-density noise
#' # 1. remove noise with low density
#' f <- pointdensity(x, eps = .5, type = "frequency")
#' x_nonoise <- x[f >= 5,]
#'
#' # 2. use single-linkage on the non-noise points
#' hc <- hclust(dist(x_nonoise), method = "single")
#' plot(x, pch = 19, cex = .5)
#' points(x_nonoise, pch = 19, col = cutree(hc, k = 4) + 1L)
#' @export
pointdensity <- function(x, eps, type = "frequency", search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0) {
type <- match.arg(type, choices = c("frequency", "density", "gaussian"))
if (anyNA(x))
stop("missing values are not allowed in x.")
if (type == "gaussian")
return (.pointdensity_gaussian(x, sd = eps, search = search, bucketSize = bucketSize, splitRule = splitRule, approx = approx))
# regular dbscan density estimation
if (inherits(x, "dist")) {
nn <- frNN(
x,
eps,
sort = FALSE,
search = search,
bucketSize = bucketSize,
splitRule = splitRule,
approx = approx
)
d <- lengths(nn$id) + 1L
} else {
# faster implementation for a data matrix
search <- .parse_search(search)
splitRule <- .parse_splitRule(splitRule)
d <- dbscan_density_int(
as.matrix(x),
as.double(eps),
as.integer(search),
as.integer(bucketSize),
as.integer(splitRule),
as.double(approx)
)
}
if (type == "density")
d <- d / (2 * eps * nrow(x))
d
}
.pointdensity_gaussian <- function(x, sd, ...) {
### consider all points within 3 standard deviations
nn <- frNN(
x,
3 * sd,
sort = FALSE,
...
) sigma <- sd^2 d <- sapply(nn$dist, FUN = function(ds) sum(exp(-1 * ds^2 / (2 * sigma)))) d <- d / (length(d) * sd * 2 * pi) d } #gof <- function(x, eps, ...) { # d <- pointdensity(x, eps, ...) # 1/(d/mean(d)) #} ================================================ FILE: R/predict.R ================================================ ####################################################################### # dbscan - Density Based Clustering of Applications with Noise # and Related Algorithms # Copyright (C) 2017 Michael Hahsler # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along # with this program; if not, write to the Free Software Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. #' @rdname dbscan #' @param object clustering object. #' @param data the data set used to create the clustering object. #' @param newdata new data points for which the cluster membership should be #' predicted. #' @importFrom stats predict #' @export predict.dbscan_fast <- function (object, newdata, data, ...) { if (object$metric != "euclidean") warning("dbscan used non-Euclidean distances, predict assigns new points using Euclidean distances!") .predict_frNN(newdata, data, object$cluster, object$eps, ...) } #' @rdname optics #' @param object clustering object. #' @param data the data set used to create the clustering object. #' @param newdata new data points for which the cluster membership should be #' predicted. 
#' @export
predict.optics <- function (object, newdata, data, ...) {
if (is.null(object$cluster) ||
is.null(object$eps_cl) || is.na(object$eps_cl))
stop("no extracted clustering available in object! Run extractDBSCAN() first.")
.predict_frNN(newdata, data, object$cluster, object$eps_cl, ...)
}
#' @rdname hdbscan
#' @param object clustering object.
#' @param data the data set used to create the clustering object.
#' @param newdata new data points for which the cluster membership should be
#' predicted.
#' @export
predict.hdbscan <- function(object, newdata, data, ...) {
clusters <- object$cluster
if (is.null(newdata))
return(clusters)
# don't use noise
coredist <- object$coredist[clusters != 0]
data <- data[clusters != 0,]
clusters <- clusters[clusters != 0]
# find the nearest non-noise neighbor for each new point
nns <- kNN(data, query = newdata, k = 1)
# assign that neighbor's cluster if dist <= coredist of the neighbor
drop(ifelse(nns$dist > coredist[nns$id], 0L, clusters[nns$id]))
}
## find the cluster id of the closest NN in the eps neighborhood or return 0 otherwise.
.predict_frNN <- function(newdata, data, clusters, eps, ...) {
if (is.null(newdata))
return(clusters)
if (ncol(data) != ncol(newdata))
stop("Number of columns in data and newdata do not agree!")
if (nrow(data) != length(clusters))
stop("clustering does not agree with the number of data points in data.")
if (is.data.frame(data)) {
indx <- vapply(data, is.factor, logical(1L))
if (any(indx)) {
warning(
"data contains factors! The factors are converted to numbers and Euclidean distances are used."
)
}
data[indx] <- lapply(data[indx], as.numeric)
newdata[indx] <- lapply(newdata[indx], as.numeric)
}
# don't use noise
data <- data[clusters != 0,]
clusters <- clusters[clusters != 0]
# calculate the frNN between newdata and data (only keep entries for newdata)
nn <- frNN(data, query = newdata, eps = eps, sort = TRUE, ...)
vapply(
nn$id,
function(nns)
if (length(nns) == 0L) 0L else clusters[nns[1L]],
integer(1L)
)
}

================================================
FILE: R/reachability.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
# and Related Algorithms
# Copyright (C) 2015 Michael Hahsler, Matt Piekenbrock
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Reachability Distances
#'
#' Reachability distances can be plotted to show the hierarchical relationships between data points.
#' The idea was originally introduced by Ankerst et al (1999) for [OPTICS]. Later,
#' Sander et al (2003) showed that the visualization is useful for other hierarchical
#' structures and introduced an algorithm to convert a [dendrogram] representation into a
#' reachability plot.
#'
#' A reachability plot displays the points as vertical bars, where the height is the
#' reachability distance between two consecutive points.
#' The central idea behind reachability plots is that the ordering in which
#' points are plotted identifies the underlying hierarchical density
#' representation as mountains and valleys of high and low reachability distance.
#' The original ordering algorithm OPTICS as described by Ankerst et al (1999)
#' introduced the notion of reachability plots.
#'
#' OPTICS linearly orders the data points such that points
#' which are spatially closest become neighbors in the ordering. Valleys
#' represent clusters, which can be represented hierarchically. Although the
#' ordering is crucial to the structure of the reachability plot, it is important
#' to note that OPTICS, like DBSCAN, is not entirely deterministic and, just
#' like the dendrogram, isomorphisms may exist.
#'
#' Reachability plots were shown to essentially convey the same information as
#' the more traditional dendrogram structure by Sander et al (2003), and dendrograms
#' can be converted into reachability plots.
#'
#' Different hierarchical representations, such as dendrograms or reachability
#' plots, may be preferable depending on the context. In smaller datasets,
#' cluster memberships may be more easily identifiable through a dendrogram
#' representation, particularly if the user is already familiar with tree-like
#' representations. For larger datasets, however, a reachability plot may be
#' preferred for visualizing macro-level density relationships.
#'
#' A variety of cluster extraction methods have been proposed using
#' reachability plots. Because these cluster extraction methods depend directly on the
#' ordering OPTICS produces, they are part of the [optics()] interface.
#' Nonetheless, reachability plots can be created directly from other types of
#' linkage trees, and vice versa.
#'
#' _Note:_ The reachability distance for the first point is by definition not defined
#' (it has no preceding point).
#' Also, the reachability distances can be undefined when a point does not have enough
#' neighbors in the epsilon neighborhood. We represent these undefined cases as `Inf`
#' and represent them in the plot as a dashed line.
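As a toy illustration of reading clusters off a reachability plot, cutting the ordered reachability values at a fixed threshold assigns one cluster per valley. The standalone Python sketch below uses a hypothetical `cut_reachability` and is a simplification of what `extractDBSCAN()` does; the package version additionally uses core distances to label noise points.

```python
def cut_reachability(reachdist, eps_cl):
    """Assign one cluster id per valley of a reachability plot (values must
    already be in OPTICS order): every jump above the cut eps_cl starts a
    new cluster.  Simplified sketch; noise handling is omitted."""
    labels, cid = [], 0
    for r in reachdist:
        if r > eps_cl:  # reachability jumps above the cut: a new valley begins
            cid += 1
        labels.append(cid)
    return labels
```

For example, cutting the reachability values `[inf, 0.2, 0.3, 5.0, 0.1, 0.2]` at 1.0 yields the labels `[1, 1, 1, 2, 2, 2]`, i.e., two valleys.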
#'
#' @name reachability
#' @aliases reachability reachability_plot print.reachability
#'
#' @param object any object that can be coerced to class
#' `reachability`, such as an object of class [optics] or [stats::dendrogram].
#' @param x object of class `reachability`.
#' @param order_labels whether to plot text labels for each point's reachability
#' distance.
#' @param xlab x-axis label.
#' @param ylab y-axis label.
#' @param main Title of the plot.
#' @param ... graphical parameters are passed on to `plot()`,
#' or arguments for other methods.
#'
#' @return An object of class `reachability` with components:
#' \item{order }{order to use for the data points in `x`. }
#' \item{reachdist }{reachability distance for each data point in `x`. }
#'
#' @author Matthew Piekenbrock
#' @seealso [optics()], [as.dendrogram()], and [stats::hclust()].
#' @references Ankerst, M., M. M. Breunig, H.-P. Kriegel, J. Sander (1999).
#' OPTICS: Ordering Points To Identify the Clustering Structure. _ACM
#' SIGMOD international conference on Management of data._ ACM Press. pp.
#' 49--60.
#'
#' Sander, J., X. Qin, Z. Lu, N. Niu, and A. Kovarsky (2003). Automatic
#' extraction of clusters from hierarchical clustering representations.
#' _Pacific-Asia Conference on Knowledge Discovery and Data Mining._
#' Springer Berlin Heidelberg.
#' @keywords model clustering hierarchical clustering
#' @examples
#' set.seed(2)
#' n <- 20
#'
#' x <- cbind(
#'   x = runif(4, 0, 1) + rnorm(n, sd = 0.1),
#'   y = runif(4, 0, 1) + rnorm(n, sd = 0.1)
#' )
#'
#' plot(x, xlim = range(x), ylim = c(min(x) - sd(x), max(x) + sd(x)), pch = 20)
#' text(x = x, labels = seq_len(nrow(x)), pos = 3)
#'
#' ### run OPTICS
#' res <- optics(x, eps = 10, minPts = 2)
#' res
#'
#' ### plot produces a reachability plot.
#' plot(res)
#'
#' ### Manually extract reachability components from OPTICS
#' reach <- as.reachability(res)
#' reach
#'
#' ### plot still produces a reachability plot; point ids
#' ### (rows in the original data) can be displayed with order_labels = TRUE
#' plot(reach, order_labels = TRUE)
#'
#' ### Reachability objects can be directly converted to dendrograms
#' dend <- as.dendrogram(reach)
#' dend
#' plot(dend)
#'
#' ### A dendrogram can be converted back into a reachability object
#' plot(as.reachability(dend))
NULL
#' @rdname reachability
#' @export
print.reachability <- function(x, ...) {
avg_reach <- mean(x$reachdist[!is.infinite(x$reachdist)], na.rm = TRUE)
cat(
"Reachability plot collection for ",
length(x$order),
" objects.\n",
"Avg minimum reachability distance: ",
avg_reach,
"\n",
"Available Fields: order, reachdist",
sep = ""
)
}
#' @rdname reachability
#' @export
plot.reachability <- function(x, order_labels = FALSE, xlab = "Order", ylab = "Reachability dist.", main = "Reachability Plot", ...) {
if (is.null(x$order) || is.null(x$reachdist))
stop("reachability objects need 'reachdist' and 'order' fields")
reachdist <- x$reachdist[x$order]
plot(
reachdist,
xlab = xlab,
ylab = ylab,
main = main,
type = "h",
...
)
abline(v = which(is.infinite(reachdist)), lty = 3)
if (order_labels) {
text(
x = seq_along(x$order),
y = reachdist,
labels = x$order,
pos = 3
)
}
}
#' @rdname reachability
#' @export
as.reachability <- function(object, ...) UseMethod("as.reachability")
#' @rdname reachability
#' @export
as.reachability.dendrogram <- function(object, ...)
{
if (!inherits(object, "dendrogram"))
stop("The as.reachability method requires a dendrogram object.")
# Rcpp doesn't seem to import attributes well for vectors
fix_x <- dendrapply(object, function(leaf) {
new_leaf <- as.list(leaf)
attributes(new_leaf) <- attributes(leaf)
new_leaf
})
res <- dendrogram_to_reach(fix_x)
# Refix the ordering
res$reachdist <- res$reachdist[order(res$order)]
return(res)
}

================================================
FILE: R/sNN.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
# and Related Algorithms
# Copyright (C) 2017 Michael Hahsler
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

# number of shared nearest neighbors including the point itself.
#' Find Shared Nearest Neighbors
#'
#' Calculates the number of shared nearest neighbors
#' and creates a shared nearest neighbors graph.
#'
#' The number of shared nearest neighbors of two points p and q is the size of the
#' intersection of the kNN neighborhoods of the two points.
#' Note that each point is considered to be part
#' of its own kNN neighborhood.
#' The range for the shared nearest neighbors is
#' \eqn{[0, k]}. The result is an n-by-k matrix called `shared`.
#' Each row is a point and the columns are the point's k nearest neighbors.
#' The value is the count of the shared neighbors.
#'
#' The shared nearest neighbor graph connects a point with all its nearest neighbors
#' if they have at least one shared neighbor. The number of shared neighbors can be used
#' as an edge weight.
#' Jarvis and Patrick (1973) use a slightly
#' modified (see parameter `jp`) shared nearest neighbor graph for
#' clustering.
#'
#' @aliases sNN snn
#' @family NN functions
#'
#' @param x a data matrix, a [dist] object or a [kNN] object.
#' @param k number of neighbors to consider to calculate the shared nearest
#' neighbors.
#' @param kt minimum threshold on the number of shared nearest neighbors to
#' build the shared nearest neighbor graph. Edges are only preserved if
#' `kt` or more neighbors are shared.
#' @param jp In regular sNN graphs, two points that are not neighbors
#' can have shared neighbors.
#' Jarvis and Patrick (1973) require the two points to be neighbors; otherwise
#' the count is zeroed out. `TRUE` uses this behavior.
#' @param search nearest neighbor search strategy (one of `"kdtree"`, `"linear"` or
#' `"dist"`).
#' @param sort sort by the number of shared nearest neighbors? Note that this
#' is expensive and `sort = FALSE` is much faster. sNN objects can be
#' sorted using `sort()`.
#' @param bucketSize max size of the kd-tree leaves.
#' @param splitRule rule to split the kd-tree. One of `"STD"`, `"MIDPT"`, `"FAIR"`,
#' `"SL_MIDPT"`, `"SL_FAIR"` or `"SUGGEST"` (SL stands for sliding). `"SUGGEST"` uses
#' ANN's best guess.
#' @param approx use approximate nearest neighbors. All NN up to a distance of
#' a factor of `(1 + approx) eps` may be used. Some actual NN may be omitted
#' leading to spurious clusters and noise points. However, the algorithm will
#' enjoy a significant speedup.
#' @param decreasing logical; sort in decreasing order?
#' @param ... additional parameters are passed on.
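The shared-neighbor count described above can be sketched brute force as a plain set intersection. The standalone Python snippet below is illustrative only: `snn_shared` is a hypothetical helper, not the package's C++ `SNN_sim_int`, and tie breaking and self-inclusion details may differ slightly from the `shared` matrix it produces.

```python
import numpy as np

def snn_shared(x, k):
    """Brute-force shared nearest neighbors: N(i) is the set of i's k nearest
    neighbors plus i itself; for every neighbor j of i the shared count is
    |N(i) & N(j)|.  Sketch only; edge cases may differ from dbscan's C++ code."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not returned as its own kNN
    knn = np.argsort(d, axis=1, kind="stable")[:, :k]
    nbhd = [set(row) | {i} for i, row in enumerate(knn)]
    return {(i, int(j)): len(nbhd[i] & nbhd[int(j)])
            for i in range(len(x)) for j in knn[i]}
```

With the 1-D points `0, 1, 2, 3, 4` and `k = 2`, for instance, points 0 and 2 share the neighbors {1, 2} of point 0's augmented neighborhood, giving a count of 2.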
#' @return An object of class `sNN` (subclass of [kNN] and [NN]) containing a list
#' with the following components:
#' \item{id }{a matrix with ids. }
#' \item{dist}{a matrix with the distances. }
#' \item{shared }{a matrix with the number of shared nearest neighbors. }
#' \item{k }{the number of neighbors `k` used. }
#' \item{metric }{the distance metric used. }
#'
#' @author Michael Hahsler
#' @references R. A. Jarvis and E. A. Patrick. 1973. Clustering Using a
#' Similarity Measure Based on Shared Near Neighbors. _IEEE Trans. Comput._
#' 22, 11 (November 1973), 1025-1034.
#' \doi{10.1109/T-C.1973.223640}
#' @keywords model
#' @examples
#' data(iris)
#' x <- iris[, -5]
#'
#' # find kNN and add the number of shared nearest neighbors.
#' k <- 5
#' nn <- sNN(x, k = k)
#' nn
#'
#' # shared nearest neighbor distribution
#' table(as.vector(nn$shared))
#'
#' # explore number of shared points for the k-neighborhood of point 10
#' i <- 10
#' nn$shared[i,]
#'
#' plot(nn, x)
#'
#' # apply a threshold to create a sNN graph with edges
#' # if more than 3 neighbors are shared.
#' nn_3 <- sNN(nn, kt = 3) #' plot(nn_3, x) #' #' # get an adjacency list for the shared nearest neighbor graph #' adjacencylist(nn_3) #' @export sNN <- function(x, k, kt = NULL, jp = FALSE, sort = TRUE, search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0) { if (missing(k)) k <- x$k if (inherits(x, "kNN")) { if (k != x$k) { if (ncol(x$id) < k) stop("kNN object does not contain enough neighbors!") if (!x$sort) x <- sort.kNN(x) x$id <- x$id[, 1:k] x$dist <- x$dist[, 1:k] x$k <- k } } else x <- kNN( x, k, sort = FALSE, search = search, bucketSize = bucketSize, splitRule = splitRule, approx = approx ) x$shared <- SNN_sim_int(x$id, as.logical(jp[1])) x$sort_shared <- FALSE class(x) <- c("sNN", "kNN", "NN") if (sort) x <- sort.sNN(x) x$kt <- kt if (!is.null(kt)) { if (kt > k) stop("kt needs to be less than k.") rem <- x$shared < kt x$id[rem] <- NA x$dist[rem] <- NA x$shared[rem] <- NA } x } #' @rdname sNN #' @export sort.sNN <- function(x, decreasing = TRUE, ...) { if (isTRUE(x$sort_shared)) return(x) if (is.null(x$shared)) stop("Unable to sort. Number of shared neighbors is missing.") if (ncol(x$id) < 2) { x$sort <- TRUE x$sort_shared <- TRUE return(x) } ## sort first by number of shared points (decreasing) and break ties by id (increasing) k <- ncol(x$shared) o <- vapply( seq_len(nrow(x$shared)), function(i) order(k - x$shared[i, ], x$id[i, ], decreasing = !decreasing), integer(k) ) for (i in seq_len(ncol(o))) { x$shared[i, ] <- x$shared[i, ][o[, i]] x$dist[i, ] <- x$dist[i, ][o[, i]] x$id[i, ] <- x$id[i, ][o[, i]] } x$sort <- FALSE x$sort_shared <- TRUE x } #' @rdname sNN #' @export print.sNN <- function(x, ...) 
{
cat(
"shared-nearest neighbors for ",
nrow(x$id),
" objects (k=",
x$k,
", kt=",
x$kt %||% "NULL",
").",
"\n",
sep = ""
)
cat("Available fields: ", toString(names(x)), "\n", sep = "")
}

================================================
FILE: R/sNNclust.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
# and Related Algorithms
# Copyright (C) 2017 Michael Hahsler
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Shared Nearest Neighbor Clustering
#'
#' Implements the shared nearest neighbor clustering algorithm by Ertoz,
#' Steinbach and Kumar (2003).
#'
#' **Algorithm:**
#'
#' 1. Construct a shared nearest neighbor graph for a given k. The edge
#' weights are the number of shared k nearest neighbors (in the range of
#' \eqn{[0, k]}).
#'
#' 2. Find each point's SNN density, i.e., the number of points which have a
#' similarity of `eps` or greater.
#'
#' 3. Find the core points, i.e., all points that have an SNN density greater
#' than `minPts`.
#'
#' 4. Form clusters from the core points and assign border points (i.e.,
#' non-core points which share at least `eps` neighbors with a core point).
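The four steps above can be sketched as a short standalone Python program. This is illustrative only: `snn_clust` is a hypothetical stand-in, it restricts the graph to mutual kNN pairs (Jarvis-Patrick style) and treats an SNN density of at least `min_pts` as core, so details may differ from the package's C++-backed implementation.

```python
import numpy as np
from collections import deque

def snn_clust(x, k, eps, min_pts):
    """Sketch of SNN clustering (Ertoz, Steinbach & Kumar 2003):
    1. sNN graph over mutual kNN pairs with >= eps shared neighbors,
    2./3. a point is core if it has at least min_pts such links,
    4. grow clusters from core points; linked non-core points become
    border points; label 0 is noise.  Illustrative sketch only."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    n = len(x)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = [set(row) for row in np.argsort(d, axis=1, kind="stable")[:, :k]]
    # 1. edges between mutual kNN pairs that share at least eps neighbors
    adj = [[j for j in knn[i] if i in knn[j] and len(knn[i] & knn[j]) >= eps]
           for i in range(n)]
    # 2./3. SNN density and core points
    core = [len(adj[i]) >= min_pts for i in range(n)]
    # 4. breadth-first cluster growth from unlabeled core points
    labels, cid = [0] * n, 0
    for i in range(n):
        if not core[i] or labels[i] != 0:
            continue
        cid += 1
        labels[i] = cid
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in adj[p]:
                if labels[q] == 0:
                    labels[q] = cid
                    if core[q]:  # border points are labeled but not expanded
                        queue.append(q)
    return labels
```

On two well-separated 2x2 grids of points, this sketch recovers the two blobs as clusters 1 and 2.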
#'
#' Note that steps 2-4 are equivalent to the DBSCAN algorithm (see [dbscan()])
#' and that `eps` has a different meaning than for DBSCAN. Here it is
#' a threshold on the number of shared neighbors (see [sNN()])
#' which defines a similarity.
#'
#' @aliases sNNclust snnclust
#' @family clustering functions
#'
#' @param x a data matrix/data.frame (Euclidean distance is used), a
#' precomputed [dist] object or a kNN object created with [kNN()].
#' @param k Neighborhood size for nearest neighbor sparsification to create the
#' shared NN graph.
#' @param eps Two objects are only reachable from each other if they share at
#' least `eps` nearest neighbors. Note: this is different from the `eps` in DBSCAN!
#' @param minPts minimum number of points that share at least `eps`
#' nearest neighbors for a point to be considered a core point.
#' @param borderPoints should border points be assigned to clusters like in
#' [DBSCAN]?
#' @param ... additional arguments are passed on to the k nearest neighbor
#' search algorithm. See [kNN()] for details on how to control the
#' search strategy.
#'
#' @return An object of class `general_clustering` with the following
#' components:
#' \item{cluster }{An integer vector with cluster assignments. Zero
#' indicates noise points.}
#' \item{type }{ name of the clustering algorithm used.}
#' \item{param }{ list of the clustering parameters used. }
#'
#' @author Michael Hahsler
#'
#' @references Levent Ertoz, Michael Steinbach, Vipin Kumar, Finding Clusters
#' of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data,
#' _SIAM International Conference on Data Mining,_ 2003, 47-59.
#' \doi{10.1137/1.9781611972733.5}
#' @keywords model clustering
#' @examples
#' data("DS3")
#'
#' # Out of the k = 20 NN, 7 (eps) have to be shared to create a link in the sNN graph.
#' # A point needs at least 16 (minPts) links in the sNN graph to be a core point.
#' # Noise points have cluster id 0 and are shown in black.
#' cl <- sNNclust(DS3, k = 20, eps = 7, minPts = 16) #' cl #' #' clplot(DS3, cl) #' #' @export sNNclust <- function(x, k, eps, minPts, borderPoints = TRUE, ...) { nn <- sNN(x, k = k, jp = TRUE, ...) # convert into a frNN object which already enforces eps nn_list <- lapply(seq_len(nrow(nn$id)), FUN = function(i) unname(nn$id[i, nn$shared[i, ] >= eps])) snn <- structure(list(id = nn_list, eps = eps, metric = nn$metric), class = c("NN", "frNN")) # run dbscan cl <- dbscan(snn, minPts = minPts, borderPoints = borderPoints) structure(list(cluster = cl$cluster, type = "SharedNN clustering", param = list(k = k, eps = eps, minPts = minPts, borderPoints = borderPoints), metric = cl$metric), class = "general_clustering") } ================================================ FILE: R/utils.R ================================================ `%||%` <- function(x, y) { if (is.null(x)) y else x } ================================================ FILE: R/zzz.R ================================================ # ANN uses a global KD_TRIVIAL structure which needs to be removed. .onUnload <- function(libpath) { ANN_cleanup() #cat("Cleaning up after ANN.\n") } ================================================ FILE: README.Rmd ================================================ --- output: github_document bibliography: vignettes/dbscan.bib link-citations: yes --- ```{r echo=FALSE, results = 'asis'} pkg <- 'dbscan' source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R") pkg_title(pkg, anaconda = "r-dbscan", stackoverflow = "dbscan%2br") ``` ## Introduction This R package [@hahsler2019dbscan] provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes: __Clustering__ - __DBSCAN:__ Density-based spatial clustering of applications with noise [@ester1996density]. 
- __Jarvis-Patrick Clustering__: Clustering using a similarity measure based on shared near neighbors [@jarvis1973]. - __SNN Clustering__: Shared nearest neighbor clustering [@erdoz2003]. - __HDBSCAN:__ Hierarchical DBSCAN with simplified hierarchy extraction [@campello2015hierarchical]. - __FOSC:__ Framework for optimal selection of clusters for unsupervised and semisupervised clustering of a hierarchical cluster tree [@campello2013density]. - __OPTICS/OPTICSXi:__ Ordering points to identify the clustering structure and cluster extraction methods [@ankerst1999optics]. __Outlier Detection__ - __LOF:__ Local outlier factor algorithm [@breunig2000lof]. - __GLOSH:__ Global-Local Outlier Score from Hierarchies algorithm [@campello2015hierarchical]. __Cluster Evaluation__ - __DBCV:__ Density-based clustering validation [@moulavi2014]. __Fast Nearest-Neighbor Search (using kd-trees)__ - __kNN search__ - __Fixed-radius NN search__ The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search and, for Euclidean distance, are typically faster than the native R implementations (e.g., dbscan in package `fpc`) or the implementations in [WEKA](https://ml.cms.waikato.ac.nz/weka/), [ELKI](https://elki-project.github.io/) and [Python's scikit-learn](https://scikit-learn.org/). ```{r echo=FALSE, results = 'asis'} pkg_usage(pkg) pkg_citation(pkg, 2) pkg_install(pkg) ``` ## Usage Load the package and use the numeric variables in the iris dataset. ```{r} library("dbscan") data("iris") x <- as.matrix(iris[, 1:4]) ``` DBSCAN ```{r} db <- dbscan(x, eps = .42, minPts = 5) db ``` Visualize the resulting clustering (noise points are shown in black).
```{r dbscan} pairs(x, col = db$cluster + 1L) ``` OPTICS ```{r} opt <- optics(x, eps = 1, minPts = 4) opt ``` Extract DBSCAN-like clustering from OPTICS and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored). ```{r OPTICS_extractDBSCAN, fig.height=3} opt <- extractDBSCAN(opt, eps_cl = .4) plot(opt) ``` HDBSCAN ```{r} hdb <- hdbscan(x, minPts = 4) hdb ``` Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters. ```{r hdbscan, fig.height=4} plot(hdb, show_flat = TRUE) ``` ## Using dbscan with tidyverse `dbscan` provides `tidy()`, `augment()`, and `glance()` for all clustering algorithms, so they can be easily used with the tidyverse, ggplot2 and [tidymodels](https://www.tidymodels.org/learn/statistics/k-means/). ```{r tidyverse, message=FALSE, warning=FALSE} library(tidyverse) db <- x %>% dbscan(eps = .42, minPts = 5) ``` Get cluster statistics as a tibble. ```{r tidyverse2} tidy(db) ``` Visualize the clustering with ggplot2 (use an x for noise points). ```{r tidyverse3} augment(db, x) %>% ggplot(aes(x = Petal.Length, y = Petal.Width)) + geom_point(aes(color = .cluster, shape = noise)) + scale_shape_manual(values=c(19, 4)) ``` ## Using dbscan from Python R, the R package `dbscan`, and the Python package `rpy2` need to be installed.
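If `rpy2` is missing, it can typically be installed from PyPI (a setup sketch assuming `pip` is on the PATH; a conda-forge package for `rpy2` exists as well):

```shell
# install the rpy2 Python-R bridge used in the example below (assumes pip is available)
pip install rpy2
```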
```{python, eval = FALSE, python.reticulate = FALSE} import pandas as pd import numpy as np ### prepare data iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']) iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']] # get R dbscan package from rpy2.robjects import packages dbscan = packages.importr('dbscan') # enable automatic conversion of pandas dataframes to R dataframes from rpy2.robjects import pandas2ri pandas2ri.activate() db = dbscan.dbscan(iris_numeric, eps = 0.5, MinPts = 5) print(db) ``` ``` ## DBSCAN clustering for 150 objects. ## Parameters: eps = 0.5, minPts = 5 ## Using euclidean distances and borderpoints = TRUE ## The clustering contains 2 cluster(s) and 17 noise points. ## ## 0 1 2 ## 17 49 84 ## ## Available fields: cluster, eps, minPts, dist, borderPoints ``` ```{python, eval = FALSE, python.reticulate = FALSE} # get the cluster assignment vector labels = np.array(db.rx('cluster')) labels ``` ``` ## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ## 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, ## 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, ## 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, ## 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0, ## 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, ## 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]], ## dtype=int32) ``` ## License The dbscan package is licensed under the [GNU General Public License (GPL) Version 3](https://www.gnu.org/licenses/gpl-3.0.en.html). The __OPTICSXi__ R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with permission by the original author, Erich Schubert. 
## Changes * List of changes from [NEWS.md](https://github.com/mhahsler/dbscan/blob/master/NEWS.md) ## References ================================================ FILE: README.md ================================================ # R package dbscan - Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms [![Package on CRAN](https://www.r-pkg.org/badges/version/dbscan)](https://CRAN.R-project.org/package=dbscan) [![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/dbscan)](https://CRAN.R-project.org/package=dbscan) ![License](https://img.shields.io/cran/l/dbscan) [![Anaconda.org](https://anaconda.org/conda-forge/r-dbscan/badges/version.svg)](https://anaconda.org/conda-forge/r-dbscan) [![r-universe status](https://mhahsler.r-universe.dev/badges/dbscan)](https://mhahsler.r-universe.dev/dbscan) [![StackOverflow](https://img.shields.io/badge/stackoverflow-dbscan%2br-orange.svg)](https://stackoverflow.com/questions/tagged/dbscan%2br) ## Introduction This R package ([Hahsler, Piekenbrock, and Doran 2019](#ref-hahsler2019dbscan)) provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes: **Clustering** - **DBSCAN:** Density-based spatial clustering of applications with noise ([Ester et al. 1996](#ref-ester1996density)). - **Jarvis-Patrick Clustering**: Clustering using a similarity measure based on shared near neighbors ([Jarvis and Patrick 1973](#ref-jarvis1973)). - **SNN Clustering**: Shared nearest neighbor clustering ([Ertöz, Steinbach, and Kumar 2003](#ref-erdoz2003)). - **HDBSCAN:** Hierarchical DBSCAN with simplified hierarchy extraction ([Campello et al. 2015](#ref-campello2015hierarchical)). - **FOSC:** Framework for optimal selection of clusters for unsupervised and semisupervised clustering of a hierarchical cluster tree ([Campello, Moulavi, and Sander 2013](#ref-campello2013density)).
- **OPTICS/OPTICSXi:** Ordering points to identify the clustering structure and cluster extraction methods ([Ankerst et al. 1999](#ref-ankerst1999optics)). **Outlier Detection** - **LOF:** Local outlier factor algorithm ([Breunig et al. 2000](#ref-breunig2000lof)). - **GLOSH:** Global-Local Outlier Score from Hierarchies algorithm ([Campello et al. 2015](#ref-campello2015hierarchical)). **Cluster Evaluation** - **DBCV:** Density-based clustering validation ([Moulavi et al. 2014](#ref-moulavi2014)). **Fast Nearest-Neighbor Search (using kd-trees)** - **kNN search** - **Fixed-radius NN search** The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search and, for Euclidean distance, are typically faster than the native R implementations (e.g., dbscan in package `fpc`) or the implementations in [WEKA](https://ml.cms.waikato.ac.nz/weka/), [ELKI](https://elki-project.github.io/) and [Python’s scikit-learn](https://scikit-learn.org/). The following R packages use `dbscan`: [AnimalSequences](https://CRAN.R-project.org/package=AnimalSequences), [bioregion](https://CRAN.R-project.org/package=bioregion), [clayringsmiletus](https://CRAN.R-project.org/package=clayringsmiletus), [CLONETv2](https://CRAN.R-project.org/package=CLONETv2), [clusterWebApp](https://CRAN.R-project.org/package=clusterWebApp), [cordillera](https://CRAN.R-project.org/package=cordillera), [CPC](https://CRAN.R-project.org/package=CPC), [crosshap](https://CRAN.R-project.org/package=crosshap), [crownsegmentr](https://CRAN.R-project.org/package=crownsegmentr), [CspStandSegmentation](https://CRAN.R-project.org/package=CspStandSegmentation), [daltoolbox](https://CRAN.R-project.org/package=daltoolbox), [DataSimilarity](https://CRAN.R-project.org/package=DataSimilarity), [diceR](https://CRAN.R-project.org/package=diceR), [dobin](https://CRAN.R-project.org/package=dobin), [doc2vec](https://CRAN.R-project.org/package=doc2vec),
[dPCP](https://CRAN.R-project.org/package=dPCP), [emcAdr](https://CRAN.R-project.org/package=emcAdr), [eventstream](https://CRAN.R-project.org/package=eventstream), [evprof](https://CRAN.R-project.org/package=evprof), [fastml](https://CRAN.R-project.org/package=fastml), [FCPS](https://CRAN.R-project.org/package=FCPS), [flowcluster](https://CRAN.R-project.org/package=flowcluster), [funtimes](https://CRAN.R-project.org/package=funtimes), [FuzzyDBScan](https://CRAN.R-project.org/package=FuzzyDBScan), [HaploVar](https://CRAN.R-project.org/package=HaploVar), [immunaut](https://CRAN.R-project.org/package=immunaut), [karyotapR](https://CRAN.R-project.org/package=karyotapR), [ksharp](https://CRAN.R-project.org/package=ksharp), [LLMing](https://CRAN.R-project.org/package=LLMing), [LOMAR](https://CRAN.R-project.org/package=LOMAR), [maotai](https://CRAN.R-project.org/package=maotai), [MapperAlgo](https://CRAN.R-project.org/package=MapperAlgo), [metaCluster](https://CRAN.R-project.org/package=metaCluster), [metasnf](https://CRAN.R-project.org/package=metasnf), [mlr3cluster](https://CRAN.R-project.org/package=mlr3cluster), [neuroim2](https://CRAN.R-project.org/package=neuroim2), [oclust](https://CRAN.R-project.org/package=oclust), [omicsTools](https://CRAN.R-project.org/package=omicsTools), [openSkies](https://CRAN.R-project.org/package=openSkies), [opticskxi](https://CRAN.R-project.org/package=opticskxi), [OTclust](https://CRAN.R-project.org/package=OTclust), [outlierensembles](https://CRAN.R-project.org/package=outlierensembles), [outlierMBC](https://CRAN.R-project.org/package=outlierMBC), [pagoda2](https://CRAN.R-project.org/package=pagoda2), [parameters](https://CRAN.R-project.org/package=parameters), [ParBayesianOptimization](https://CRAN.R-project.org/package=ParBayesianOptimization), [performance](https://CRAN.R-project.org/package=performance), [PiC](https://CRAN.R-project.org/package=PiC), [rcrisp](https://CRAN.R-project.org/package=rcrisp), 
[rMultiNet](https://CRAN.R-project.org/package=rMultiNet), [seriation](https://CRAN.R-project.org/package=seriation), [sfdep](https://CRAN.R-project.org/package=sfdep), [sfnetworks](https://CRAN.R-project.org/package=sfnetworks), [sharp](https://CRAN.R-project.org/package=sharp), [smotefamily](https://CRAN.R-project.org/package=smotefamily), [snap](https://CRAN.R-project.org/package=snap), [spdep](https://CRAN.R-project.org/package=spdep), [spNetwork](https://CRAN.R-project.org/package=spNetwork), [ssMRCD](https://CRAN.R-project.org/package=ssMRCD), [stream](https://CRAN.R-project.org/package=stream), [SuperCell](https://CRAN.R-project.org/package=SuperCell), [synr](https://CRAN.R-project.org/package=synr), [tidySEM](https://CRAN.R-project.org/package=tidySEM), [VBphenoR](https://CRAN.R-project.org/package=VBphenoR), [VIProDesign](https://CRAN.R-project.org/package=VIProDesign), [weird](https://CRAN.R-project.org/package=weird) To cite package ‘dbscan’ in publications use: > Hahsler M, Piekenbrock M, Doran D (2019). “dbscan: Fast Density-Based > Clustering with R.” *Journal of Statistical Software*, *91*(1), 1-30. > <doi:10.18637/jss.v091.i01>. @Article{, title = {{dbscan}: Fast Density-Based Clustering with {R}}, author = {Michael Hahsler and Matthew Piekenbrock and Derek Doran}, journal = {Journal of Statistical Software}, year = {2019}, volume = {91}, number = {1}, pages = {1--30}, doi = {10.18637/jss.v091.i01}, } ## Installation **Stable CRAN version:** Install from within R with ``` r install.packages("dbscan") ``` **Current development version:** Install from [r-universe](https://mhahsler.r-universe.dev/dbscan). ``` r install.packages("dbscan", repos = c("https://mhahsler.r-universe.dev", "https://cloud.r-project.org/")) ``` ## Usage Load the package and use the numeric variables in the iris dataset. ``` r library("dbscan") data("iris") x <- as.matrix(iris[, 1:4]) ``` DBSCAN ``` r db <- dbscan(x, eps = 0.42, minPts = 5) db ``` ## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.42, minPts = 5 ## Using euclidean distances and borderpoints = TRUE ## The clustering contains 3 cluster(s) and 29 noise points. ## ## 0 1 2 3 ## 29 48 37 36 ## ## Available fields: cluster, eps, minPts, metric, borderPoints Visualize the resulting clustering (noise points are shown in black). ``` r pairs(x, col = db$cluster + 1L) ``` ![](inst/README_files/dbscan-1.png) OPTICS ``` r opt <- optics(x, eps = 1, minPts = 4) opt ``` ## OPTICS ordering/clustering for 150 objects. ## Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = NA ## Available fields: order, reachdist, coredist, predecessor, minPts, eps, ## eps_cl, xi Extract DBSCAN-like clustering from OPTICS and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored). ``` r opt <- extractDBSCAN(opt, eps_cl = 0.4) plot(opt) ``` ![](inst/README_files/OPTICS_extractDBSCAN-1.png) HDBSCAN ``` r hdb <- hdbscan(x, minPts = 4) hdb ``` ## HDBSCAN clustering for 150 objects. ## Parameters: minPts = 4 ## The clustering contains 2 cluster(s) and 0 noise points. ## ## 1 2 ## 100 50 ## ## Available fields: cluster, minPts, coredist, cluster_scores, ## membership_prob, outlier_scores, hc Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters. ``` r plot(hdb, show_flat = TRUE) ``` ![](inst/README_files/hdbscan-1.png) ## Using dbscan with tidyverse `dbscan` provides `tidy()`, `augment()`, and `glance()` for all clustering algorithms, so they can be easily used with the tidyverse, ggplot2 and [tidymodels](https://www.tidymodels.org/learn/statistics/k-means/).
``` r library(tidyverse) db <- x %>% dbscan(eps = 0.42, minPts = 5) ``` Get cluster statistics as a tibble. ``` r tidy(db) ``` ## # A tibble: 4 × 3 ## cluster size noise ## ## 1 0 29 TRUE ## 2 1 48 FALSE ## 3 2 37 FALSE ## 4 3 36 FALSE Visualize the clustering with ggplot2 (use an x for noise points). ``` r augment(db, x) %>% ggplot(aes(x = Petal.Length, y = Petal.Width)) + geom_point(aes(color = .cluster, shape = noise)) + scale_shape_manual(values = c(19, 4)) ``` ![](inst/README_files/tidyverse3-1.png) ## Using dbscan from Python R, the R package `dbscan`, and the Python package `rpy2` need to be installed. ``` python import pandas as pd import numpy as np ### prepare data iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']) iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']] # get R dbscan package from rpy2.robjects import packages dbscan = packages.importr('dbscan') # enable automatic conversion of pandas dataframes to R dataframes from rpy2.robjects import pandas2ri pandas2ri.activate() db = dbscan.dbscan(iris_numeric, eps = 0.5, MinPts = 5) print(db) ``` ## DBSCAN clustering for 150 objects. ## Parameters: eps = 0.5, minPts = 5 ## Using euclidean distances and borderpoints = TRUE ## The clustering contains 2 cluster(s) and 17 noise points.
## ## 0 1 2 ## 17 49 84 ## ## Available fields: cluster, eps, minPts, dist, borderPoints ``` python # get the cluster assignment vector labels = np.array(db.rx('cluster')) labels ``` ## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ## 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, ## 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, ## 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, ## 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0, ## 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, ## 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]], ## dtype=int32) ## License The dbscan package is licensed under the [GNU General Public License (GPL) Version 3](https://www.gnu.org/licenses/gpl-3.0.en.html). The **OPTICSXi** R implementation was directly ported from the ELKI framework’s Java implementation (GNU AGPLv3), with permission by the original author, Erich Schubert. ## Changes - List of changes from [NEWS.md](https://github.com/mhahsler/dbscan/blob/master/NEWS.md) ## References
Ankerst, Mihael, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. 1999. “OPTICS: Ordering Points to Identify the Clustering Structure.” In *ACM SIGMOD Record*, 28(2):49–60. ACM.
Breunig, Markus M, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. 2000. “LOF: Identifying Density-Based Local Outliers.” In *ACM Int. Conf. on Management of Data*, 29(2):93–104. ACM.
Campello, Ricardo JGB, Davoud Moulavi, and Jörg Sander. 2013. “Density-Based Clustering Based on Hierarchical Density Estimates.” In *Pacific-Asia Conference on Knowledge Discovery and Data Mining*, 160–72. Springer.
Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Joerg Sander. 2015. “Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection.” *ACM Transactions on Knowledge Discovery from Data (TKDD)* 10 (1): 5.
Ertöz, Levent, Michael Steinbach, and Vipin Kumar. 2003. “Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data.” In *Proceedings of the 2003 SIAM International Conference on Data Mining (SDM)*, 47–58. <https://doi.org/10.1137/1.9781611972733.5>.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In *Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)*, 226–31.
Hahsler, Michael, Matthew Piekenbrock, and Derek Doran. 2019. “dbscan: Fast Density-Based Clustering with R.” *Journal of Statistical Software* 91 (1): 1–30. <https://doi.org/10.18637/jss.v091.i01>.
Jarvis, R. A., and E. A. Patrick. 1973. “Clustering Using a Similarity Measure Based on Shared Near Neighbors.” *IEEE Transactions on Computers* C-22 (11): 1025–34.
Moulavi, Davoud, Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Arthur Zimek, and Jörg Sander. 2014. “Density-Based Clustering Validation.” In *Proceedings of the 2014 SIAM International Conference on Data Mining (SDM)*, 839–47.
================================================ FILE: data_src/data_DBCV/dataset_1.txt ================================================ -0.0014755 0.99852 1 -0.005943 0.98904 1 0.028184 1.0181 1 0.019204 1.0041 1 0.033017 1.0128 1 0.011014 0.9857 1 0.033779 1.0033 1 0.045243 1.0096 1 0.02493 0.98413 1 0.064521 1.0185 1 0.032742 0.98149 1 0.042959 0.98645 1 0.049146 0.98734 1 0.05769 0.99058 1 0.070368 0.99792 1 0.070434 0.99262 1 0.09811 1.0149 1 0.078285 0.98967 1 0.096586 1.0025 1 0.10724 1.0077 1 0.083108 0.9781 1 0.088157 0.97763 1 0.092311 0.97624 1 0.10984 0.98821 1 0.12512 0.99789 1 0.13833 1.0055 1 0.12534 0.98686 1 0.13543 0.99127 1 0.13098 0.98113 1 0.14075 0.98519 1 0.16177 1.0005 1 0.13901 0.97193 1 0.14619 0.97331 1 0.14712 0.96842 1 0.16767 0.98311 1 0.19442 1.004 1 0.16394 0.96761 1 0.1977 0.99543 1 0.19514 0.98692 1 0.1946 0.9804 1 0.19852 0.97831 1 0.20655 0.98031 1 0.20457 0.97227 1 0.22232 0.98393 1 0.23737 0.99287 1 0.22462 0.97398 1 0.23313 0.97632 1 0.22676 0.96375 1 0.246 0.97677 1 0.26077 0.98529 1 0.26161 0.97986 1 0.23546 0.9474 1 0.2654 0.97101 1 0.24746 0.9467 1 0.26646 0.95933 1 0.29237 0.97882 1 0.26142 0.94142 1 0.29617 0.9697 1 0.29783 0.96485 1 0.27501 0.93551 1 0.2995 0.95344 1 0.29481 0.94216 1 0.31401 0.95475 1 0.32047 0.95457 1 0.32755 0.95496 1 0.31955 0.94027 1 0.33585 0.94983 1 0.33838 0.9456 1 0.32029 0.92072 1 0.32917 0.92278 1 0.36377 0.95052 1 0.34103 0.9209 1 0.34455 0.9175 1 0.36578 0.93179 1 0.36666 0.92569 1 0.38252 0.93455 1 0.38847 0.93345 1 0.40353 0.94145 1 0.38628 0.91709 1 0.39619 0.91987 1 0.40831 0.92482 1 0.42051 0.92983 1 0.42992 0.932 1 0.41207 0.90689 1 0.41348 0.901 1 0.41216 0.89236 1 0.42511 0.89794 1 0.44358 0.90901 1 0.44485 0.90285 1 0.43699 0.88752 1 0.45736 0.90039 1 0.44539 0.88088 1 0.44175 0.86967 1 0.45383 0.87414 1 0.47455 0.88721 1 0.46535 0.87033 1 0.47352 0.87079 1 0.48349 0.873 1 0.48279 0.86452 1 0.4897 0.86359 1 0.4966 0.86263 1 0.5235 0.88162 1 0.51375 0.86392 1 0.51293 0.85512 1 
0.51094 0.8451 1 0.53526 0.86136 1 0.52601 0.84401 1 0.52951 0.83937 1 0.53659 0.83826 1 0.54668 0.84011 1 0.55938 0.84454 1 0.57416 0.85101 1 0.56963 0.83812 1 0.58407 0.84416 1 0.55567 0.80732 1 0.56363 0.80678 1 0.59075 0.82537 1 0.60254 0.82858 1 0.60324 0.82064 1 0.58442 0.79315 1 0.60202 0.80202 1 0.60983 0.80106 1 0.62846 0.81087 1 0.63324 0.80676 1 0.6081 0.77271 1 0.6167 0.77233 1 0.63294 0.77954 1 0.63518 0.7727 1 0.62163 0.75 1 0.63385 0.75303 1 0.64162 0.75155 1 0.63656 0.73719 1 0.66559 0.75686 1 0.65921 0.74105 1 0.67238 0.74474 1 0.66346 0.72628 1 0.69846 0.75167 1 0.6876 0.73114 1 0.69558 0.72939 1 0.67529 0.6993 1 0.69987 0.71402 1 0.69774 0.70195 1 0.71685 0.71106 1 0.69996 0.68408 1 0.70054 0.67451 1 0.72628 0.69003 1 0.72721 0.68066 1 0.74228 0.68534 1 0.75923 0.69184 1 0.73849 0.66055 1 0.74532 0.65676 1 0.76487 0.66559 1 0.76875 0.65868 1 0.78436 0.66339 1 0.76745 0.6355 1 0.77718 0.63414 1 0.79609 0.64187 1 0.76966 0.60415 1 0.77308 0.59619 1 0.80871 0.62031 1 0.79292 0.59292 1 0.80364 0.59192 1 0.81851 0.59494 1 0.81311 0.57757 1 0.81251 0.56488 1 0.80587 0.546 1 0.81022 0.53799 1 0.81768 0.53293 1 0.82183 0.52442 1 0.84505 0.53482 1 0.83976 0.51654 1 0.8495 0.51313 1 0.87442 0.52471 1 0.88198 0.51875 1 0.88773 0.51078 1 0.85546 0.46458 1 0.89719 0.49216 1 0.87605 0.45664 1 0.87376 0.43972 1 0.90767 0.45874 1 0.90549 0.44138 1 0.90785 0.42826 1 0.89003 0.39464 1 0.90694 0.3954 1 0.93518 0.4071 1 0.92258 0.37754 1 0.91917 0.35673 1 0.94426 0.36391 1 0.92657 0.32774 1 0.94763 0.3297 1 0.95621 0.31846 1 0.93664 0.27824 1 0.94663 0.26663 1 0.9509 0.24815 1 0.97853 0.25164 1 0.98948 0.23668 1 0.97915 0.19814 1 0.98452 0.17207 1 0.99067 0.14174 1 0.9892 0.094075 1 0.98787 -0.012127 1 0.0014755 -0.99852 2 0.005943 -0.98904 2 -0.028184 -1.0181 2 -0.019204 -1.0041 2 -0.033017 -1.0128 2 -0.011014 -0.9857 2 -0.033779 -1.0033 2 -0.045243 -1.0096 2 -0.02493 -0.98413 2 -0.064521 -1.0185 2 -0.032742 -0.98149 2 -0.042959 -0.98645 2 -0.049146 -0.98734 2 
-0.05769 -0.99058 2 -0.070368 -0.99792 2 -0.070434 -0.99262 2 -0.09811 -1.0149 2 -0.078285 -0.98967 2 -0.096586 -1.0025 2 -0.10724 -1.0077 2 -0.083108 -0.9781 2 -0.088157 -0.97763 2 -0.092311 -0.97624 2 -0.10984 -0.98821 2 -0.12512 -0.99789 2 -0.13833 -1.0055 2 -0.12534 -0.98686 2 -0.13543 -0.99127 2 -0.13098 -0.98113 2 -0.14075 -0.98519 2 -0.16177 -1.0005 2 -0.13901 -0.97193 2 -0.14619 -0.97331 2 -0.14712 -0.96842 2 -0.16767 -0.98311 2 -0.19442 -1.004 2 -0.16394 -0.96761 2 -0.1977 -0.99543 2 -0.19514 -0.98692 2 -0.1946 -0.9804 2 -0.19852 -0.97831 2 -0.20655 -0.98031 2 -0.20457 -0.97227 2 -0.22232 -0.98393 2 -0.23737 -0.99287 2 -0.22462 -0.97398 2 -0.23313 -0.97632 2 -0.22676 -0.96375 2 -0.246 -0.97677 2 -0.26077 -0.98529 2 -0.26161 -0.97986 2 -0.23546 -0.9474 2 -0.2654 -0.97101 2 -0.24746 -0.9467 2 -0.26646 -0.95933 2 -0.29237 -0.97882 2 -0.26142 -0.94142 2 -0.29617 -0.9697 2 -0.29783 -0.96485 2 -0.27501 -0.93551 2 -0.2995 -0.95344 2 -0.29481 -0.94216 2 -0.31401 -0.95475 2 -0.32047 -0.95457 2 -0.32755 -0.95496 2 -0.31955 -0.94027 2 -0.33585 -0.94983 2 -0.33838 -0.9456 2 -0.32029 -0.92072 2 -0.32917 -0.92278 2 -0.36377 -0.95052 2 -0.34103 -0.9209 2 -0.34455 -0.9175 2 -0.36578 -0.93179 2 -0.36666 -0.92569 2 -0.38252 -0.93455 2 -0.38847 -0.93345 2 -0.40353 -0.94145 2 -0.38628 -0.91709 2 -0.39619 -0.91987 2 -0.40831 -0.92482 2 -0.42051 -0.92983 2 -0.42992 -0.932 2 -0.41207 -0.90689 2 -0.41348 -0.901 2 -0.41216 -0.89236 2 -0.42511 -0.89794 2 -0.44358 -0.90901 2 -0.44485 -0.90285 2 -0.43699 -0.88752 2 -0.45736 -0.90039 2 -0.44539 -0.88088 2 -0.44175 -0.86967 2 -0.45383 -0.87414 2 -0.47455 -0.88721 2 -0.46535 -0.87033 2 -0.47352 -0.87079 2 -0.48349 -0.873 2 -0.48279 -0.86452 2 -0.4897 -0.86359 2 -0.4966 -0.86263 2 -0.5235 -0.88162 2 -0.51375 -0.86392 2 -0.51293 -0.85512 2 -0.51094 -0.8451 2 -0.53526 -0.86136 2 -0.52601 -0.84401 2 -0.52951 -0.83937 2 -0.53659 -0.83826 2 -0.54668 -0.84011 2 -0.55938 -0.84454 2 -0.57416 -0.85101 2 -0.56963 -0.83812 2 -0.58407 -0.84416 2 
-0.55567 -0.80732 2 -0.56363 -0.80678 2 -0.59075 -0.82537 2 -0.60254 -0.82858 2 -0.60324 -0.82064 2 -0.58442 -0.79315 2 -0.60202 -0.80202 2 -0.60983 -0.80106 2 -0.62846 -0.81087 2 -0.63324 -0.80676 2 -0.6081 -0.77271 2 -0.6167 -0.77233 2 -0.63294 -0.77954 2 -0.63518 -0.7727 2 -0.62163 -0.75 2 -0.63385 -0.75303 2 -0.64162 -0.75155 2 -0.63656 -0.73719 2 -0.66559 -0.75686 2 -0.65921 -0.74105 2 -0.67238 -0.74474 2 -0.66346 -0.72628 2 -0.69846 -0.75167 2 -0.6876 -0.73114 2 -0.69558 -0.72939 2 -0.67529 -0.6993 2 -0.69987 -0.71402 2 -0.69774 -0.70195 2 -0.71685 -0.71106 2 -0.69996 -0.68408 2 -0.70054 -0.67451 2 -0.72628 -0.69003 2 -0.72721 -0.68066 2 -0.74228 -0.68534 2 -0.75923 -0.69184 2 -0.73849 -0.66055 2 -0.74532 -0.65676 2 -0.76487 -0.66559 2 -0.76875 -0.65868 2 -0.78436 -0.66339 2 -0.76745 -0.6355 2 -0.77718 -0.63414 2 -0.79609 -0.64187 2 -0.76966 -0.60415 2 -0.77308 -0.59619 2 -0.80871 -0.62031 2 -0.79292 -0.59292 2 -0.80364 -0.59192 2 -0.81851 -0.59494 2 -0.81311 -0.57757 2 -0.81251 -0.56488 2 -0.80587 -0.546 2 -0.81022 -0.53799 2 -0.81768 -0.53293 2 -0.82183 -0.52442 2 -0.84505 -0.53482 2 -0.83976 -0.51654 2 -0.8495 -0.51313 2 -0.87442 -0.52471 2 -0.88198 -0.51875 2 -0.88773 -0.51078 2 -0.85546 -0.46458 2 -0.89719 -0.49216 2 -0.87605 -0.45664 2 -0.87376 -0.43972 2 -0.90767 -0.45874 2 -0.90549 -0.44138 2 -0.90785 -0.42826 2 -0.89003 -0.39464 2 -0.90694 -0.3954 2 -0.93518 -0.4071 2 -0.92258 -0.37754 2 -0.91917 -0.35673 2 -0.94426 -0.36391 2 -0.92657 -0.32774 2 -0.94763 -0.3297 2 -0.95621 -0.31846 2 -0.93664 -0.27824 2 -0.94663 -0.26663 2 -0.9509 -0.24815 2 -0.97853 -0.25164 2 -0.98948 -0.23668 2 -0.97915 -0.19814 2 -0.98452 -0.17207 2 -0.99067 -0.14174 2 -0.9892 -0.094075 2 -0.98787 0.012127 2 -0.0029509 1.997 3 -0.011886 1.9781 3 0.056369 2.0363 3 0.038408 2.0082 3 0.066034 2.0256 3 0.022028 1.9714 3 0.067558 2.0067 3 0.090485 2.0193 3 0.04986 1.9683 3 0.12904 2.037 3 0.065484 1.963 3 0.085919 1.9729 3 0.098292 1.9747 3 0.11538 1.9812 3 0.14074 1.9958 3 0.14087 
1.9852 3 0.19622 2.0298 3 0.15657 1.9793 3 0.19317 2.0051 3 0.21449 2.0154 3 0.16622 1.9562 3 0.17631 1.9553 3 0.18462 1.9525 3 0.21968 1.9764 3 0.25024 1.9958 3 0.27666 2.011 3 0.25069 1.9737 3 0.27086 1.9825 3 0.26197 1.9623 3 0.28151 1.9704 3 0.32354 2.0009 3 0.27802 1.9439 3 0.29239 1.9466 3 0.29424 1.9368 3 0.33533 1.9662 3 0.38883 2.008 3 0.32789 1.9352 3 0.39539 1.9909 3 0.39027 1.9738 3 0.38919 1.9608 3 0.39703 1.9566 3 0.41309 1.9606 3 0.40914 1.9445 3 0.44464 1.9679 3 0.47475 1.9857 3 0.44924 1.948 3 0.46626 1.9526 3 0.45351 1.9275 3 0.492 1.9535 3 0.52153 1.9706 3 0.52323 1.9597 3 0.47091 1.8948 3 0.5308 1.942 3 0.49491 1.8934 3 0.53293 1.9187 3 0.58475 1.9576 3 0.52284 1.8828 3 0.59234 1.9394 3 0.59565 1.9297 3 0.55002 1.871 3 0.599 1.9069 3 0.58962 1.8843 3 0.62803 1.9095 3 0.64095 1.9091 3 0.65509 1.9099 3 0.63911 1.8805 3 0.6717 1.8997 3 0.67676 1.8912 3 0.64059 1.8414 3 0.65835 1.8456 3 0.72754 1.901 3 0.68206 1.8418 3 0.68909 1.835 3 0.73157 1.8636 3 0.73332 1.8514 3 0.76505 1.8691 3 0.77693 1.8669 3 0.80707 1.8829 3 0.77256 1.8342 3 0.79239 1.8397 3 0.81662 1.8496 3 0.84103 1.8597 3 0.85983 1.864 3 0.82415 1.8138 3 0.82696 1.802 3 0.82433 1.7847 3 0.85022 1.7959 3 0.88716 1.818 3 0.88971 1.8057 3 0.87398 1.775 3 0.91472 1.8008 3 0.89078 1.7618 3 0.8835 1.7393 3 0.90766 1.7483 3 0.94909 1.7744 3 0.9307 1.7407 3 0.94704 1.7416 3 0.96697 1.746 3 0.96559 1.729 3 0.9794 1.7272 3 0.99321 1.7253 3 1.047 1.7632 3 1.0275 1.7278 3 1.0259 1.7102 3 1.0219 1.6902 3 1.0705 1.7227 3 1.052 1.688 3 1.059 1.6787 3 1.0732 1.6765 3 1.0934 1.6802 3 1.1188 1.6891 3 1.1483 1.702 3 1.1393 1.6762 3 1.1681 1.6883 3 1.1113 1.6146 3 1.1273 1.6136 3 1.1815 1.6507 3 1.2051 1.6572 3 1.2065 1.6413 3 1.1688 1.5863 3 1.204 1.604 3 1.2197 1.6021 3 1.2569 1.6217 3 1.2665 1.6135 3 1.2162 1.5454 3 1.2334 1.5447 3 1.2659 1.5591 3 1.2704 1.5454 3 1.2433 1.5 3 1.2677 1.5061 3 1.2832 1.5031 3 1.2731 1.4744 3 1.3312 1.5137 3 1.3184 1.4821 3 1.3448 1.4895 3 1.3269 1.4526 3 1.3969 1.5033 3 
1.3752 1.4623 3 1.3912 1.4588 3 1.3506 1.3986 3 1.3997 1.428 3 1.3955 1.4039 3 1.4337 1.4221 3 1.3999 1.3682 3 1.4011 1.349 3 1.4526 1.3801 3 1.4544 1.3613 3 1.4846 1.3707 3 1.5185 1.3837 3 1.477 1.3211 3 1.4906 1.3135 3 1.5297 1.3312 3 1.5375 1.3174 3 1.5687 1.3268 3 1.5349 1.271 3 1.5544 1.2683 3 1.5922 1.2837 3 1.5393 1.2083 3 1.5462 1.1924 3 1.6174 1.2406 3 1.5858 1.1858 3 1.6073 1.1838 3 1.637 1.1899 3 1.6262 1.1551 3 1.625 1.1298 3 1.6117 1.092 3 1.6204 1.076 3 1.6354 1.0659 3 1.6437 1.0488 3 1.6901 1.0696 3 1.6795 1.0331 3 1.699 1.0263 3 1.7488 1.0494 3 1.764 1.0375 3 1.7755 1.0216 3 1.7109 0.92917 3 1.7944 0.98432 3 1.7521 0.91328 3 1.7475 0.87945 3 1.8153 0.91747 3 1.811 0.88277 3 1.8157 0.85652 3 1.7801 0.78928 3 1.8139 0.79079 3 1.8704 0.8142 3 1.8452 0.75509 3 1.8383 0.71346 3 1.8885 0.72782 3 1.8531 0.65549 3 1.8953 0.65939 3 1.9124 0.63693 3 1.8733 0.55649 3 1.8933 0.53327 3 1.9018 0.49629 3 1.9571 0.50328 3 1.979 0.47337 3 1.9583 0.39629 3 1.969 0.34415 3 1.9813 0.28348 3 1.9784 0.18815 3 1.9757 -0.024254 3 0.0029509 -1.997 4 0.011886 -1.9781 4 -0.056369 -2.0363 4 -0.038408 -2.0082 4 -0.066034 -2.0256 4 -0.022028 -1.9714 4 -0.067558 -2.0067 4 -0.090485 -2.0193 4 -0.04986 -1.9683 4 -0.12904 -2.037 4 -0.065484 -1.963 4 -0.085919 -1.9729 4 -0.098292 -1.9747 4 -0.11538 -1.9812 4 -0.14074 -1.9958 4 -0.14087 -1.9852 4 -0.19622 -2.0298 4 -0.15657 -1.9793 4 -0.19317 -2.0051 4 -0.21449 -2.0154 4 -0.16622 -1.9562 4 -0.17631 -1.9553 4 -0.18462 -1.9525 4 -0.21968 -1.9764 4 -0.25024 -1.9958 4 -0.27666 -2.011 4 -0.25069 -1.9737 4 -0.27086 -1.9825 4 -0.26197 -1.9623 4 -0.28151 -1.9704 4 -0.32354 -2.0009 4 -0.27802 -1.9439 4 -0.29239 -1.9466 4 -0.29424 -1.9368 4 -0.33533 -1.9662 4 -0.38883 -2.008 4 -0.32789 -1.9352 4 -0.39539 -1.9909 4 -0.39027 -1.9738 4 -0.38919 -1.9608 4 -0.39703 -1.9566 4 -0.41309 -1.9606 4 -0.40914 -1.9445 4 -0.44464 -1.9679 4 -0.47475 -1.9857 4 -0.44924 -1.948 4 -0.46626 -1.9526 4 -0.45351 -1.9275 4 -0.492 -1.9535 4 -0.52153 -1.9706 4 -0.52323 
-1.9597 4 -0.47091 -1.8948 4 -0.5308 -1.942 4 -0.49491 -1.8934 4 -0.53293 -1.9187 4 -0.58475 -1.9576 4 -0.52284 -1.8828 4 -0.59234 -1.9394 4 -0.59565 -1.9297 4 -0.55002 -1.871 4 -0.599 -1.9069 4 -0.58962 -1.8843 4 -0.62803 -1.9095 4 -0.64095 -1.9091 4 -0.65509 -1.9099 4 -0.63911 -1.8805 4 -0.6717 -1.8997 4 -0.67676 -1.8912 4 -0.64059 -1.8414 4 -0.65835 -1.8456 4 -0.72754 -1.901 4 -0.68206 -1.8418 4 -0.68909 -1.835 4 -0.73157 -1.8636 4 -0.73332 -1.8514 4 -0.76505 -1.8691 4 -0.77693 -1.8669 4 -0.80707 -1.8829 4 -0.77256 -1.8342 4 -0.79239 -1.8397 4 -0.81662 -1.8496 4 -0.84103 -1.8597 4 -0.85983 -1.864 4 -0.82415 -1.8138 4 -0.82696 -1.802 4 -0.82433 -1.7847 4 -0.85022 -1.7959 4 -0.88716 -1.818 4 -0.88971 -1.8057 4 -0.87398 -1.775 4 -0.91472 -1.8008 4 -0.89078 -1.7618 4 -0.8835 -1.7393 4 -0.90766 -1.7483 4 -0.94909 -1.7744 4 -0.9307 -1.7407 4 -0.94704 -1.7416 4 -0.96697 -1.746 4 -0.96559 -1.729 4 -0.9794 -1.7272 4 -0.99321 -1.7253 4 -1.047 -1.7632 4 -1.0275 -1.7278 4 -1.0259 -1.7102 4 -1.0219 -1.6902 4 -1.0705 -1.7227 4 -1.052 -1.688 4 -1.059 -1.6787 4 -1.0732 -1.6765 4 -1.0934 -1.6802 4 -1.1188 -1.6891 4 -1.1483 -1.702 4 -1.1393 -1.6762 4 -1.1681 -1.6883 4 -1.1113 -1.6146 4 -1.1273 -1.6136 4 -1.1815 -1.6507 4 -1.2051 -1.6572 4 -1.2065 -1.6413 4 -1.1688 -1.5863 4 -1.204 -1.604 4 -1.2197 -1.6021 4 -1.2569 -1.6217 4 -1.2665 -1.6135 4 -1.2162 -1.5454 4 -1.2334 -1.5447 4 -1.2659 -1.5591 4 -1.2704 -1.5454 4 -1.2433 -1.5 4 -1.2677 -1.5061 4 -1.2832 -1.5031 4 -1.2731 -1.4744 4 -1.3312 -1.5137 4 -1.3184 -1.4821 4 -1.3448 -1.4895 4 -1.3269 -1.4526 4 -1.3969 -1.5033 4 -1.3752 -1.4623 4 -1.3912 -1.4588 4 -1.3506 -1.3986 4 -1.3997 -1.428 4 -1.3955 -1.4039 4 -1.4337 -1.4221 4 -1.3999 -1.3682 4 -1.4011 -1.349 4 -1.4526 -1.3801 4 -1.4544 -1.3613 4 -1.4846 -1.3707 4 -1.5185 -1.3837 4 -1.477 -1.3211 4 -1.4906 -1.3135 4 -1.5297 -1.3312 4 -1.5375 -1.3174 4 -1.5687 -1.3268 4 -1.5349 -1.271 4 -1.5544 -1.2683 4 -1.5922 -1.2837 4 -1.5393 -1.2083 4 -1.5462 -1.1924 4 -1.6174 -1.2406 4 -1.5858 
-1.1858 4 -1.6073 -1.1838 4 -1.637 -1.1899 4 -1.6262 -1.1551 4 -1.625 -1.1298 4 -1.6117 -1.092 4 -1.6204 -1.076 4 -1.6354 -1.0659 4 -1.6437 -1.0488 4 -1.6901 -1.0696 4 -1.6795 -1.0331 4 -1.699 -1.0263 4 -1.7488 -1.0494 4 -1.764 -1.0375 4 -1.7755 -1.0216 4 -1.7109 -0.92917 4 -1.7944 -0.98432 4 -1.7521 -0.91328 4 -1.7475 -0.87945 4 -1.8153 -0.91747 4 -1.811 -0.88277 4 -1.8157 -0.85652 4 -1.7801 -0.78928 4 -1.8139 -0.79079 4 -1.8704 -0.8142 4 -1.8452 -0.75509 4 -1.8383 -0.71346 4 -1.8885 -0.72782 4 -1.8531 -0.65549 4 -1.8953 -0.65939 4 -1.9124 -0.63693 4 -1.8733 -0.55649 4 -1.8933 -0.53327 4 -1.9018 -0.49629 4 -1.9571 -0.50328 4 -1.979 -0.47337 4 -1.9583 -0.39629 4 -1.969 -0.34415 4 -1.9813 -0.28348 4 -1.9784 -0.18815 4 -1.9757 0.024254 4 1.4303 -1.0155 -1 -0.47685 -0.96563 -1 0.84056 1.4012 -1 0.093202 -0.41791 -1 -0.54094 -1.6109 -1 -0.25885 -1.2472 -1 0.74337 -0.55785 -1 -1.0824 1.5259 -1 1.8981 0.40646 -1 1.8849 -0.98545 -1 -0.83407 -0.57677 -1 -0.64022 1.5788 -1 1.9672 1.6318 -1 1.1451 -0.21204 -1 1.1687 -0.9417 -1 0.52452 0.21924 -1 1.2342 -1.3084 -1 -0.20569 1.4654 -1 1.3101 -1.0919 -1 -1.4794 -1.3521 -1 0.052576 -1.9281 -1 0.85565 -0.72342 -1 -0.998 0.22474 -1 0.12641 1.3221 -1 -0.46676 1.2395 -1 1.1958 -1.9376 -1 0.67705 -0.52349 -1 1.9134 -0.033122 -1 1.7309 -0.1383 -1 0.30224 -1.8671 -1 -1.6636 0.47667 -1 -0.34148 0.31791 -1 -1.2647 -0.81965 -1 1.964 -0.2621 -1 0.080782 -1.4804 -1 1.5267 -0.81594 -1 0.58746 1.0648 -1 -0.13372 -1.8932 -1 -1.6037 -0.93906 -1 1.8538 2.0218 -1 0.47595 -0.21614 -1 -1.3631 -1.4146 -1 -0.40273 1.5735 -1 1.5157 -1.9092 -1 0.1546 -1.5643 -1 0.17307 -1.015 -1 -0.22804 1.0579 -1 -1.2532 1.6227 -1 -0.9937 -1.1268 -1 -0.85152 0.70602 -1 0.11693 1.2987 -1 0.23711 1.8289 -1 -0.33624 1.525 -1 1.6075 -0.43292 -1 -0.77214 1.7802 -1 0.59348 -0.25709 -1 -0.83697 -1.3749 -1 -0.96984 -0.77479 -1 -0.56196 0.73784 -1 1.2122 1.7683 -1 0.15425 1.8227 -1 0.35689 0.40366 -1 -1.0654 1.8287 -1 -1.5773 -0.39103 -1 0.57317 -1.8698 -1 1.9026 -0.83995 -1 
-1.5782 -1.9069 -1 -1.2369 1.485 -1 -1.9441 -0.27481 -1 1.3406 -1.6589 -1 -0.073933 -1.4756 -1 -0.1247 -1.0512 -1 1.6189 -1.1285 -1 -0.32831 1.4982 -1 0.1749 1.0763 -1 0.78859 -0.63263 -1 -1.6681 -0.46941 -1 0.037311 0.38648 -1 -0.051917 0.14308 -1 1.4102 -0.67809 -1 0.45334 1.445 -1 -1.516 -0.95477 -1 0.42349 1.7679 -1 -1.3307 -0.44882 -1 -0.40012 0.74581 -1 0.12822 -0.91661 -1 1.4868 -1.9231 -1 0.63021 1.7951 -1 1.1397 0.1384 -1 -1.4819 0.69736 -1 0.098963 0.4381 -1 1.583 1.0221 -1 -1.549 1.9609 -1 0.53325 0.92753 -1 -1.6609 1.4557 -1 -0.35175 2.0038 -1 0.84258 1.057 -1 -1.5834 -1.442 -1 1.2282 -0.70763 -1 0.54608 -1.9197 -1 1.5774 0.7926 -1 0.48273 1.869 -1 -0.33838 0.93314 -1 0.58471 0.96454 -1 -0.042523 -1.3256 -1 -1.6098 -0.58906 -1 0.54416 0.30412 -1 1.7842 -0.16318 -1 -0.093611 1.3596 -1 0.40738 1.2851 -1 0.36251 -0.71722 -1 -1.0887 -0.1561 -1 0.66743 0.70871 -1 -1.3609 0.38795 -1 1.0867 -1.4895 -1 -1.1371 -1.9576 -1 -1.3111 -1.5273 -1 0.89457 -1.1274 -1 -0.96612 -0.20721 -1 -1.3363 0.14068 -1 0.4984 1.9978 -1 ================================================ FILE: data_src/data_DBCV/dataset_2.txt ================================================ 191.67 388.02 1 186.28 383.39 1 182.22 397.99 1 194.54 394.76 1 183.43 393.87 1 184.23 388.09 1 192.33 389.85 1 190.66 379.92 1 195.57 391.06 1 191.96 385.75 1 199.7 389.03 1 198.24 396.81 1 193.82 392.53 1 199.6 389 1 183.64 380.64 1 197.05 391.36 1 184.78 385 1 191.82 380.51 1 195.54 391.6 1 201.36 396.71 1 191.57 382.8 1 188.86 394.4 1 193.18 394.9 1 193.52 383.23 1 190.54 390.52 1 193.54 380.93 1 190.13 385.28 1 189.19 389.93 1 196.69 396.46 1 184.35 384.27 1 187.22 388 1 200.74 378.8 1 186.19 394.45 1 183.39 391.57 1 191.16 391.98 1 192.17 384.25 1 191.2 381.73 1 197.37 390.6 1 187.18 396.36 1 185.36 395.56 1 185 378.42 1 200.7 378.48 1 189.43 380.73 1 201.74 385.78 1 191.41 393.4 1 190.49 396.99 1 183 384.57 1 192.84 394.45 1 188.91 393.37 1 195.13 381.42 1 192.55 389.33 1 188.18 379.04 1 194.63 390.13 1 195.78 
384.71 1 194 384.32 1 201.64 392.2 1 189.08 384.43 1 193.18 385.76 1 186.7 380.63 1 193.05 389.65 1 192.87 391.51 1 200.06 383.44 1 187.65 389.69 1 185.65 393.05 1 194.62 385.77 1 200.27 395.52 1 190.14 380.87 1 201.36 386.43 1 197.19 380.52 1 194.06 380.94 1 190.65 381.92 1 185.66 393.78 1 192.36 383.31 1 195.77 390.8 1 186.62 389.32 1 188.21 390.11 1 192.65 396.48 1 195.48 390.87 1 200.97 385.22 1 184.95 393.58 1 197.78 386.29 1 186.87 380.27 1 189.26 386.77 1 190.07 379.13 1 200.59 382.33 1 188.67 396.58 1 200.17 395.49 1 201.76 385.96 1 192.04 397.27 1 192.75 383.17 1 187.46 382.41 1 340.41 481.39 2 340.1 495.16 2 344.78 481.21 2 331.12 496.46 2 340.69 487.92 2 335.3 482.55 2 337.23 499.25 2 342.68 494.51 2 328.22 495.53 2 339.51 486.51 2 341.08 493.48 2 328.75 495.01 2 326.3 488.38 2 327.81 498.99 2 334.28 487.46 2 326.35 492.18 2 341.09 498.63 2 338.57 499.69 2 335.41 492.93 2 332.2 493.39 2 337.98 485.14 2 336.41 483.31 2 339.96 493.2 2 343.33 486.77 2 341.46 485.77 2 330.26 493.31 2 332.52 484.28 2 326.43 499.75 2 328.52 499.14 2 338.5 484.65 2 344.84 482.29 2 334.78 494.12 2 326.04 485.07 2 329.06 481.9 2 331.54 494.11 2 328.48 485.08 2 337.45 499.97 2 339.55 499.48 2 337.43 495.51 2 327.76 494.22 2 330.44 492.63 2 339.49 487.74 2 336.16 482.82 2 341.06 485.45 2 339.48 488.7 2 330.98 480.68 2 331.57 484.42 2 343.33 484.18 2 328.31 488.15 2 334.89 498.58 2 342.32 497.03 2 332.51 487.29 2 326.03 491.16 2 341.69 486.9 2 338.1 492.02 2 332.05 497.13 2 339.75 485.09 2 333.82 484.08 2 329.32 499.44 2 332.68 488.99 2 327.92 487.37 2 337.22 480.92 2 336.15 488.91 2 333.69 490.45 2 326.04 486.61 2 334.56 492.73 2 333.88 489.53 2 337.59 488.77 2 340.21 492.08 2 339.8 493.36 2 329.76 497.27 2 340.95 482.98 2 338.42 484.3 2 344.52 499.25 2 327.57 499.96 2 329.93 499.83 2 335.57 480.24 2 333.34 488.34 2 337.02 493.53 2 340.54 482.09 2 325.4 482.11 2 334.95 494.09 2 336.14 495 2 326.32 495.45 2 332.61 484.52 2 338.13 484.78 2 336.18 494.65 2 331.86 493.36 2 332.1 496.01 
2 344.72 488.13 2 294.98 518.49 3 291.23 516.72 3 288.05 515.34 3 278.89 524.17 3 275.93 521.61 3 282.55 514.63 3 281.36 518.21 3 279.33 518.85 3 283.84 525.99 3 288.43 529.03 3 291.04 528.4 3 278.14 522.74 3 281.71 515.52 3 278.62 526.96 3 291.96 525.89 3 287.65 513.1 3 286.88 512.23 3 285.28 527.11 3 280.91 526.84 3 283.04 531.88 3 285.71 523.52 3 281.09 523.43 3 294.47 517.23 3 285.36 515.44 3 280.82 507.61 3 292.46 516.93 3 288.01 519.53 3 287.09 524.37 3 289.28 514.53 3 278.14 515.27 3 280.88 506.35 3 290.47 508.46 3 286.48 501.99 3 289.54 509.65 3 284.03 505.59 3 290.56 510.6 3 283.11 500.77 3 292.86 512.83 3 280.09 510.88 3 288.59 515.62 3 293.92 504.93 3 283.62 505.76 3 289 500.88 3 284.38 493.54 3 281.44 496.12 3 290.95 496.9 3 293.05 488.36 3 276.16 489.8 3 278.67 505.62 3 283.59 501.08 3 286.26 492.08 3 291.35 490.49 3 288.35 487.82 3 282.77 477.32 3 283.58 480.83 3 292.1 477.41 3 294.59 481.41 3 285.3 479.59 3 279.84 489.81 3 293.43 491.57 3 280.47 469.46 3 279.58 471 3 291.63 475.93 3 291.74 468.47 3 288.93 466.17 3 276.82 482.18 3 282.36 481.33 3 288.45 471.95 3 288.05 469.6 3 276.47 479.88 3 300.56 466.5 3 298.47 471.81 3 284.07 481.36 3 287.38 464.73 3 284.4 473.29 3 287.97 480.09 3 298.61 474.17 3 289.85 469.67 3 283.63 464.57 3 298.97 464.74 3 291.27 458.95 3 294.18 463.21 3 294.48 465.85 3 289.87 468.79 3 290.94 459.28 3 296.98 458.08 3 280.27 454.63 3 285.93 466.91 3 278.23 463.41 3 282.53 454.21 3 288.59 454.81 3 276.72 457.2 3 285.36 461.91 3 277.37 457.95 3 278.49 447.63 3 293.35 448.16 3 291.39 449.97 3 294.06 448.98 3 293.1 455.09 3 287.31 453.59 3 282.92 456.89 3 285.34 460.96 3 277.37 443.03 3 285.76 448.16 3 290.56 452.58 3 292.64 460.18 3 280.59 460.58 3 277.92 446.65 3 287.88 447.08 3 286.07 446.54 3 283.37 453.01 3 285.11 438.68 3 278.74 440.95 3 283.6 443.65 3 284.92 444.38 3 287.73 453.63 3 277.89 445.57 3 289.52 436.22 3 295.98 436.38 3 287.3 436.81 3 296.1 441.71 3 292.47 447.83 3 289.95 450.92 3 297.99 443.04 3 297 434.36 3 
296.57 445.3 3 299.79 440.01 3 299.96 442.62 3 299.68 439.13 3 296.39 436.12 3 320.92 454.47 3 310.97 444.94 3 323.03 452.19 3 309.97 447.02 3 311.33 456.66 3 320.74 452.42 3 323.85 458.09 3 305.53 448.24 3 307.96 450.16 3 318.1 460.06 3 307.99 450.18 3 306.09 447.03 3 314.32 446.33 3 310.71 454.38 3 318.18 440.03 3 317.17 448.15 3 314.51 454.5 3 314.28 444.67 3 315.05 449.01 3 310.99 457.54 3 313.12 441.72 3 309.27 445.33 3 309.35 444.71 3 311.14 443.3 3 305.75 437.12 3 309.01 455.41 3 312.25 437.3 3 305.43 442.71 3 309.84 453.82 3 305.52 444.09 3 321.73 441.87 3 314.2 439.02 3 329.11 440.43 3 316.15 455.43 3 316.5 454.81 3 314.86 452.27 3 323.51 448.13 3 324 439.58 3 322.74 448.47 3 322.93 447.08 3 335.27 437.48 3 338.28 451.23 3 328.09 447.29 3 322.51 449.93 3 323.06 450.62 3 331.6 452.04 3 334.17 449.54 3 330.58 439.86 3 327.51 450.65 3 335.91 449.43 3 343.39 443.55 3 331.72 435.02 3 336.7 447.63 3 330.01 450.15 3 328.63 448.64 3 329.16 436.68 3 327.53 440.84 3 332.23 452.85 3 330.02 447.25 3 328.79 452.26 3 340.74 442.92 3 353.55 435.51 3 345.18 445.63 3 337.29 440.98 3 343.28 435.96 3 341.23 445.12 3 355.15 435.15 3 345.11 444.24 3 339.09 437.76 3 340.62 442.32 3 351.08 437.81 3 355.65 442.49 3 357.24 446.59 3 361.65 444.11 3 345.92 445.21 3 349.02 451.95 3 348.38 438.95 3 358.49 451.57 3 345.9 441.06 3 360.79 449.67 3 358.19 447.11 3 351.18 457.04 3 355.48 453.5 3 354.18 451.7 3 351.94 452.31 3 365.01 439.53 3 369.88 441.77 3 359.26 454.45 3 355.58 455.95 3 353.81 440.37 3 361.36 454.94 3 368.56 447.07 3 375.49 443.25 3 366.36 450.91 3 363.16 455.29 3 371.11 458.19 3 372.98 448 3 373.03 447.92 3 374.39 457.3 3 372.26 445.73 3 377.06 450.13 3 379.63 439.73 3 372.1 441.27 3 383.96 458.63 3 371.14 456.3 3 367.27 441.51 3 381.39 453.48 3 381.73 455 3 366.89 451.22 3 377.3 444.36 3 380.11 448.47 3 379.44 449.11 3 376.58 457.72 3 372.81 456.82 3 382.33 462.54 3 386.07 457.32 3 378.29 448.74 3 373.19 450.57 3 370.29 444.36 3 383.03 452.18 3 368.76 464.2 3 384.78 
457.26 3 383.27 468.2 3 369.62 471.17 3 372.26 465.96 3 371.47 468.03 3 380.38 467.98 3 383.79 455.89 3 385.78 457.04 3 380.57 470.4 3 382.35 480.17 3 386.18 481.18 3 378.57 471.25 3 381.73 468.53 3 376.54 466.42 3 368.33 466.71 3 372.74 474.28 3 382.27 482.73 3 384.97 466.96 3 372.55 468.1 3 378.96 486.39 3 380.22 496.84 3 375.67 493.78 3 366.74 495.22 3 367.36 484.87 3 366.04 493.07 3 366.34 488.55 3 376.13 492.72 3 374.27 494.8 3 371.04 489.64 3 373.75 473.85 3 378.51 489.22 3 385.28 490.17 3 372.82 490.59 3 372.1 476.8 3 370.43 475.76 3 384.99 474.93 3 385.69 476.61 3 381.26 479.32 3 372.69 488.47 3 381.36 492.11 3 384.11 474.87 3 368.05 475.11 3 374.83 473.57 3 369.97 484.63 3 371.07 475.87 3 366.76 489.93 3 384.18 482.18 3 385.75 492.76 3 368.73 488.29 3 385.87 493.03 3 377.38 499.81 3 384.49 495.86 3 372.36 495.33 3 375.01 501.77 3 375.62 488.1 3 379.96 501.58 3 370.54 498.8 3 383.35 503.41 3 371.08 490.36 3 370.21 495.93 3 373.48 514.8 3 370.55 504.81 3 370.71 506.46 3 371.66 499.01 3 377.41 502.92 3 367.2 513.39 3 377.92 514.13 3 384.16 514.71 3 385.25 505.71 3 373.21 523.99 3 377.33 518.61 3 385.47 514.76 3 375.76 521.69 3 371.97 523.57 3 372.11 523.87 3 370.64 515.02 3 366.48 514.77 3 378.54 523.61 3 383.2 513.48 3 367.42 522.37 3 385.56 520.12 3 368.75 523.87 3 384.73 533.84 3 377.31 529.91 3 376.84 529.64 3 378.41 535.94 3 369.63 528.13 3 366.75 524.39 3 366.24 522.79 3 377.66 536.96 3 378.88 541.38 3 372.15 528.05 3 373.61 537.04 3 366.17 522.84 3 383.47 527.33 3 383.24 522.12 3 367.52 527.08 3 373.86 535.5 3 382.26 535.66 3 357.93 535.36 3 365.68 532.88 3 363.11 538.57 3 350.31 534.83 3 365.66 535.6 3 359.09 526.82 3 358.05 533.23 3 367.55 529.39 3 361.73 527.94 3 349.54 536.64 3 347.31 543.75 3 346.07 536.35 3 351.77 531.47 3 353.46 529.62 3 354.41 533.82 3 361.57 541.99 3 346.78 545.99 3 344.67 536.04 3 361.73 540.08 3 355.75 544.66 3 346.84 539.93 3 343.98 538.04 3 342.01 536.7 3 335.88 525.71 3 338 533.94 3 338.97 526.9 3 353.23 530.62 3 338.4 
540.83 3 341.43 533.65 3 336.62 535.57 3 338.84 535.99 3 336.84 529.85 3 325.93 534.64 3 329.93 528.94 3 327.31 526.5 3 342.67 535.84 3 325.67 540.26 3 335.96 529.47 3 324.81 530.54 3 323.57 531.35 3 330.93 539.85 3 325.2 527.89 3 314.42 533.91 3 317.52 532.12 3 329.36 531.92 3 318.32 542.56 3 321.96 540.29 3 322.88 530.85 3 328.42 530.82 3 323.48 524.62 3 313.88 542.08 3 319.01 525.49 3 323.61 529.62 3 320.88 535.79 3 306.95 532.76 3 315.62 541.98 3 316.32 525.54 3 307.04 539.87 3 313.11 543.76 3 317.78 533.33 3 304.88 538.15 3 310.86 537.3 3 306.53 527.56 3 293.92 539.02 3 295.26 525.31 3 298.32 530.93 3 307.76 535.17 3 303.6 528.51 3 295.49 540.66 3 303.73 529.11 3 302.05 532.02 3 302.8 531.32 3 295.21 533.72 3 286.11 528.52 3 296.5 531.68 3 290.35 537.34 3 302.04 536.13 3 285.01 531.55 3 292.5 541.61 3 302.73 526.73 3 286.09 543.6 3 286.95 541.62 3 288.73 525.62 3 291.94 525.4 3 284.91 535.21 3 281.74 536.65 3 282.97 536.73 3 279.31 541.51 3 282.08 529.34 3 288.73 525.19 3 306.37 533.73 3 290.12 539.37 3 294.4 534.15 3 296.61 526.55 3 306.91 536.5 3 306 526.44 3 291.47 542.59 3 305.58 525.35 3 297.59 544.89 3 288.33 527.02 3 305.29 540.79 3 311.39 536.2 3 310.13 526.2 3 318.07 544.46 3 309.02 529.11 3 305.32 536.69 3 317.93 532.3 3 320.65 533.34 3 319.57 542.11 3 308.2 539.44 3 333.5 529.38 3 336.7 535.47 3 318.03 528.88 3 327.08 530.98 3 329.83 531.53 3 323.47 532.39 3 328.83 531.74 3 337.92 543.6 3 329.95 528.19 3 326.37 543.61 3 338.88 534.78 3 333.09 535.94 3 342.69 538.79 3 346.23 542.77 3 334.57 526.75 3 341.95 524.57 3 333.56 541.51 3 331.04 527.26 3 345.95 537.33 3 348.59 528.64 3 352.97 539.86 3 346.53 542.73 3 343.8 542.63 3 341.86 531.31 3 340.08 539.25 3 359.03 532.22 3 340.99 531.07 3 340.08 536.54 3 359.73 533.9 3 351.63 527.78 3 372.06 524.72 3 361.89 541.77 3 359.29 542.9 3 356.45 535.49 3 369.13 525.67 3 361.68 539.93 3 364.28 532.1 3 374.01 538.93 3 370.34 535.18 3 358.54 542.05 3 372.08 528.04 3 372.39 542.33 3 370.01 537.32 3 368.39 525.05 3 
369.99 533.21 3 371.23 533.56 3 382.96 526.72 3 371.94 527.73 3 368.82 527 3 377.01 538.67 3 380.64 509.76 3 379.15 510.5 3 381.52 516.9 3 385.61 509.08 3 385.72 511.69 3 385.25 520.56 3 372.08 521.03 3 373.05 527.4 3 375.92 515 3 380.46 508.53 3 381.17 509.89 3 377.5 498.2 3 378.08 515.71 3 383.04 504.89 3 374.27 498.43 3 371.71 507.72 3 382.35 500.94 3 374.14 512.62 3 372.42 498.97 3 375.79 497.16 3 383.88 449.14 3 375.92 450.35 3 378.77 441.93 3 369.78 441.17 3 377.04 443.58 3 382.24 446.89 3 370.4 456.44 3 371.94 447.36 3 369.27 445.18 3 386.51 454.03 3 307.62 450.76 3 293.17 456.52 3 291.63 444.09 3 303.18 444.98 3 308.05 451.8 3 298.22 448.06 3 308.96 446.43 3 306.11 459.63 3 295.57 453.74 3 293.5 453.79 3 305.05 445.18 3 294.13 455.98 3 289.97 444.39 3 296.05 451.41 3 292.94 442.44 3 293.6 442.95 3 306.73 455.38 3 302.1 441.24 3 297.24 443.52 3 305.96 459.49 3 282.21 479.76 3 295.87 491.26 3 285.18 491.68 3 292.34 478.29 3 294.6 484.44 3 295.86 490.94 3 285.59 490.64 3 277.5 488.55 3 282.95 483.96 3 294.89 478 3 290.23 498.08 3 294.37 506.82 3 283.9 501.09 3 292.28 502.85 3 283.46 506.98 3 293.4 499.57 3 292.27 500.25 3 283.43 492.35 3 289.43 490.77 3 281.36 509.62 3 283.61 494.48 3 278.76 498.29 3 276.56 482 3 279.43 485.63 3 276.21 493.2 3 279.88 482.74 3 285 481.02 3 284.56 487 3 293.44 490.91 3 291.58 485.65 3 198.3 458.14 4 203.89 469.93 4 199.19 463.46 4 198.64 454.27 4 196.65 451.89 4 199.64 464.95 4 207.5 464.11 4 188.17 464.08 4 193.09 458.38 4 203.07 451.7 4 197.56 457.59 4 202.17 466.1 4 200.91 465.19 4 196.98 459.22 4 205.31 463.32 4 195.67 469.62 4 201.69 461.65 4 191.84 464.67 4 191 451.79 4 200.77 466.85 4 201.71 451.69 4 192.59 451.93 4 200.48 453.18 4 194.89 447.26 4 197.26 465.19 4 200.44 466.05 4 196.99 465.68 4 203.72 452.95 4 206.59 454.63 4 207.58 464.62 4 202.88 451.85 4 204.78 446.27 4 200.15 456.7 4 207.99 446.47 4 200.37 458.39 4 201.7 446.48 4 201.3 449.72 4 200.37 440.41 4 215.49 457.73 4 214.85 448.79 4 201.53 448.19 4 211.18 
454.93 4 207.18 441.92 4 212.44 448.85 4 210.42 452.5 4 210.99 452.69 4 218.63 442.57 4 204.34 451.16 4 221.99 437.54 4 218.96 448.85 4 208.27 450.06 4 212.25 447.75 4 217.98 434.97 4 221.75 431.21 4 223.15 449.72 4 219.95 445.2 4 224.86 440.54 4 220.52 430.74 4 225.66 441.67 4 212.64 445.5 4 214.78 443.36 4 218.02 436 4 218.18 444.59 4 218.23 429.54 4 216.33 431.04 4 228.17 447.54 4 214.33 428.31 4 229.09 428.86 4 227.48 436.96 4 227.73 436.69 4 230.46 435.62 4 229.82 423.5 4 234.03 434.96 4 239.51 436.33 4 225.41 430.7 4 225.95 436.85 4 227.65 432.75 4 221.91 423.96 4 227.68 424.42 4 235.92 434.8 4 235.65 420.72 4 228.66 418.43 4 235.06 419.85 4 236.03 428.26 4 229.29 427.44 4 225.75 414.38 4 239.79 416.05 4 243.86 424.58 4 232.34 418.07 4 236.91 428.55 4 241.79 427.96 4 235.44 409.6 4 235.16 416.5 4 244.16 408.77 4 231.48 423.2 4 242.9 424.95 4 246.4 423.32 4 239.09 408.26 4 247.89 409.23 4 244.61 415.54 4 245.38 410.68 4 244.9 411.19 4 236.65 408.73 4 243.95 404.73 4 254.25 407.38 4 251.26 415.82 4 247.76 399.39 4 252.76 404.75 4 240.98 403.86 4 236.98 413.05 4 240.26 404.66 4 255.2 395.54 4 258.91 404.23 4 243.74 406.34 4 252.82 397.57 4 250.77 401.71 4 247.93 399.11 4 252.33 393.87 4 255.98 391.3 4 245.39 396.91 4 246.85 408.06 4 251.04 401.55 4 258.05 396.09 4 246.61 390.08 4 245.69 393.88 4 259.65 385 4 260.94 396.9 4 245.64 402.94 4 244.92 403.08 4 251.78 388.86 4 243.08 398.84 4 258.91 382.06 4 248.48 385.45 4 259.08 386.77 4 250.36 382.85 4 247.62 398.27 4 261.63 384.05 4 247.43 381.97 4 250.61 395.19 4 264.37 390.65 4 258.36 391.3 4 190.06 455.47 4 195.48 451.46 4 201.24 458.19 4 198.89 458.24 4 203.55 468.81 4 199.34 466.97 4 191.85 452.98 4 203.38 455.55 4 198.53 464.05 4 203 455.49 4 194.69 464.91 4 186.61 453.88 4 190.47 462.79 4 195.8 462.98 4 189.89 460.86 4 190.97 452.25 4 194.53 459.51 4 183.97 458.15 4 198.73 449.22 4 186.47 447.6 4 184.29 449.35 4 176.73 455.3 4 191.18 446.89 4 179.84 452.04 4 175.18 443.39 4 181.11 456.55 4 190.27 440.32 4 
190.94 453.05 4 189.19 450.35 4 179.1 452.01 4 174.44 449.97 4 172.03 434.57 4 174.07 450.58 4 175.13 449.2 4 165.88 448.76 4 168.29 432.29 4 179.29 450.29 4 172.98 449.92 4 172.1 450.39 4 167.58 442.8 4 174.78 439.86 4 164.13 436.56 4 176.13 446.54 4 169.48 447.33 4 178.52 433.01 4 166.08 434.11 4 162.53 429.1 4 176.36 432.25 4 168.91 448.69 4 177.75 445.15 4 164.68 432.66 4 166.76 437.64 4 164.07 435.11 4 152.04 431.92 4 168.08 423.64 4 158.56 432.68 4 161.75 426.83 4 170.06 433.08 4 154.79 423.99 4 165.09 429.73 4 154.66 435.48 4 147.97 434.71 4 147.19 430.58 4 154.19 422.82 4 151.84 435.21 4 162.41 430.96 4 153.94 425.48 4 153.24 435.55 4 149.07 419.97 4 146.52 418.66 4 148.73 424.08 4 148.69 416 4 138.89 425.37 4 153.89 411.67 4 152.34 417.64 4 142.22 424.99 4 153.99 412.53 4 149.8 408.78 4 147.58 417.33 4 144.91 423.09 4 148.67 411.39 4 140.61 414.57 4 134.42 406.89 4 146.25 408.91 4 139 418.24 4 140.38 412.91 4 139.69 400.23 4 138.24 409.93 4 137.67 402.47 4 149.3 409.35 4 126.59 412.45 4 140.89 405.4 4 129.15 412.94 4 126.7 400.82 4 140.11 404.86 4 138.7 397.16 4 125.15 403.94 4 132.31 411.92 4 138.65 413.69 4 128.17 395.76 4 132.45 392.75 4 140.23 391 4 135.5 390.57 4 125.47 394.75 4 138.57 407.52 4 131.49 408.38 4 123.88 393.71 4 137.26 394.43 4 126.34 401.15 4 124.47 400.66 4 122.16 404.27 4 136.05 397.37 4 134.26 394.87 4 138.96 386.37 4 136.19 387.27 4 123.36 388.86 4 138.88 390.61 4 139.61 397.8 4 128.32 386.19 4 120.12 401.68 4 123.52 388.45 4 119.1 392.93 4 133.36 385.84 4 120.26 400.77 4 134.65 392.68 4 119.76 393.56 4 124.64 386.31 4 128.29 396.37 4 120.41 393.71 4 123.22 385.37 4 145.77 393.51 4 137.14 393.7 4 138.18 393.07 4 137.07 397.67 4 140.52 389.69 4 135.67 398.87 4 128.85 408.99 4 130.66 405.39 4 134.79 398.03 4 135.89 406.96 4 150.18 418.17 4 142.18 414.92 4 140.63 411.45 4 145.11 407.78 4 147.82 411.9 4 151 407.72 4 150.83 415.74 4 135.16 401.2 4 136.89 414.08 4 140.62 404.82 4 150.2 408.49 4 152.68 422.47 4 151.35 412.86 4 157.05 
424.34 4 148.68 426.44 4 160.54 408.35 4 149.52 417.35 4 153.08 417 4 155.37 412.9 4 159.1 408.44 4 150.24 427.06 4 152.83 419.78 4 160.87 431.75 4 158.89 428.23 4 153.08 416.97 4 167.93 434.65 4 166.45 424.95 4 163.38 433.96 4 160.96 427.63 4 161.1 433.9 4 181.14 437.87 4 176.46 438.86 4 169.81 438.33 4 182.45 430.82 4 163.65 445.6 4 181.41 431.22 4 166.67 440.68 4 178.08 432.8 4 167.84 440.94 4 169.44 436.48 4 171.23 449.72 4 182.82 444.9 4 176.3 445.6 4 188.05 441.99 4 183.7 439.02 4 175.06 445.93 4 180.96 448.71 4 183.01 442.18 4 169.45 449.21 4 187.35 437.74 4 191.27 444.3 4 182.97 438.2 4 185.82 440.49 4 189.89 441.88 4 188.1 445.55 4 182.45 448.23 4 177.89 452.31 4 193.3 455.14 4 195.03 439.95 4 189.35 439.2 4 117.72 385.48 4 122.63 377.76 4 121.74 387.04 4 124.2 375.18 4 127.16 382.54 4 127.69 382.12 4 123.44 381.06 4 121.13 376.28 4 127.91 371.9 4 133.42 381.22 4 130.44 374.66 4 136.7 380.41 4 128.86 374.55 4 136.66 367.58 4 138.2 382.07 4 127.34 375.42 4 140.79 381.27 4 125.45 380.05 4 132.13 378.84 4 131.31 376.54 4 137.27 366.15 4 133.56 370.78 4 138.64 360.85 4 138.61 361.91 4 137.03 359.91 4 142.13 359.84 4 140.54 361.11 4 139.26 374.2 4 130.06 359.98 4 147.35 368.34 4 140.84 353.83 4 134.76 366.96 4 150.47 356.02 4 144.82 367.66 4 151.73 367 4 139.22 371.1 4 147.48 372.7 4 141.74 356.31 4 147.96 360.38 4 139.26 357.93 4 141.8 358.32 4 156.21 348.31 4 156.45 363.2 4 156.08 352.66 4 150.86 357.83 4 152.37 350.2 4 158.09 357.93 4 156.27 360.5 4 157.75 363.24 4 155.08 362.48 4 162.43 346.75 4 163.35 349.73 4 161.67 346.8 4 148.99 356.4 4 153.2 348.88 4 159.99 354.49 4 156.76 343.47 4 152.85 362.09 4 153.94 347.41 4 154.56 353.59 4 163.39 342.31 4 165.92 338.05 4 157 345.79 4 172.24 340.97 4 164.41 345.46 4 157 348.08 4 161.71 337.14 4 154.67 344.41 4 172.13 348.46 4 163.47 341.69 4 164.7 343.18 4 172.45 337.05 4 171.88 346.85 4 163.73 335.13 4 175.97 338.81 4 157.38 343.17 4 156.84 337.49 4 166.79 351.48 4 171 345.14 4 172.25 346.07 4 176.58 341.91 4 
169.96 332.16 4 178.97 325.11 4 178.44 326.47 4 169.4 336.15 4 181.5 328.26 4 171.77 343.04 4 176.6 328.48 4 175.76 340.22 4 172.65 341.4 4 174.81 327.5 4 191.28 324.43 4 190.37 340.29 4 175.31 340.67 4 186.26 340.09 4 176.14 336.95 4 184.61 340.22 4 182.48 338.43 4 190.87 322.33 4 176.67 325.27 4 178.32 337.64 4 186.63 326.13 4 176.43 333.61 4 177.88 335.8 4 191.76 332.31 4 179.12 338.13 4 185.95 329.39 4 187.96 330.96 4 175.63 337.27 4 179.15 321.21 4 193.52 328.24 4 178.94 327.4 4 191.32 338.3 4 193.89 324.32 4 179.57 336.73 4 184.27 335.84 4 183.85 330.84 4 192.23 323.22 4 193.01 325.36 4 184.32 327.39 4 200.13 343.46 4 202.98 332.06 4 198.42 329.82 4 188.9 340.01 4 200.39 344.17 4 191.98 331.19 4 187.45 334.11 4 196.08 342.22 4 192.31 342.2 4 192.81 337.94 4 190.76 349.94 4 200.67 349.47 4 208.32 341.83 4 197.8 346.03 4 208.77 350.61 4 201.54 335.23 4 193.3 346.37 4 196.31 345.38 4 202.7 337.96 4 208.83 333.32 4 195.91 344.7 4 212.93 334.53 4 207.48 342.27 4 202.13 353 4 203.85 341.3 4 199.29 341.41 4 212.12 341.43 4 206.18 336.02 4 200.38 340.09 4 200.35 350.43 4 206.29 341.36 4 217.96 351.75 4 222.43 337.16 4 218.31 344.44 4 211.09 350.99 4 214.13 346.28 4 208.15 339.44 4 218.07 341.14 4 213.75 349.12 4 215.24 337.6 4 224.5 350.74 4 210.19 344.25 4 209.7 357.27 4 211.29 347.59 4 220.44 348.32 4 222.85 343.04 4 219.92 351.36 4 225.22 345.5 4 225.96 340.91 4 222.28 357.91 4 220.36 363.04 4 219.53 361.85 4 226.05 346.9 4 220.05 353.52 4 228.9 362.98 4 225.64 346.26 4 228.64 353.13 4 220.18 360 4 223.64 360.37 4 228.58 354.81 4 228.59 367.66 4 231.4 371.1 4 242.12 371.48 4 232.93 371 4 231.8 363.3 4 242.2 355.7 4 228.65 358.03 4 229.7 372.52 4 232.95 355.08 4 229.14 363.95 4 234.46 370.7 4 247.11 373.95 4 243.17 358.13 4 239.66 359.27 4 232.77 365.49 4 243.63 368.23 4 241.06 373.55 4 240.9 367.56 4 248.27 376.65 4 237.77 360.71 4 253.71 364.84 4 243.26 379.69 4 254.33 375.17 4 245.46 373.74 4 247.71 366.26 4 240.4 366.04 4 256.63 382.68 4 247.55 372.71 4 248.04 
377.17 4 240.53 363.84 4 242.42 377.33 4 257.53 369.08 4 257.42 370.07 4 251.96 382.08 4 248.29 369.64 4 259.34 385.78 4 253.46 371.86 4 255.27 373.52 4 244.83 369.61 4 248.63 379.58 4 235.05 369.8 4 237.22 372.49 4 249.32 368.07 4 242.86 374.1 4 238.66 362.07 4 250.89 375.81 4 241.31 370.12 4 237.49 362.68 4 237.23 371.37 4 246.65 360.96 4 219.41 365.99 4 223.52 360.2 4 228.85 369.91 4 217.11 361.84 4 234.9 357.68 4 222.46 363.06 4 223.96 361.62 4 230.89 367.73 4 229.33 357.19 4 230.89 369.42 4 226.28 352.54 4 213.51 353.54 4 214.99 363.01 4 226.78 361.09 4 217.91 354.33 4 214.09 357.95 4 221.93 355.37 4 229.22 349.4 4 225.11 358.57 4 211.71 354.16 4 210.42 344.9 4 213.16 343.07 4 213.08 349.23 4 206.17 350.93 4 219.06 343.72 4 217.43 348.01 4 206.71 339.37 4 212.88 345.11 4 214.92 342.69 4 210.66 343.11 4 202.17 343.93 4 190.27 339.93 4 191.43 339.48 4 201.45 328.21 4 205.77 346.44 4 206.99 330.85 4 200.21 339.23 4 201.98 341.81 4 193.11 330.56 4 195.36 338.81 4 192.18 333.78 4 178.05 337.99 4 182.74 327.85 4 187.46 341.07 4 191.62 326.92 4 189.26 333.86 4 181.52 334.34 4 177.42 337.54 4 186.32 326.16 4 179.18 323.71 4 130.22 368.24 4 134.92 370.21 4 130.08 356.89 4 137.76 375.17 4 141.57 356.92 4 134.31 367.94 4 145.15 357 4 148.45 362.05 4 143.25 359.34 4 132.38 365.07 4 130.39 374.35 4 138.76 368.03 4 134.69 370.98 4 132.68 357.75 4 142.41 365.15 4 145.76 374.53 4 138.66 368.6 4 137.79 363.31 4 133.72 359.79 4 142.8 367.33 4 140.04 358.54 4 149.51 355.02 4 150.25 357.86 4 155.36 345.6 4 150.69 346.15 4 155.44 347.88 4 155.29 357.69 4 153.12 342.46 4 148.18 340.92 4 155.79 349.6 4 157.19 355.96 4 153.69 348.89 4 149.19 351.88 4 151.05 351.77 4 149.87 355.65 4 163.68 354.24 4 154.32 344.1 4 154.39 343.41 4 156.59 343.95 4 159.8 350.55 4 163.18 353.59 4 157.1 362.28 4 149.59 356.35 4 147.21 365.17 4 144.89 346.11 4 152.48 361.98 4 157.29 360.26 4 149.05 361.91 4 146.33 358.3 4 148.53 346.72 4 139.25 349.31 4 218.04 437.44 4 220.29 450.98 4 225.04 433.06 4 214.25 
434.21 4 215.99 441.05 4 208.72 449.19 4 227.07 442.72 4 221.21 433.19 4 222.83 445.46 4 218.46 438.48 4 211.28 436.06 4 217.16 434.23 4 206.12 451.65 4 212.12 445.7 4 203.89 435.37 4 210.29 452.63 4 212.07 441.61 4 215.74 439.71 4 217.13 434.17 4 204.02 441.62 4 223.78 434.98 4 218.42 437.18 4 210.56 430.4 4 217.65 425.46 4 216.93 430.51 4 222.84 442.86 4 213.28 431.26 4 210.09 434.33 4 222.73 442.82 4 214.36 441.69 4 237.39 418.2 4 229.47 429.84 4 244.66 429.93 4 239.66 429.15 4 244.76 425.95 4 237.55 426.22 4 243.88 422.54 4 240.95 421.33 4 229.03 427.66 4 229.86 420.24 4 249.26 404.24 4 251.17 415.68 4 251.71 419.58 4 252.39 420.42 4 251.73 411.15 4 244.97 401.75 4 242.29 401.06 4 238.15 419.37 4 250.35 412.23 4 244.49 419.99 4 259.66 395.81 4 261.1 407.93 4 250.81 404.14 4 254.44 408.97 4 252.95 405.59 4 262.43 394.1 4 255.92 397.37 4 261.36 395.05 4 250.06 408.09 4 262.14 392.31 4 257.57 396.5 4 270.06 403.05 4 262.73 409.5 4 267.01 408.7 4 262.4 392.08 4 267.11 395.43 4 271.62 396.93 4 262.14 393.18 4 271.35 390.59 4 257.85 406.07 4 228.08 364.61 4 231.64 362.28 4 243.51 364.21 4 226.72 357.82 4 230.6 367.68 4 240.6 362.96 4 238.25 372.67 4 229.44 360.23 4 232.08 364.63 4 232.92 361.7 4 252.77 381.88 4 247.62 392.28 4 263.44 387.81 4 253.68 382.19 4 259.08 392.99 4 264.67 393.6 4 255.51 377.17 4 262.49 379.26 4 252.66 388 4 247.5 388.68 4 256.48 390.36 4 256.89 377.67 4 261.53 378.71 4 255.17 386.52 4 254.66 392.37 4 258.6 389.34 4 247.35 382.35 4 265.2 376.06 4 264.45 376.43 4 250.61 380.42 4 256.62 386.65 4 251.43 395.22 4 266.27 388.11 4 258.57 391.79 4 259.73 391 4 254.08 399.91 4 253.78 392.9 4 257.59 388.64 4 261.87 397.35 4 252.47 400.82 4 264.4 398.63 4 250 403.1 4 259.46 396.45 4 261.72 408.46 4 252.46 393.86 4 254.19 409.98 4 258.04 397.77 4 247.66 397.47 4 266.59 403.65 4 262.47 405.94 4 247.05 409.1 4 245.69 420.5 4 234.86 407.95 4 240.49 409.16 4 238.89 408.82 4 242.22 414 4 247.82 414.76 4 238.5 407.32 4 243.72 422.43 4 246.33 408.02 4 231.04 
421.65 4 237.27 421.82 4 226.69 419.44 4 224.6 421.41 4 239.16 426.92 4 228.72 420.58 4 227.71 426.17 4 241.93 429.33 4 239.63 419.52 4 234.09 415.81 4 227.93 434.94 4 233.62 436.41 4 217.54 425.56 4 232.23 439.02 4 234.67 427.1 4 218.98 420.79 4 229.37 429.67 4 216.5 427.53 4 233.05 439.01 4 217.24 436.4 4 213.08 430.56 4 203.91 430.68 4 215.42 444.26 4 202.52 431.18 4 220.37 447.48 4 201.17 431.52 4 213.4 434.65 4 213.54 439.4 4 210.96 434.04 4 220.72 442.19 4 210.34 444.44 4 194.43 450.68 4 199.3 447.32 4 194.28 456.99 4 195.82 444.78 4 204.58 447.64 4 200.65 446.53 4 204.44 454.08 4 196.37 445.67 4 213.8 454.18 4 352.62 446.2 3 347.57 441.9 3 351.7 443.21 3 349.3 435.54 3 350.36 444.3 3 348.3 435.53 3 349.05 449.92 3 364.43 437.18 3 353.32 446.58 3 348.12 437.77 3 354.46 440.77 3 347.62 438.56 3 347.37 433.29 3 341.16 442.67 3 354.43 446.37 3 341.44 435.51 3 346.13 433.37 3 354.07 437.71 3 350.66 441.36 3 339.25 435.67 3 354.16 434.2 3 330.65 452.58 3 336.68 441.11 3 334.92 443.7 3 331.34 453.99 3 331.77 453.2 3 335.38 434.56 3 332.28 437.36 3 344.19 447.13 3 344.17 435.99 3 341.41 444.4 3 353.29 445.79 3 342.73 439.77 3 351.52 448.73 3 340.21 438.36 3 334.65 443.17 3 345.58 443.9 3 343.81 439.37 3 342.66 446.82 3 334.2 449.86 3 341.76 435.64 3 358.82 442.96 3 366.64 442.46 3 367.76 434.95 3 368.64 435.45 3 364.24 444.29 3 368.97 449.81 3 356.26 445 3 362.38 452.6 3 359.9 444.43 3 368.63 447.55 3 365.35 482.83 3 367.35 484.42 3 365.03 479.79 3 380.03 482.61 3 381.3 487.57 3 370.08 475.27 3 366.23 487.66 3 375.72 487.36 3 370.03 479.5 3 382.27 485.04 3 365.76 471.48 3 368.39 469.64 3 383.48 482.17 3 384.07 476.72 3 384.59 473.77 3 381.4 468.12 3 384.2 478.41 3 376.22 467.17 3 368.14 479.56 3 383.01 466.91 3 379.56 515.72 3 382.6 511.26 3 381.47 511.99 3 363.12 504.73 3 366.87 516.62 3 365.36 507.87 3 372.71 513.27 3 364.66 501.16 3 365.96 505.78 3 382.53 497.86 3 282.64 365.27 -1 180.87 329.72 -1 348.24 507.23 -1 148.71 327.57 -1 354.93 403.76 -1 222.26 360.51 
-1 198.31 434.52 -1 320.1 432.7 -1 189 515.62 -1 258.98 449.59 -1 179.34 359.65 -1 132.88 525.39 -1 340.22 470.39 -1 148.85 434.21 -1 263.38 506 -1 285.63 385.88 -1 117.82 519.24 -1 161.7 440.67 -1 120.28 342.03 -1 320.12 429.9 -1 306.87 393.55 -1 213.5 443.29 -1 297.92 536.26 -1 142.18 403.83 -1 338.84 327.16 -1 231.48 487.44 -1 145.03 394.24 -1 143.95 470.76 -1 146.31 433.12 -1 284.02 514 -1 171.74 489.49 -1 231.47 410.66 -1 335.18 436.78 -1 206.13 452.43 -1 191.19 435.46 -1 337.81 486.17 -1 173.22 484.99 -1 293.66 364.85 -1 272.55 423.46 -1 269.24 437.79 -1 202.3 354.73 -1 156.87 496.98 -1 152.72 358.4 -1 344.22 432.53 -1 245.29 379.47 -1 180 460.97 -1 327.66 423.96 -1 301.15 378.63 -1 161.06 441.34 -1 155.67 341.81 -1 297.47 447.43 -1 294.78 512.77 -1 290.77 484.74 -1 167.93 442.3 -1 338.69 472.18 -1 282.69 514.68 -1 368.29 445.28 -1 295.5 525.92 -1 184.7 446.36 -1 130.27 514.15 -1 127.7 422.99 -1 310.78 357.32 -1 183.15 374.53 -1 156.88 458.72 -1 129.54 418.71 -1 335.13 457.69 -1 343.7 374.41 -1 179.41 531.05 -1 318.73 479.32 -1 193.51 473.08 -1 208.64 409.69 -1 206.21 466.76 -1 135.63 422.2 -1 138.46 543.74 -1 250.73 342.26 -1 218.81 426.49 -1 229.69 477.14 -1 284.29 532.91 -1 296.07 463.16 -1 288.54 459.93 -1 209.16 465.87 -1 317.9 355.28 -1 304.66 376.21 -1 275.39 444.87 -1 354.99 375.45 -1 375.01 434.52 -1 348.9 501.32 -1 203.83 470.36 -1 296.09 329.48 -1 361.14 367.97 -1 301.95 411.83 -1 386.27 527.61 -1 251.95 457.88 -1 191.07 370.28 -1 304.1 399.64 -1 385.17 381.44 -1 174.27 407.84 -1 229.46 518.63 -1 161.27 445.97 -1 184.84 394.25 -1 223.55 466.99 -1 209.15 458.42 -1 367.3 341.21 -1 239.41 403.66 -1 259.26 532.49 -1 252.43 528.64 -1 335.53 379.85 -1 334.05 380.68 -1 149.39 496.62 -1 283.93 411.55 -1 325.42 520.47 -1 194.28 422.86 -1 190.06 432.93 -1 243.01 405.05 -1 217.51 407.01 -1 246.54 424.18 -1 235.51 482.56 -1 186.32 485.29 -1 155.66 396.77 -1 189.27 425.58 -1 211.39 406.11 -1 185.6 335.59 -1 122.48 542.1 -1 238.88 455.25 -1 360.86 422.39 -1 
294.33 539.7 -1 156.27 476.01 -1 187.2 430.9 -1 330.4 401.54 -1 313.23 418.2 -1 338.97 367.56 -1 369.39 443.68 -1 169.11 536.44 -1 259.36 363.25 -1 193.15 494.34 -1 227.49 509.44 -1 182.19 525.38 -1 213.83 462.97 -1 161.89 454.64 -1 118.4 494.88 -1 227.03 382.59 -1 192.88 323.07 -1 165.37 381.55 -1 275.14 441.21 -1 311.66 371.82 -1 311.59 526.32 -1 254.42 346.56 -1 205.33 381.08 -1 239.98 429.06 -1 197.52 469.78 -1 301.31 509.12 -1 366.28 463.88 -1 196.14 502.57 -1 245.66 449.88 -1 222.96 431.74 -1 203.44 501.38 -1 225.29 420.35 -1 353.61 332.2 -1 373.92 543.17 -1 118.9 489.63 -1 279.38 451.28 -1 328.88 396.33 -1 289.34 454.59 -1 149.17 523.88 -1 308.7 510.69 -1 144 516.41 -1 274.9 344.51 -1 345.54 498.18 -1 179.74 412.47 -1 ================================================ FILE: data_src/data_DBCV/dataset_3.txt ================================================ -6.1698 2.2449 1 -2.6453 6.9494 1 -4.9691 4.9966 1 -3.0064 6.868 1 -4.3216 -3.2774 1 -0.45173 -4.763 1 2.2952 1.3859 1 -2.5657 7.1216 1 -2.3846 7.1456 1 -3.6913 6.332 1 -5.153 -2.4078 1 -1.3853 7.5652 1 1.8265 1.6745 1 3.1154 0.26274 1 -5.778 -1.4073 1 -2.4125 7.1112 1 2.5649 -3.0943 1 0.67043 -4.389 1 -0.79545 7.7774 1 -0.44995 -4.6329 1 -6.1132 1.9711 1 -6.1505 2.1439 1 -5.5912 -1.9206 1 -6.2993 1.4634 1 -1.3839 7.7193 1 -5.9906 2.6321 1 3.4322 -0.6101 1 -5.3034 -2.3233 1 -0.65154 -4.7226 1 -6.1521 0.39326 1 -3.8776 -3.7479 1 -6.3132 1.3069 1 0.93691 -4.4268 1 -0.35854 7.9465 1 -0.78787 -4.6477 1 -2.1887 7.3713 1 -0.74915 7.7361 1 -6.2274 0.0082439 1 -1.7182 7.5466 1 2.8764 -2.5084 1 2.0636 -3.6648 1 -4.6502 5.4532 1 2.7572 1.1565 1 2.5495 1.2955 1 3.2146 -0.17538 1 -6.2863 0.82606 1 -5.3056 4.401 1 -6.061 -0.26088 1 -5.2331 -2.2809 1 -2.6057 -4.3723 1 -5.6997 -1.5265 1 -0.87514 -4.7666 1 -6.1727 0.38501 1 2.5422 -3.008 1 -4.1156 5.9038 1 -2.0388 7.4532 1 3.4408 -0.76035 1 -5.2475 -2.2421 1 -5.5362 4.12 1 -1.4176 7.7016 1 2.4145 -3.3185 1 2.9195 -2.5891 1 -4.3937 -3.3767 1 1.7697 1.7792 1 -2.1413 -4.4583 1 
-2.3857 -4.3793 1 -2.3096 7.314 1 -0.63063 -4.7641 1 -1.9559 -4.6505 1 -6.2077 0.76146 1 -5.1073 -2.3705 1 -5.8922 -1.2146 1 -3.1161 -4.1828 1 2.18 1.5453 1 -2.1073 7.4124 1 3.2978 -1.6224 1 1.2467 -4.2636 1 -3.2396 -4.0549 1 -5.7833 3.4376 1 -5.1881 -2.2561 1 0.60866 -4.403 1 2.178 -3.671 1 -1.998 -4.5224 1 -0.27819 -4.676 1 -6.1715 1.2513 1 -2.7158 7.0806 1 2.8426 -2.5933 1 -5.9742 -0.66186 1 -0.7476 7.7348 1 2.1055 1.7288 1 -5.8099 3.4073 1 -6.2355 0.50701 1 3.0143 -2.3951 1 -1.5247 7.5515 1 1.8699 -3.7282 1 -4.9572 4.9802 1 -4.3967 5.6445 1 3.3057 -1.5753 1 -4.8397 5.0022 1 -5.0865 -2.5794 1 3.3492 -1.7359 1 2.8991 -2.6453 1 -3.9605 -3.5645 1 1.5611 1.8471 1 3.3819 0.011511 1 0.16344 7.931 1 -1.5348 -4.6219 1 -3.9732 -3.56 1 -1.3544 -4.715 1 3.0456 -2.5702 1 -4.9667 -2.6651 1 -0.10006 7.8403 1 -6.1917 0.36259 1 -3.9011 6.0356 1 -0.90232 -4.6295 1 3.021 -2.4461 1 -6.076 -0.34853 1 3.214 0.30889 1 2.6117 1.3262 1 -0.83203 -4.6298 1 -6.2157 1.6625 1 -5.8382 3.1314 1 -6.1839 1.805 1 -2.8905 7.0309 1 0.14707 -4.5635 1 1.4356 -4.1191 1 -5.19 -2.3845 1 -5.1262 4.6174 1 2.3806 -3.4144 1 -5.8711 3.4416 1 -6.1608 -0.093702 1 0.21929 -4.5793 1 -1.1652 -4.7321 1 -3.9254 6.228 1 -3.7933 -3.8204 1 -1.6515 -4.5556 1 -5.9038 -0.9972 1 -6.2008 0.83282 1 1.6254 -3.9999 1 -5.6922 3.7521 1 -1.4298 7.5338 1 -2.7265 6.8836 1 -3.3159 6.6052 1 1.7764 -3.8709 1 1.0788 -4.3696 1 3.365 -0.99269 1 -3.9803 -3.6076 1 -4.9048 5.0547 1 -3.8898 6.2905 1 -1.4511 7.5179 1 -5.4893 -1.7314 1 3.3652 -1.1164 1 -0.98328 -4.6799 1 2.2731 1.5816 1 3.0047 0.36568 1 -3.2663 -4.1672 1 1.4962 1.8261 1 3.2998 -1.8415 1 -6.094 -0.34383 1 -5.5513 4.1196 1 -6.1505 1.898 1 1.2574 -4.2461 1 -5.8598 -1.3176 1 -1.2144 7.7601 1 -3.7141 6.3007 1 -5.3108 -2.3238 1 -5.8959 3.1082 1 -4.6705 -3.115 1 -5.3752 -2.1539 1 -2.2387 7.2425 1 -5.9466 3.3294 1 -5.851 -1.2134 1 2.6023 1.2032 1 -2.72 -4.3814 1 -3.4958 -3.9466 1 -0.93825 -4.7496 1 -4.1957 -3.5278 1 1.2663 -4.2248 1 3.2018 -2.1513 1 -3.4006 -4.1237 1 -2.171 7.313 1 
-4.5383 5.4181 1 -4.0748 -3.5052 1 0.085267 -4.7325 1 2.4304 1.2277 1 -2.9598 6.8878 1 -6.1139 2.1374 1 -1.4481 7.5303 1 -4.8102 5.1402 1 -5.8818 3.5197 1 3.3688 -1.2538 1 -5.1811 4.6182 1 3.2674 -1.4453 1 -1.4325 -4.6638 1 3.3017 -0.12231 1 -2.1011 -4.5904 1 -4.0724 6.0534 1 -4.7317 5.2959 1 -5.4326 4.3945 1 -6.1893 2.2533 1 -2.2982 7.2706 1 3.0265 -2.3502 1 -3.7599 -3.904 1 -5.6157 3.99 1 -3.724 -3.84 1 -5.6048 3.8756 1 2.8111 -2.852 1 -0.80854 7.7524 1 -0.89907 7.8053 1 -6.0955 2.1419 1 -2.3342 -4.4729 1 -1.0081 -4.7688 1 -6.0881 2.6516 1 -1.6014 7.5111 1 -2.5625 -4.4832 1 -4.4079 -3.2629 1 -6.151 -0.31588 1 -5.9188 3.2525 1 -1.1186 -4.7033 1 -2.5774 -4.4689 1 -0.45431 -4.7551 1 -4.6819 5.3644 1 -0.77646 -4.7392 1 -4.583 5.6029 1 -4.9591 5.0709 1 3.3449 -1.0402 1 2.8335 -2.7907 1 -0.5272 -4.6913 1 -3.6065 6.3909 1 0.72579 -4.3608 1 2.8243 1.0215 1 -5.1695 4.6522 1 0.98626 -4.3009 1 -2.9066 6.8253 1 0.039289 7.8544 1 -4.4984 -3.1992 1 0.20231 -4.548 1 3.3438 -0.065579 1 -4.5904 -3.0627 1 -2.0359 -4.4939 1 -1.9076 -4.5294 1 -6.2583 0.89337 1 -6.1533 2.2783 1 2.5585 1.1847 1 -5.8076 3.3952 1 -4.1019 -3.4957 1 -6.303 1.606 1 -1.022 -4.8092 1 -3.0028 -4.3178 1 -6.1773 2.0103 1 -5.697 -1.4977 1 -6.1036 0.13014 1 -5.3868 4.384 1 -2.8206 -4.2493 1 -5.8641 3.2251 1 -0.57206 7.8735 1 -1.687 -4.6287 1 2.7958 0.98318 1 -4.4925 -3.149 1 -2.4689 -4.4222 1 -5.4706 -2.0264 1 -3.4792 6.6521 1 2.0998 -3.7863 1 -0.08993 -4.575 1 1.5086 1.9064 1 -6.0624 2.3474 1 -5.9216 3.1475 1 -2.0402 -4.5406 1 -1.5541 7.4392 1 -4.3069 -3.5199 1 -0.19346 -4.6784 1 -6.2463 0.82607 1 1.9587 -3.6271 1 -3.8419 6.3257 1 -5.8543 3.1637 1 -3.6623 -3.8544 1 -4.9279 -2.7225 1 2.452 1.2546 1 -4.961 -2.7089 1 -3.3635 6.6231 1 -1.7314 -4.6645 1 -2.1333 7.4257 1 -3.6074 -3.8437 1 -4.2607 -3.33 1 -5.9917 2.8099 1 -0.95921 -4.7305 1 -1.4213 7.625 1 -2.5912 -4.4147 1 1.2211 1.9544 1 3.1921 0.26394 1 2.9753 -2.3685 1 -6.181 2.5176 1 1.9384 -3.7714 1 -0.74784 7.7964 1 3.3864 -0.72418 1 0.34031 -4.4723 1 3.3078 
-1.8496 1 -0.49715 7.8825 1 -0.83413 -4.7821 1 -2.8652 6.9821 1 -4.6518 5.3171 1 3.3604 -0.73416 1 -5.9145 -1.1117 1 -6.2117 1.0877 1 -5.8441 -1.089 1 -1.7505 7.5096 1 2.3767 1.3365 1 3.2739 -1.9942 1 2.8171 -2.8902 1 -6.1911 2.5085 1 -6.2916 1.5628 1 3.2681 -1.2102 1 -6.0338 -0.08985 1 -1.6683 -4.601 1 0.096065 -4.5557 1 -2.7946 6.8945 1 1.1918 -4.2068 1 3.4074 -0.2537 1 -5.9936 2.7372 1 -4.9291 5.1042 1 -2.4852 -4.4388 1 -1.7567 -4.6021 1 -4.6514 -3.0039 1 -1.0551 7.6402 1 -1.6229 7.5916 1 3.1767 -2.1681 1 -2.4498 -4.5642 1 -4.7123 -2.9139 1 -5.69 3.6948 1 3.2183 -0.028644 1 2.8023 1.0581 1 3.4585 -0.50888 1 2.0868 -3.7497 1 -4.4788 5.4788 1 3.3178 -0.50115 1 -3.317 -4.1684 1 -1.0013 -4.7458 1 -0.95184 -4.6885 1 -0.25688 -4.6315 1 -1.5891 7.565 1 -5.7677 3.7102 1 -1.0834 -4.6134 1 3.1957 -2.1511 1 -0.018241 -4.6831 1 -5.9338 2.837 1 -5.9907 -0.5755 1 -4.7271 5.3952 1 -2.897 -4.2621 1 0.47157 -4.5755 1 -6.2341 2.1326 1 -4.4388 -3.1694 1 -2.6483 7.1807 1 -5.726 -1.4455 1 -6.1568 -0.40491 1 2.2571 -3.3413 1 -4.0679 6.0436 1 -2.1194 7.3201 1 -3.3799 6.5573 1 -4.197 5.8339 1 3.0276 -2.2194 1 -3.9462 -3.6179 1 -5.2039 4.7859 1 -4.6198 5.4341 1 2.8861 0.76578 1 -3.0231 6.9319 1 2.8218 -2.8335 1 -2.8724 -4.2436 1 -3.9295 6.1823 1 2.3698 -3.4424 1 -4.1927 6.0031 1 -6.2545 1.9851 1 2.8982 -2.606 1 -6.2413 0.23513 1 -6.27 0.9425 1 -0.53161 7.9212 1 -4.7209 -3.0555 1 -6.0107 2.463 1 0.93173 -4.4139 1 -6.1823 1.3601 1 2.2714 -3.5054 1 -1.0324 7.741 1 -0.58375 7.8081 1 1.6579 -3.8829 1 2.9001 -2.4986 1 -5.674 -1.5775 1 -4.9496 -2.659 1 2.9873 0.69848 1 -4.3279 -3.3906 1 -1.3433 7.5808 1 -2.1915 -4.5722 1 -6.0963 0.13299 1 -2.353 7.2748 1 -5.7501 -1.5378 1 -3.885 -3.7594 1 -6.2901 0.44681 1 3.214 -2.0341 1 2.9397 0.65879 1 -5.5431 -1.8092 1 -6.1994 1.6203 1 -3.1293 6.7009 1 -6.2386 1.1145 1 -5.9896 -0.75084 1 -5.4608 4.4609 1 3.3977 -0.21849 1 2.1834 1.5013 1 2.152 -3.6281 1 -5.5111 -1.9358 1 3.2567 -1.7672 1 -6.1302 2.5927 1 -2.5059 7.1524 1 1.8789 -3.8147 1 3.3049 -1.0396 1 
-4.602 -2.994 1 0.95192 -4.3575 1 -5.6499 3.798 1 -1.282 -4.756 1 -5.5084 4.3503 1 -0.066337 -4.5797 1 -2.1589 7.3053 1 2.8958 -2.7035 1 -5.8562 3.418 1 3.4431 -0.58257 1 -0.10927 8.0172 1 -5.9605 -0.4692 1 3.163 0.33222 1 -5.0862 4.8213 1 2.5286 1.2403 1 -1.4342 7.6504 1 -2.154 7.3785 1 -2.8986 -4.3834 1 -1.1953 -4.7598 1 -1.8133 7.4006 1 3.36 -0.72114 1 -4.0791 6.0075 1 -4.2142 -3.3836 1 0.15949 -4.6017 1 3.3754 -0.57769 1 3.4287 -1.356 1 3.3247 -1.1705 1 -3.6661 -3.7816 1 2.6744 1.1078 1 0.84009 -4.392 1 -3.3009 6.7279 1 -0.31977 7.9051 1 -4.5216 -3.1992 1 -0.35988 7.8863 1 -6.2713 1.3644 1 -6.0192 3.1203 1 -2.8101 6.9533 1 3.2133 -2.1276 1 0.21598 -4.559 1 -3.3028 6.6126 1 -2.2697 -4.4433 1 -5.4596 4.3848 1 -3.4816 6.5494 1 -1.0561 -4.7111 1 -0.13969 -4.7505 1 -3.3508 -3.9635 1 -3.6234 6.4307 1 1.4755 1.8387 1 -5.8838 -1.1636 1 -5.0427 4.8941 1 -3.5563 -3.9513 1 -0.48457 -4.7643 1 -2.276 7.3625 1 -0.96562 7.6426 1 -4.0564 6.0772 1 -4.5298 -3.2606 1 -5.9273 -1.0439 1 -2.6546 7.0087 1 2.1218 -3.5138 1 -5.5363 4.0746 1 -3.1884 -4.1543 1 -4.8921 5.1039 1 -6.196 2.0557 1 -3.8481 6.3307 1 -4.3378 5.7294 1 -5.0208 -2.7846 1 -5.8965 -1.0851 1 3.3137 -0.024915 1 3.3837 -0.22876 1 -4.6846 -3.0497 1 3.0696 -2.3043 1 -5.9582 3.1293 1 3.381 -0.70216 1 -5.7995 3.5909 1 0.42013 -4.4989 1 -6.2479 1.7108 1 -6.2452 1.1284 1 -3.4988 -4.0339 1 -1.2256 7.593 1 0.80094 1.9646 1 -5.5097 4.2706 1 -6.011 -0.44955 1 1.0144 -4.2568 1 1.7542 -3.8924 1 -5.514 -1.7172 1 -5.7481 3.6795 1 3.3633 -0.26338 1 -0.41615 7.7809 1 -2.7292 2.678 2 6.427 -1.5571 2 5.6794 1.9406 2 0.92956 -7.5603 2 5.5607 -4.2163 2 -0.381 4.6635 2 6.3861 -2.1147 2 -2.2577 3.5306 2 -2.1152 -1.2694 2 3.1563 -6.6707 2 -2.2423 3.3811 2 4.6982 -5.4378 2 -3.1782 0.9924 2 -0.1891 4.7885 2 -1.9537 3.854 2 3.3027 -6.5269 2 3.8727 4.129 2 6.5243 -0.31321 2 6.1538 -2.9673 2 2.4911 4.6972 2 2.0123 -7.1532 2 5.5601 -4.3073 2 3.3382 4.4703 2 -2.189 3.5106 2 6.1891 1.02 2 -1.1432 4.3208 2 5.0395 2.8845 2 6.3222 -2.5084 2 0.80556 
-7.6111 2 -1.3839 -1.5559 2 3.9052 -6.2384 2 6.0627 -2.9543 2 4.5834 -5.5003 2 5.5657 2.2871 2 4.7444 -5.3809 2 6.3834 -2.2209 2 6.3097 0.46719 2 5.9742 1.3019 2 -3.0716 2.026 2 5.8631 1.5587 2 1.6789 4.9181 2 5.1373 2.9149 2 2.0008 -7.2334 2 1.3575 -7.4904 2 6.4563 -0.7725 2 2.707 4.5995 2 4.5101 -5.5125 2 5.963 1.4898 2 -0.33966 4.7476 2 5.2664 -4.7287 2 1.3812 -7.437 2 4.5617 -5.4559 2 5.587 -4.3219 2 3.0474 -6.7959 2 6.5192 -0.63595 2 6.1111 0.81528 2 -1.8965 3.899 2 5.7883 2.0388 2 2.9268 4.5453 2 6.2632 -2.7394 2 6.4489 -0.2652 2 6.4614 -1.1897 2 0.096154 4.8726 2 6.3421 -1.5528 2 5.8443 1.6615 2 6.4979 -0.40846 2 5.0686 -5.0257 2 6.3139 -1.9026 2 6.0893 -2.8633 2 -1.6164 -1.4629 2 2.5612 -7.0908 2 3.1338 -6.8139 2 1.8996 4.8209 2 4.9691 3.2237 2 2.2112 -7.2106 2 6.4475 -0.66118 2 1.7005 4.9799 2 1.5638 -7.4116 2 2.3391 -7.1162 2 6.209 0.51713 2 5.1163 2.784 2 1.4049 -7.4508 2 -2.8691 2.5268 2 6.4108 -0.012729 2 -1.7573 3.7635 2 -3.2408 1.3344 2 -3.1559 1.8975 2 1.9652 -7.2147 2 -3.159 0.98852 2 6.3528 -1.7104 2 4.8506 3.0983 2 -2.7115 -0.68904 2 0.68375 -7.616 2 6.4043 -2.163 2 6.4885 -0.88946 2 5.7026 -3.8504 2 1.0582 -7.5153 2 3.2273 -6.6599 2 3.1174 4.3946 2 3.0125 4.4683 2 6.3448 -0.13963 2 3.6491 -6.4201 2 -2.1277 3.6382 2 6.4675 -1.8384 2 4.1719 3.9544 2 4.6025 3.3786 2 4.0099 -5.9592 2 1.8536 -7.2324 2 5.7609 -3.705 2 6.3335 -1.442 2 6.4775 -0.7453 2 6.3857 -1.8588 2 -1.6809 3.8924 2 0.70171 -7.7085 2 5.2946 2.7231 2 4.9089 3.2015 2 5.9017 -3.3883 2 -2.2728 -1.1649 2 2.9752 4.6285 2 1.6232 4.9563 2 -1.0096 4.5489 2 2.0346 -7.241 2 6.3326 0.76215 2 2.7749 -6.8243 2 -1.8511 3.8155 2 4.4025 3.7079 2 4.8452 -5.2712 2 -2.8393 -0.62294 2 0.78812 -7.6084 2 2.3085 -7.0831 2 6.3191 -1.9666 2 6.3994 -0.45463 2 4.981 -4.9138 2 -2.2605 3.4586 2 0.72022 4.9324 2 -3.1875 0.31121 2 -3.2375 0.98241 2 5.4447 -4.5013 2 -2.9895 1.8525 2 5.9411 1.4263 2 -0.9885 4.304 2 2.952 4.6063 2 2.4579 4.8084 2 -2.1874 3.4713 2 -1.3382 4.2388 2 3.6935 4.3011 2 -3.251 1.4542 2 6.4796 
-0.50248 2 -1.229 4.3682 2 1.6418 -7.4958 2 3.6038 -6.3414 2 -1.9447 3.6858 2 6.2058 -2.6473 2 1.3316 -7.5185 2 -3.126 0.82747 2 6.3842 -0.47861 2 -2.4151 3.2334 2 0.20408 -7.6577 2 0.62293 -7.716 2 3.2961 -6.4948 2 5.1814 2.8042 2 5.5743 2.2513 2 4.884 3.1278 2 2.0676 -7.1324 2 -2.6794 -0.68037 2 -2.9712 1.946 2 -1.2844 4.2258 2 2.5864 -6.9903 2 -3.19 1.3984 2 -2.0651 3.7256 2 2.7585 4.6294 2 5.3795 2.6451 2 6.0866 -3.357 2 4.0195 3.9389 2 3.1471 -6.5994 2 2.9439 4.4323 2 5.6417 -4.1828 2 -1.1651 -1.687 2 5.4889 2.552 2 1.8462 -7.352 2 5.2399 -4.6065 2 4.7348 -5.3232 2 6.4429 -0.42937 2 0.33706 -7.6951 2 3.3616 -6.5868 2 -2.4246 -0.99087 2 1.3711 4.8442 2 -2.7516 -0.67637 2 6.4497 -1.6127 2 6.3287 -2.4924 2 6.2181 0.61233 2 -1.9148 -1.4401 2 0.32903 4.8322 2 -1.0486 4.3948 2 1.8637 4.9261 2 5.946 1.6416 2 5.9782 -3.4719 2 4.3689 -5.8339 2 4.8701 -5.1862 2 -1.9648 3.7916 2 6.1463 -2.6648 2 2.493 -7.0228 2 5.8529 2.0451 2 0.14263 -7.7574 2 6.4427 -1.9957 2 4.5157 -5.6445 2 -2.8233 2.3889 2 1.4919 -7.3597 2 3.8798 4.0237 2 6.3055 -2.2512 2 -0.96533 4.3773 2 -2.9724 2.2722 2 0.0019711 4.8861 2 -2.8551 2.768 2 -2.533 3.1856 2 5.7456 2.1583 2 5.3981 2.519 2 5.5995 -4.2579 2 6.4312 -0.52667 2 2.4774 4.6862 2 4.7547 3.3431 2 -2.5939 -0.73218 2 1.2141 4.8501 2 1.6696 4.8471 2 -3.1081 0.27703 2 5.6124 2.4343 2 1.4382 4.9434 2 5.6154 2.4123 2 -2.1319 3.6001 2 4.9249 -5.107 2 0.050994 -7.7024 2 6.3555 -1.6063 2 -3.0791 2.1865 2 2.3547 4.7075 2 3.9816 -6.1968 2 1.0873 -7.5274 2 -2.6644 2.9582 2 2.0233 4.8139 2 -3.0489 2.1468 2 6.1675 1.2331 2 4.0886 3.9927 2 6.3816 -2.3946 2 6.3489 -2.3115 2 6.5316 -1.2236 2 1.9039 4.756 2 4.0596 3.8607 2 0.87403 4.8702 2 -3.1811 1.6199 2 -3.0779 0.47123 2 6.2672 -2.6707 2 6.0261 -3.0643 2 2.4272 -7.1491 2 -0.72962 4.6564 2 3.0473 4.4605 2 -1.992 3.6569 2 2.8542 -6.7082 2 1.9688 4.7551 2 4.8398 3.2717 2 6.5377 -0.83165 2 -2.7013 2.9882 2 3.4785 -6.5684 2 6.0918 -3.3086 2 4.1054 3.9046 2 4.2454 -5.8355 2 4.2297 -5.8555 2 6.1844 0.78824 2 
-3.1072 0.79734 2 1.9617 -7.3447 2 5.6462 -3.983 2 -2.0419 3.8145 2 -1.4388 -1.5715 2 -3.0998 1.9846 2 5.8739 -3.4762 2 2.162 -7.1465 2 -2.0515 3.7541 2 4.9595 3.1266 2 1.3112 4.9388 2 -3.0044 2.2341 2 -2.6397 2.9147 2 5.8411 1.9259 2 -0.95939 -1.7587 2 5.5488 2.1736 2 5.0181 3.107 2 3.9234 -6.0841 2 6.2705 0.39543 2 6.419 -1.8796 2 3.7692 -6.1608 2 -0.061244 4.8233 2 0.85094 4.8976 2 4.1503 -5.9354 2 -2.0968 3.7061 2 0.83894 -7.5353 2 0.43546 4.8267 2 -3.141 1.7122 2 5.2341 2.7274 2 1.1118 4.8246 2 4.5534 3.5619 2 -1.5183 -1.5607 2 5.792 1.7899 2 2.1424 -7.1191 2 -0.54672 -1.6555 2 3.5336 -6.2971 2 5.1978 -4.7685 2 6.4257 -1.3189 2 -0.9037 -1.7984 2 -2.3201 -1.0552 2 0.10409 4.8321 2 3.3775 -6.5115 2 4.9017 3.3316 2 5.0012 -4.9709 2 3.4367 4.3207 2 5.8751 -3.5501 2 -0.34362 4.7043 2 1.0941 4.9251 2 5.1286 -4.8403 2 -3.0528 1.647 2 4.0627 3.8055 2 0.78173 4.8959 2 6.2907 -2.5884 2 5.7415 1.8189 2 -2.5618 3.1089 2 -0.6622 4.6721 2 4.671 3.4053 2 4.7148 -5.3409 2 5.6775 1.9527 2 3.0686 -6.8194 2 0.01098 4.9114 2 -0.75893 4.4866 2 6.112 1.0048 2 -3.1864 0.45116 2 4.5362 3.5128 2 4.3289 -5.8774 2 1.7771 -7.4087 2 1.2887 -7.4498 2 6.2827 0.83476 2 -0.65794 -1.674 2 3.0035 4.4978 2 -2.0238 3.7158 2 5.9609 1.7045 2 6.3526 -2.3443 2 -0.9987 -1.6175 2 6.41 -1.6784 2 1.0806 4.8737 2 -2.6533 -0.69065 2 0.1997 4.8505 2 5.9783 1.3177 2 5.5211 2.5726 2 4.3105 -5.7338 2 4.1865 3.9145 2 1.5045 4.9257 2 6.2142 -2.7649 2 -3.1703 0.47809 2 3.5251 -6.3682 2 2.3893 4.6788 2 -1.5404 -1.5897 2 3.7329 4.0781 2 5.6696 -3.8517 2 6.3815 -0.799 2 -2.3442 3.2488 2 3.5217 -6.3678 2 5.0262 -5.1201 2 6.4588 0.10579 2 3.522 -6.4586 2 4.1355 -5.8115 2 4.9176 -5.1472 2 -0.57683 4.5974 2 6.2585 -2.0077 2 -3.077 -0.045501 2 2.4779 -7.094 2 1.2709 4.9418 2 6.0997 1.2546 2 6.5594 -0.69323 2 -2.0176 3.6406 2 4.1245 -5.9357 2 3.6482 4.162 2 3.3778 4.3628 2 5.5678 2.1632 2 -2.6549 -0.58156 2 5.2243 2.8431 2 0.11568 4.8433 2 0.67305 4.9357 2 4.8975 2.9977 2 5.5937 -4.0394 2 4.4049 3.7283 2 -0.80157 4.4785 2 
3.3968 4.3121 2 0.60669 -7.7112 2 3.5716 -6.4109 2 2.0704 4.8326 2 2.3665 -7.0215 2 0.64364 -7.7117 2 6.3317 0.14363 2 -1.8085 -1.5006 2 6.3736 -1.8954 2 4.0163 -5.9447 2 1.2159 -7.4892 2 -1.924 -1.4097 2 6.2921 -2.1773 2 5.3112 2.595 2 6.3487 -1.3399 2 -3.2277 0.95387 2 -0.53785 4.5734 2 2.1719 4.7195 2 5.0426 -4.8478 2 2.5985 4.6486 2 5.5778 2.289 2 2.9465 -6.7473 2 6.3913 -1.6915 2 -3.1084 0.22066 2 6.5278 -0.39557 2 5.2629 2.8711 2 0.28642 -7.7814 2 -2.8192 2.7832 2 5.9679 -3.221 2 4.0559 -6.0749 2 3.6671 4.1409 2 6.407 -1.1673 2 -1.2133 -1.6807 2 1.5211 4.9359 2 5.3294 -4.7386 2 6.168 -2.8356 2 0.23884 -7.6468 2 5.8236 1.9017 2 0.22286 -7.6584 2 -1.7286 4.0501 2 4.326 -5.7087 2 -0.67922 4.4918 2 5.4864 -4.3409 2 -0.92909 -1.6938 2 0.036188 4.8116 2 4.749 3.4099 2 4.3275 -5.7296 2 3.7241 -6.1536 2 -2.8217 2.7008 2 1.7299 4.8835 2 0.79821 -7.7292 2 5.9641 -3.2839 2 4.6373 -5.4592 2 6.5217 -0.28962 2 4.6218 -5.4654 2 6.4736 -1.6102 2 2.0935 -7.3214 2 6.2219 -2.6794 2 3.1983 -6.6586 2 1.9863 -7.3345 2 -1.6869 4.071 2 2.5985 -7.0221 2 6.1868 -2.8979 2 5.9955 -3.1368 2 6.2734 -2.2232 2 6.4181 -0.037133 2 2.9992 -6.7782 2 6.233 -2.8156 2 6.3801 -1.203 2 3.7793 -6.1845 2 1.4727 -7.484 2 -2.3906 3.3158 2 -2.5766 -0.76457 2 5.9565 1.6696 2 6.4291 -1.5225 2 0.27945 -7.8163 2 2.166 4.7129 2 5.5025 2.3945 2 3.852 -6.1221 2 4.3784 3.6777 2 5.6634 2.0806 2 4.3991 3.72 2 2.7868 -6.89 2 2.6342 -6.9544 2 -0.8654 4.368 2 6.0907 1.249 2 5.8933 -3.5383 2 4.1241 -5.9865 2 -2.5399 3.1338 2 6.0161 1.4765 2 6.2036 1.1623 2 4.9095 3.1261 2 6.4614 -1.1868 2 6.1981 0.7751 2 -2.7975 -0.46842 2 6.4708 -0.26954 2 -3.2166 0.34857 2 6.4273 0.32967 2 6.4009 0.43211 2 2.8121 -6.8849 2 -2.7374 -0.64882 2 6.1106 1.0906 2 -3.068 1.8386 2 1.1573 -7.6612 2 3.5393 -6.3337 2 6.2485 -2.7311 2 2.5691 4.5682 2 -3.0965 0.3 2 5.7621 1.9902 2 -0.56815 4.5058 2 1.5434 -7.4152 2 6.2456 -2.5883 2 3.8803 4.0058 2 -0.55827 4.5039 2 4.4997 3.666 2 0.99704 1.1724 -1 4.1822 -2.3831 -1 -2.1785 -1.2334 -1 0.69431 
7.4779 -1 5.9458 4.7189 -1 5.9262 7.9366 -1 -5.7178 -7.4405 -1 3.0562 -7.0185 -1 -2.9601 -0.10538 -1 5.8036 4.6519 -1 1.7255 -4.3812 -1 -1.029 -3.1007 -1 -2.2978 7.6748 -1 -0.39582 -3.1281 -1 3.111 -7.2715 -1 4.5375 -2.682 -1 -4.3156 -2.0144 -1 1.7443 1.4558 -1 1.4728 -5.119 -1 5.2588 -6.8491 -1 3.7378 -5.8209 -1 -6.2272 3.1231 -1 4.094 -1.1167 -1 -5.1932 3.6454 -1 6.0181 2.3473 -1 -6.1718 2.0509 -1 -3.3875 -5.644 -1 0.6686 -2.2634 -1 2.0191 6.5422 -1 -3.8391 -6.6635 -1 1.7566 -3.3627 -1 0.60887 7.4868 -1 5.9015 0.56615 -1 -3.4838 1.9722 -1 -1.5884 -1.6991 -1 -1.2795 -5.9099 -1 -4.1946 1.8443 -1 -0.29807 -1.0464 -1 -0.81015 4.1765 -1 0.50489 -7.3674 -1 0.38651 7.9087 -1 -1.8725 -3.107 -1 -1.6886 -5.5213 -1 2.2159 7.4489 -1 6.1353 0.12555 -1 -0.99317 7.6643 -1 4.9855 5.5923 -1 -3.0679 7.9675 -1 6.3441 2.5904 -1 2.0811 -0.69084 -1 -3.535 -4.1746 -1 -3.5312 -4.2606 -1 -4.79 7.0115 -1 -4.1976 1.4007 -1 -1.5816 2.8529 -1 -6.2717 3.3302 -1 4.213 2.6279 -1 -0.41696 7.7572 -1 -3.8167 -3.4963 -1 3.9224 -6.3023 -1 -1.9521 1.791 -1 -5.0652 -4.0883 -1 5.2947 6.2156 -1 3.7967 -5.6004 -1 6.4926 6.4649 -1 5.7136 -7.4243 -1 0.81967 0.3283 -1 -2.9895 -0.00085951 -1 0.15954 5.9809 -1 -0.33288 -2.5584 -1 -4.2483 6.1688 -1 5.2115 -5.1639 -1 -3.2276 -4.5736 -1 2.5012 -5.6455 -1 0.89988 -0.24351 -1 4.9861 6.2819 -1 -1.9625 -5.6476 -1 -5.2149 7.8121 -1 3.1152 4.2004 -1 -6.2495 7.3756 -1 -1.9432 -7.2925 -1 -0.65084 -0.018711 -1 -4.0715 -4.323 -1 -0.53212 -3.5106 -1 3.933 5.2171 -1 -5.1819 -6.2939 -1 -0.98123 -7.1683 -1 1.1532 -4.7665 -1 5.3666 2.8373 -1 -6.181 1.9162 -1 3.5842 0.20032 -1 1.8329 3.8181 -1 3.4293 2.4644 -1 2.9413 4.0821 -1 2.4294 3.5826 -1 0.060931 -6.7637 -1 3.3046 7.0353 -1 3.6781 -4.2435 -1 -3.5133 3.5795 -1 1.5065 5.9587 -1 3.6982 -4.6055 -1 -0.052565 -2.7876 -1 5.8367 -3.1035 -1 1.0768 -7.1842 -1 -5.8994 5.6035 -1 2.8926 -6.6268 -1 2.584 -3.2516 -1 5.0489 -7.1636 -1 -3.8496 6.3882 -1 5.3396 -4.6697 -1 -5.442 6.8484 -1 -5.3506 6.4377 -1 -0.94144 7.9699 -1 2.672 5.7087 
-1 -5.2121 3.4423 -1 -4.8627 -1.2793 -1 4.579 -7.5973 -1 -1.371 1.0155 -1 -2.8194 5.4605 -1 -0.071015 -6.3005 -1 3.4621 -6.4566 -1 -1.5589 4.268 -1 0.83669 1.8137 -1 -4.9696 -7.4708 -1 6.3262 -1.9484 -1 3.3632 -7.1643 -1 -5.4842 -4.819 -1 6.3909 -7.7097 -1 -0.85769 5.5245 -1 -5.9393 7.5945 -1 -5.0688 -1.5275 -1 -2.0687 2.8432 -1 2.3228 3.5423 -1 5.2122 6.4419 -1 5.0437 -2.8339 -1 4.9325 3.8192 -1 1.5656 -4.3287 -1 -0.18476 -1.6329 -1 -5.8413 -5.0362 -1 -4.5102 5.595 -1 4.5549 -0.8138 -1 -6.0414 5.0219 -1 -3.627 -2.0888 -1 2.115 7.7527 -1 -4.2277 -2.239 -1 4.1579 -6.7695 -1 3.7898 0.30789 -1 -1.4327 5.3565 -1 -5.321 3.4747 -1 2.9305 -4.3483 -1 -0.65996 7.4931 -1 5.7946 5.2348 -1 0.71088 -3.4582 -1 2.514 0.80654 -1 -3.8192 1.7883 -1 2.4929 -4.2774 -1 6.4036 1.7112 -1 -5.3291 5.499 -1 0.30721 -0.5677 -1 -5.4762 0.93261 -1 2.4093 2.977 -1 -1.8227 -3.0121 -1 1.8712 6.9915 -1 -1.6002 5.9287 -1 -5.0192 1.8056 -1 -3.5259 -6.7022 -1 6.2082 2.0681 -1 1.8873 -5.8954 -1 -2.76 -0.7612 -1 -2.9446 3.4073 -1 0.20439 -5.1879 -1 1.0054 5.3538 -1 3.0021 4.8663 -1 -3.1108 2.0954 -1 6.0641 6.7557 -1 -4.0038 6.3951 -1 2.4967 5.9365 -1 3.3355 -6.9662 -1 -4.1238 7.3905 -1 5.4149 -0.027801 -1 -3.1404 -3.3035 -1 0.87371 -0.91466 -1 -5.0495 4.1747 -1 -4.5955 -5.998 -1 2.2079 0.90491 -1 6.3577 7.0719 -1 -5.0772 1.2271 -1 3.1964 -0.33412 -1 -0.67638 -2.1462 -1 5.0756 -0.61201 -1 -1.7577 4.2782 -1 0.083412 6.5329 -1 -2.5483 0.5376 -1 6.1629 -2.9882 -1 3.771 0.29035 -1 -0.030158 6.6956 -1 -3.2344 4.4634 -1 1.2983 4.2072 -1 6.3666 4.9412 -1 1.3649 4.8555 -1 -1.2182 0.7369 -1 2.026 -2.5166 -1 5.9935 -7.1565 -1 -4.7681 -6.3109 -1 5.3053 -5.3231 -1 0.97327 6.2888 -1 -3.1591 0.051957 -1 -1.0545 -2.9558 -1 -5.7917 2.8668 -1 -5.666 3.1302 -1 3.1989 -7.4469 -1 2.3884 7.9023 -1 2.6722 -4.6645 -1 1.5957 -4.4956 -1 -0.76716 -5.4389 -1 -2.0792 6.6362 -1 4.0531 -4.8692 -1 4.9185 1.1542 -1 3.8435 0.35499 -1 5.601 6.4115 -1 4.4642 6.4834 -1 -0.59422 6.7979 -1 -1.3794 -6.5383 -1 -3.7656 4.9034 -1 0.24239 
-7.5053 -1 -5.6226 6.4793 -1 2.9967 5.7301 -1 -2.8845 -4.9609 -1 1.6857 -1.3776 -1 -0.061039 -6.8821 -1 -1.7615 -6.323 -1 0.73737 0.66436 -1 4.0347 3.5063 -1 2.9222 -4.4931 -1 -0.92616 0.67313 -1 2.5323 3.0703 -1 -0.41022 6.8608 -1 1.3276 6.1547 -1 -2.7322 2.3895 -1 4.9463 5.5359 -1 4.7218 -0.60724 -1 1.4137 -3.0039 -1 -2.1774 -4.2412 -1 -4.1458 3.2823 -1 3.609 3.0428 -1 -5.9599 -7.427 -1 -3.6884 2.5647 -1 -3.2222 6.0661 -1 -0.67926 6.8599 -1 -4.5158 4.3019 -1 -0.92901 -3.9958 -1 -3.7309 7.9855 -1 3.8052 -2.7865 -1 -0.58681 0.71993 -1 3.6634 7.7778 -1 2.8205 -4.1545 -1 5.1298 -2.5519 -1 5.6018 -6.8103 -1 -1.2913 4.8921 -1 1.3781 0.71746 -1 -2.4163 -4.3378 -1 5.5181 7.4863 -1 -3.2052 -4.9288 -1 4.8516 4.1756 -1 -4.5146 6.2547 -1 3.3334 -1.2291 -1 4.6495 5.4796 -1 2.8658 0.35381 -1 -4.8023 -3.2821 -1 -1.9408 5.8408 -1 2.1486 -4.3912 -1 -4.661 6.4978 -1 -2.1517 -6.3523 -1 2.1335 6.5482 -1 -0.41094 1.4984 -1 0.8161 -2.9263 -1 -3.3417 0.18804 -1 -2.0546 7.0922 -1 5.0153 6.8427 -1 3.7475 2.4096 -1 -0.95336 -6.4451 -1 -1.7757 7.1045 -1 5.3302 -2.9651 -1 3.2545 -2.3742 -1 4.8409 -1.274 -1 3.7683 -1.4661 -1 1.6694 6.1417 -1 -5.7319 -2.5869 -1 6.5222 0.5423 -1 -5.0951 -5.5717 -1 3.2518 6.8628 -1 -1.9046 3.7882 -1 -5.5292 5.8553 -1 -3.2266 7.7296 -1 1.3355 -5.6115 -1 -3.8097 6.1073 -1 3.7864 6.5807 -1 4.575 -3.234 -1 -2.7285 1.0933 -1 0.17763 -0.47547 -1 -5.3179 7.0361 -1 0.78796 -7.4162 -1 4.7834 5.5481 -1 -5.335 -7.4817 -1 3.2472 0.16747 -1 5.1979 -5.2918 -1 4.9531 -2.7009 -1 0.76115 -1.2127 -1 3.3223 5.3327 -1 -5.2735 5.2985 -1 -2.7887 -4.7701 -1 3.1233 -7.3721 -1 1.4966 -7.1718 -1 -3.4529 -2.7676 -1 6.5528 -1.859 -1 6.2053 4.9028 -1 3.4933 7.3344 -1 -3.0983 4.5351 -1 1.0965 1.7329 -1 1.4756 4.7751 -1 -1.9237 -0.018649 -1 -3.8355 0.32676 -1 5.5789 -3.3192 -1 -2.4511 -0.67166 -1 5.8085 6.0645 -1 -3.2578 -4.0852 -1 -2.9998 0.13506 -1 -3.0915 1.3737 -1 -4.1625 -3.009 -1 -5.5275 -5.7139 -1 4.6112 -1.8252 -1 -2.1982 -4.5277 -1 -4.5409 1.4609 -1 -5.2343 -3.6335 -1 6.3133 -5.0417 
-1 -5.3859 -0.23118 -1 6.3354 -4.479 -1 -4.7967 0.72734 -1 -2.966 -6.2462 -1 1.5387 -0.65715 -1 -2.6625 7.4906 -1 -2.4973 3.6248 -1 4.3438 0.17437 -1 1.9394 -5.4485 -1 -3.8528 -1.8972 -1 1.1483 -6.8406 -1 4.4759 2.4832 -1 -4.3747 0.12029 -1 3.7233 -5.2136 -1 2.3754 -1.4229 -1 -1.0381 3.9534 -1 -0.18974 -1.3719 -1 1.3579 5.2827 -1 5.1344 6.9134 -1 0.55913 0.64429 -1 -4.818 -1.0339 -1 -6.1389 -4.66 -1 -0.54392 1.5351 -1 4.9994 -0.706 -1 4.9898 -6.5334 -1 -5.6496 -2.104 -1 3.6242 -3.6928 -1 4.464 -1.6731 -1 3.3974 0.16034 -1 -0.94691 4.7007 -1 -3.7249 -1.5431 -1 -1.6193 7.6096 -1 -1.9751 -2.2865 -1 5.9782 -2.2816 -1 -5.499 4.7471 -1 4.4027 -6.5977 -1 -0.48006 -4.0227 -1 -4.3616 -3.1293 -1 2.8308 7.8459 -1 3.5673 -0.60188 -1 6.167 2.3977 -1 6.2279 7.326 -1 -6.2254 2.6421 -1 3.3556 3.0495 -1 -3.8513 -0.50673 -1 -2.1658 2.2508 -1 1.2456 -6.8903 -1 -3.3019 6.1641 -1 0.070979 -1.5406 -1 -3.2609 7.5648 -1 2.9468 2.4309 -1 -0.56474 -2.3226 -1 -0.27054 -3.1012 -1 -2.485 6.2854 -1 -0.98761 1.3893 -1 -6.1605 -5.3819 -1 4.3137 -7.62 -1 -2.3717 7.4464 -1 -2.677 0.32133 -1 0.8667 6.1086 -1 3.2881 7.6549 -1 5.949 -5.6174 -1 3.9371 -5.0215 -1 -4.8079 -5.5373 -1 0.46141 -3.8188 -1 5.8141 -5.019 -1 1.8518 -4.5696 -1 4.4702 -3.7805 -1 -1.1937 -0.39054 -1 -1.7887 3.3432 -1 -3.6752 -3.8158 -1 5.8888 -2.869 -1 4.7502 -2.2679 -1 -0.61324 2.1134 -1 -3.9465 -6.8551 -1 1.1924 4.1245 -1 -0.15488 0.57528 -1 -1.2941 3.8053 -1 0.28874 -0.43841 -1 1.5294 7.9929 -1 2.1688 6.9842 -1 -2.1996 0.24757 -1 0.11939 0.22495 -1 5.6391 4.5312 -1 -4.4355 -3.8229 -1 3.6499 6.6755 -1 1.3787 -6.4042 -1 -1.6911 1.0304 -1 -0.39597 -3.4163 -1 1.5339 2.5518 -1 -0.66757 -1.787 -1 -6.0168 -0.85973 -1 0.72782 -7.0758 -1 4.0202 0.61993 -1 0.84488 3.7026 -1 -1.183 0.8396 -1 2.7949 -5.8993 -1 2.3311 7.1993 -1 2.0495 4.09 -1 -5.3304 -3.0831 -1 -4.2442 -3.5328 -1 0.79558 -5.2189 -1 4.0256 5.9052 -1 -5.5057 -6.5451 -1 -2.0786 6.6513 -1 3.1442 1.138 -1 2.975 5.1619 -1 -3.3218 4.8784 -1 0.92004 4.6088 -1 4.145 7.6729 -1 
-4.8376 4.783 -1 5.1885 0.32069 -1 4.8452 -4.8085 -1 1.4223 -4.4354 -1 3.7774 -4.2538 -1 2.5299 7.9499 -1 3.7285 6.4302 -1 1.5697 5.617 -1 -4.161 0.40129 -1 3.0207 -7.5248 -1 -3.7894 -6.901 -1 1.4762 -7.7831 -1 4.7934 -0.33329 -1 -3.6731 -5.158 -1 -0.20239 4.394 -1 4.8309 -7.8099 -1 -3.2922 -2.5616 -1 5.7604 5.5706 -1 -1.9064 -4.3283 -1 -5.2855 -4.5455 -1 4.6664 7.3931 -1 3.8793 4.1104 -1 1.1225 -0.44183 -1 -4.2652 -2.9422 -1 3.1047 5.2933 -1 1.8533 -0.23408 -1 3.8429 6.6068 -1 1.9521 6.6885 -1 4.3278 7.3079 -1 -5.8087 6.2181 -1 3.6731 -5.3896 -1 4.3802 -1.8335 -1 -1.7062 7.662 -1 -4.3473 7.8695 -1 -5.9378 -4.1413 -1 -5.0565 -4.3832 -1 -2.3798 2.1034 -1 6.1812 -5.3513 -1 4.7768 -2.0083 -1 -5.654 1.1851 -1 -1.4824 -4.7542 -1 -3.9498 4.9372 -1 3.5556 -3.7813 -1 -6.1515 6.6724 -1 4.017 6.2756 -1 5.6379 5.6121 -1 4.4176 -2.9743 -1 -5.1328 -3.6111 -1 3.3823 -7.4463 -1 -0.21121 0.65504 -1 -2.6382 -5.5732 -1 -0.79968 6.8647 -1 -0.86721 -2.6187 -1 -0.23394 -6.8208 -1 1.2919 7.064 -1 ================================================ FILE: data_src/data_DBCV/dataset_4.txt ================================================ 340.080593000166 401.306241000071 1 333.985499000177 395.070042999927 1 335.612031000201 392.773647000082 1 345.092862000223 391.974363999907 1 330.569323000032 392.169848000165 1 339.312822999898 389.298434999771 1 339.686031000223 398.61877400009 1 343.316994000226 400.977901000064 1 333.065419999883 396.446630000137 1 342.511708000209 398.544327999931 1 340.766671999823 400.656415000092 1 337.277087999973 398.673247999977 1 340.923070999794 396.419065000024 1 341.683007000014 390.835886999965 1 342.708281000145 391.703137999866 1 343.201282000169 393.605033 1 341.791383000091 397.986231000163 1 331.578933999874 391.320538000204 1 342.348366000224 393.762180000078 1 338.727878999896 396.267293999903 1 334.738222999964 395.718766000122 1 330.975653999951 394.57460599998 1 344.716322999913 392.09760100022 1 341.841244999785 395.915595000144 1 337.327810999937 
394.15234699985 1 337.534911999945 389.968692000024 1 342.663162999786 398.431352999993 1 331.561238999944 391.729294999968 1 334.40383999981 401.985080000013 1 332.629104000051 391.597103000153 1 335.299560000189 396.55072600022 1 336.396044999827 390.704500999767 1 342.243536999915 401.898397999816 1 339.051363999955 401.802294999827 1 343.168777000159 393.770492999814 1 339.352742000017 396.219544999767 1 340.294815000147 397.91381700011 1 335.593735000119 396.100544000044 1 331.089887000155 399.788712000009 1 337.715063000098 399.279529999942 1 335.798527000006 390.623579999898 1 342.880836999975 390.597740999889 1 343.378837999888 392.276885000058 1 336.571469000075 388.70158400014 1 338.360745999962 393.431067000143 1 332.635257999878 388.442526000086 1 343.189617000055 397.453784999903 1 332.511510000098 399.986099000089 1 336.477483999915 397.836385000031 1 332.235915000085 389.441686999984 1 275.564732000232 392.559510999825 2 271.378120000008 404.168049000204 2 275.536663000006 391.976090000011 2 277.958006000146 397.734982999973 2 281.468803000171 397.988183000125 2 276.999392999802 389.689679999836 2 277.532668000087 400.254569999874 2 283.876598999836 398.569484000094 2 275.246861999854 393.905619999859 2 280.570826000068 400.053100000136 2 284.972879000008 389.935608999804 2 272.625878999941 403.725992999971 2 284.073303999845 403.610801999923 2 275.946336000226 400.726569000166 2 275.631306000054 400.140395000111 2 271.080223999918 394.248447000049 2 271.56240000017 400.297164000105 2 273.661956999917 402.378101999871 2 281.391483999789 400.420233999845 2 273.427796999924 398.024796000216 2 282.834764000028 403.955949999858 2 275.17954300018 398.659824999981 2 283.498360000085 390.48953700019 2 271.657103999984 403.527952000033 2 282.961077999789 398.744140000083 2 284.618311999831 392.174819000065 2 279.342093000188 404.362902000081 2 270.902385999914 403.879670999944 2 280.118470000103 404.34404100012 2 279.850511000026 399.025212000124 2 
282.69835200021 416.991049999837 2 284.884364999831 416.506583000068 2 278.547925000079 417.136450000107 2 272.391222000122 406.802037999965 2 277.926847000141 405.936453999951 2 275.040202999953 414.701766999904 2 280.756792999804 414.884978000075 2 276.10025600018 408.918399999849 2 279.437752000056 405.916401000228 2 272.477855999954 415.77772800019 2 284.624309999868 417.78525599977 2 280.52478600014 426.589997000061 2 286.989403000101 414.205672999844 2 274.815353000071 425.063535000198 2 285.541612000205 424.086339999922 2 275.997407999821 417.256587000098 2 278.908342999872 421.349551000167 2 275.412847999949 425.442823000252 2 287.055484999903 418.244719999842 2 277.670307000168 416.050514000002 2 280.128748999909 438.741917999927 2 280.342310000211 429.522609000094 2 284.103120999876 430.363543000072 2 294.341947000008 441.109651000239 2 287.621638000011 428.60290400032 2 288.48249799991 431.23475799989 2 285.283555000089 436.722258999944 2 286.409955000039 430.631586000323 2 289.852479999885 432.494152000174 2 292.558714999817 429.39255800005 2 300.289115000051 448.096348000225 2 301.998831000179 447.046858000103 2 296.696620999835 444.077189000323 2 292.593574999832 438.624294999987 2 295.824153999798 449.513397000264 2 294.305875000078 444.282610999886 2 303.000893000048 438.451207000297 2 294.69386699982 446.034585000016 2 295.395599999931 447.633481000084 2 298.359889999963 443.035277000163 2 298.105407999828 452.848544000182 2 298.494349000044 456.162745000329 2 307.769902999979 462.346611000132 2 297.570236000232 454.607466999907 2 308.14916000003 448.817819999997 2 306.467536999844 448.400144000072 2 300.060355999973 448.536395000294 2 298.264582999982 448.651820000261 2 301.684843999799 455.66270800028 2 308.0738309999 448.978669000324 2 310.796959000174 454.082526999991 2 312.118670000229 452.805322000291 2 311.059359000064 466.771413000301 2 318.166869999841 457.615817000158 2 314.176175000146 458.207098999992 2 316.558571000118 464.461933000013 
2 307.646118999925 453.736591999885 2 317.605855000205 460.297666999977 2 310.377187999897 459.180017000064 2 311.657558000181 459.881522000302 2 328.125591999851 465.369022000115 2 331.674118999857 469.407306999899 2 318.059061000124 459.695964999963 2 327.572842000052 459.722072999924 2 325.065386999864 468.653207000345 2 323.394032999873 459.820330000017 2 327.346989999991 469.411017999984 2 323.62447200017 459.327498000115 2 327.730142999906 471.821191000286 2 320.339424000122 459.946117000189 2 336.860669000074 471.016857000068 2 329.326169000007 473.654308999889 2 336.970894000027 465.487050999887 2 328.943793999963 460.347212000284 2 341.481104999781 468.662958000321 2 331.426523000002 461.136917999946 2 331.442586999852 459.671700000297 2 340.471007000189 469.628787999973 2 334.635424000211 474.328426000196 2 334.673630000092 468.230307999998 2 345.601218000054 472.548501000274 2 340.310949999839 464.94068400003 2 341.861144000199 466.057520000264 2 342.321235999931 474.119177999906 2 339.784878999926 472.528696999885 2 348.661588000134 468.208697000053 2 344.018592999782 464.765122000128 2 346.396873999853 465.233286000323 2 343.950104000047 462.03741999995 2 344.375392000191 468.694385000039 2 281.591250000056 390.116572000086 2 274.038128000218 390.173677999992 2 274.319044000003 388.317232000176 2 281.458730000071 392.840789000038 2 277.684638000093 395.224723999854 2 282.037909999955 390.391964999959 2 277.54200100014 388.628529000096 2 277.781628999859 381.780561999884 2 275.458581999876 382.77025000006 2 278.33758000005 383.239541999996 2 270.407542999834 375.868768000044 2 276.533646999858 380.289565000217 2 276.392326000147 372.586655000225 2 277.435391000006 374.823960999958 2 281.0096450001 375.529147000052 2 270.199572999962 375.50949899992 2 280.601464000065 379.713717000093 2 272.352343999781 380.397797000129 2 272.74782600021 371.799476999789 2 279.051520000212 385.182934000157 2 276.032953999937 373.474593000021 2 276.04079400003 
371.569041000213 2 283.93596199993 372.577940999996 2 282.358872999903 376.170775000006 2 273.135604999959 374.693434000015 2 275.436521000229 366.388801000081 2 284.604973000009 361.595015000086 2 279.457003000192 364.791898999829 2 279.324618000071 373.007619999815 2 283.907215999905 372.13773099985 2 286.724506000057 362.70857599983 2 287.665115000214 356.211618000176 2 285.225418999791 365.002741999924 2 285.955506999977 367.179124999791 2 292.162618999835 357.082713000011 2 283.527553999797 355.450588999782 2 290.19073599996 354.795212000143 2 281.79827499995 364.437746000011 2 286.016962000169 364.524929000065 2 285.490943999961 356.040585000068 2 289.584323999938 353.874623999931 2 292.188362999819 354.931929999962 2 290.921076999977 351.638449999969 2 295.706486000214 347.77908599982 2 293.876329000108 344.007606999949 2 287.630673000123 356.822922000196 2 292.76003200002 346.940886000171 2 291.874253999908 347.710423999932 2 284.013354999945 351.794995999895 2 292.967625999823 361.836296000052 2 280.614341000095 362.254129999783 2 283.180738000199 359.081335000228 2 288.563686999958 352.996989000123 2 290.206956000067 359.305296000093 2 285.380559999961 360.669691000134 2 280.855977000203 352.008268999867 2 287.071576999966 356.453741999809 2 289.397274999879 358.707884000149 2 292.32484500017 350.750994999893 2 297.766363000032 343.718977000099 2 295.469016999938 352.890833000187 2 286.694364999887 346.289671999868 2 294.82860499993 353.694565999787 2 286.33236300014 341.491194999777 2 293.489422000013 343.918444999959 2 293.222575999796 343.745008000173 2 289.931313999929 341.977434999775 2 292.469142999966 345.385646999814 2 303.328251000028 344.780511000194 2 304.068678000011 340.544828999788 2 293.718036000151 338.427188999951 2 303.741065999959 343.235077999998 2 307.100600000005 336.871704000048 2 307.09996900009 341.692346999887 2 302.759434999898 337.83112199977 2 301.137457000092 346.819325999822 2 298.115906999912 348.274538999889 2 
301.878403000068 342.889849000145 2 313.32788700005 344.940977000166 2 301.189420000184 346.244235000107 2 313.613148000091 341.952339000069 2 310.804626999889 345.506359000225 2 307.120536999777 342.781421999913 2 307.54880299978 333.738884000108 2 309.184045999777 341.367395999841 2 309.930445999838 341.127594000194 2 301.483397000004 340.018436999992 2 314.124737999868 336.966264000162 2 316.484484000131 330.42635500012 2 312.664367000107 342.642039000057 2 316.852740999777 328.654779999983 2 313.073115000036 333.346878999844 2 317.070995000191 341.424494999927 2 322.053408000153 338.632120999973 2 308.055980000179 337.719942000229 2 318.092466000002 332.764543000143 2 316.209414000157 340.604559999891 2 321.762986999936 332.181195999961 2 324.187342999969 336.267490999773 2 330.833409999963 335.663953999989 2 337.203918999992 347.113196999766 2 325.619167000055 343.377601000015 2 336.854183999822 335.616034000181 2 325.100755999796 342.157753999811 2 331.17795300018 337.049343999941 2 326.799414000008 333.708540000021 2 331.353066999931 335.78148799995 2 323.835491000209 346.367407999933 2 328.04509499995 335.579245999921 2 333.326241999865 339.170746000018 2 330.170179999899 334.505876999814 2 330.410356000066 332.77614099998 2 328.951834000181 326.862077000085 2 330.768906999845 332.976048000157 2 339.536853000056 333.274954999797 2 330.71094099991 329.752326999791 2 340.630439999979 340.895986999851 2 342.341614999808 332.016452000011 2 333.59476199979 332.964519000147 2 336.322652000003 330.620709000155 2 335.530784999952 334.798208000138 2 332.734618999995 334.91174999997 2 341.056561000179 333.587681999896 2 345.079256999772 329.809594000224 2 333.05690599978 328.164487000089 2 344.170750999823 340.133692999836 2 339.232667999808 339.78389800014 2 336.185130999889 331.491934999824 2 288.439141000155 429.814830000047 2 277.240346000064 421.534643999767 2 288.705926000141 432.914245000109 2 285.61660300009 420.086494999938 2 284.781975000165 
423.352613999974 2 289.356124000158 431.610005999915 2 276.669199999887 428.340828000102 2 282.552151000127 423.650437999982 2 282.833926999941 433.334122000262 2 275.244181999937 423.181933000218 2 295.111847000197 440.70355800027 2 288.956797000021 441.54499100009 2 294.225846999791 446.283021000214 2 285.297542000189 443.724222999997 2 288.548696999904 435.404792000074 2 286.853298999835 445.89872100018 2 284.027945999987 445.659014000092 2 290.283964000177 445.466033000033 2 296.440758000128 448.085398999974 2 292.013282000087 440.945854000282 2 293.315739000216 454.471840000246 2 291.641048000194 451.673665999901 2 304.441722999793 446.362951000221 2 291.323487999849 453.85565600032 2 304.139917000197 446.935756999999 2 302.618131999858 444.201373000164 2 295.255117000081 447.44844800001 2 293.073137000203 444.524476000108 2 297.196814000141 447.21465400001 2 290.564933999907 451.969909999985 2 316.160480999853 462.161685000174 2 313.438000000082 447.888195000123 2 315.721752999816 451.953838000074 2 315.141911000013 447.759120000061 2 310.845393000171 456.790971000213 2 309.446165000089 449.9552480001 2 310.442799999844 455.088514000177 2 315.661547999829 459.732952999882 2 315.430583000183 460.020981000271 2 310.377650000155 460.583492999896 2 337.770816999953 464.291306000203 2 326.218851999845 469.661887000315 2 339.849760000128 473.900303000119 2 330.132836999837 465.409437000286 2 333.963119000196 466.885577999987 2 328.408026000019 463.216091000009 2 338.064695999958 470.464205000084 2 331.278638999909 464.062855000142 2 337.738963999785 473.016298000235 2 331.581642999779 473.827144999988 2 278.063279999886 382.482284000143 2 283.170090000145 379.762041000184 2 284.882455999963 377.243555999827 2 286.762548999861 376.382792999968 2 285.977713000029 375.200914000161 2 282.663991999812 379.152185999788 2 282.158846000209 377.099086000118 2 284.432273000013 372.623476999812 2 280.196934000123 375.145907999948 2 286.962050000206 381.40418299986 2 
310.786869000178 347.103199000005 2 311.337789000012 346.479408999905 2 310.680124000181 341.763600999955 2 306.215865000151 342.121199000161 2 302.647224000189 334.363262999803 2 304.464583999943 337.528198999818 2 302.930536999833 338.498699000105 2 315.309626000002 339.280749999918 2 311.334650999866 345.365472000092 2 307.930645999964 339.443221000023 2 316.644803000148 331.834722000174 2 329.77025000006 340.136498000007 2 330.69882800011 335.393705999944 2 321.975273999851 338.384099000134 2 316.734904999845 330.654447999783 2 324.408317999914 332.823224999942 2 330.077132000122 333.456172999926 2 321.82465799991 338.670402999967 2 325.928280000109 340.108022000175 2 327.313602000009 336.05405799998 2 331.96966700023 340.252166000195 2 339.144123999868 341.612036999781 2 328.344622999895 337.307053999975 2 335.398107000161 336.816229999997 2 329.37205900019 339.833136000205 2 339.925007999875 341.027007000055 2 330.932599999942 334.75419700006 2 327.721384000033 334.934183999896 2 337.464132000227 329.19598999992 2 337.832541999873 338.690866000019 2 428.810889000073 499.729580000043 3 425.393627000041 493.410670000128 3 424.475703999866 500.755513000302 3 415.510985999834 491.603397000115 3 419.541222000029 490.648371000309 3 428.113410000224 492.412991000339 3 425.608328999951 488.619296000339 3 427.317301000003 498.1468410003 3 427.943053999916 498.07325399993 3 428.672300999984 496.934424000327 3 433.372045000084 502.141772999894 3 435.874388999771 501.390268000308 3 430.950474999845 491.318884999957 3 437.752884999849 492.625738000032 3 439.392155999783 502.221375999972 3 436.655656999908 496.187112000305 3 440.452285999898 491.369898000266 3 434.746292000171 501.172223000322 3 440.312845000066 493.24883500021 3 438.881428000052 495.655058000237 3 438.524873000104 493.75810000021 3 441.167762000114 491.661625999957 3 443.836159999948 496.010120999999 3 448.038234999869 500.792133000214 3 440.090555999894 493.597029000055 3 448.841490000021 
492.925735000055 3 439.478817000054 497.346042000223 3 446.786419000011 499.043122000061 3 438.221760000102 494.281751000322 3 437.356751000043 502.442203999963 3 449.492473000195 492.53163299989 3 459.745124999899 489.286241000053 3 447.223060999997 501.440026999917 3 447.245114000048 487.333358000033 3 450.467995000072 496.733677000273 3 458.580765999854 489.658048000187 3 453.71749900002 491.682899000123 3 450.043326999992 500.339963000268 3 455.143877000082 496.32339300029 3 453.2182169999 496.382433000021 3 462.335330999922 490.786952999886 3 467.451291000005 492.437547000125 3 459.826137000229 496.025311000179 3 464.071967999917 493.899759000167 3 456.547443000134 497.955181000289 3 468.249412999954 486.334616000298 3 465.482481000014 484.134764000308 3 471.070791999809 483.665415999945 3 461.469093999825 496.853117000312 3 464.949471000116 498.282124000136 3 477.476850999985 487.796128000133 3 473.19994200021 481.32956700027 3 475.650100999977 481.682930999901 3 472.407899999991 490.043781999964 3 478.489666999783 484.528574000113 3 477.478930999991 490.609395000152 3 468.97280200012 490.981703000143 3 468.745244000107 487.238164000213 3 472.92670099996 486.226412999909 3 469.936017000116 481.785910000093 3 483.847126000095 475.061950000003 3 480.692332999781 478.47275400022 3 487.595226000063 475.968164999969 3 488.596621000208 468.429765000008 3 487.141553999856 472.961444000248 3 488.096462999936 473.145284000318 3 484.760805000085 472.528414000291 3 488.708178000059 479.04041399993 3 479.584040000103 466.580910000019 3 484.286896999925 468.517949000001 3 495.708829000127 468.662524000276 3 498.306367999874 463.256289000157 3 490.967838000041 466.609476000071 3 489.33916899981 474.366789000109 3 488.481294000056 461.520047999918 3 488.358967999928 461.594925000332 3 489.752803000156 472.862289000303 3 484.34510200005 472.196117000189 3 488.994771999773 460.876666999888 3 490.832779000048 468.55657200003 3 490.678313000128 455.870581999887 3 
489.974518000148 462.944128000177 3 497.031692999881 462.044984000269 3 484.53761800006 457.245074999984 3 488.48476300016 452.290405000094 3 484.257629999891 461.404545000289 3 496.414187000133 461.387347000185 3 493.16295599984 454.375983000267 3 491.825439999811 454.194912000094 3 488.775708000176 463.2106949999 3 481.173328999896 443.654618000146 3 494.338107000105 439.560624999925 3 485.760642999783 452.174839999992 3 484.106364999898 445.605880000163 3 487.434642000124 442.638418999966 3 480.945791000035 443.996179000009 3 485.675890999846 444.819050000049 3 481.355332999956 449.149621000048 3 495.079679000191 451.061327999923 3 493.256802000105 441.983272999991 3 488.721293999813 434.025133000221 3 479.183360000141 430.097661999986 3 484.496737999842 434.698993999977 3 485.407277999911 439.268786000088 3 485.903371999972 432.775454000104 3 477.372338000219 438.229150000028 3 474.560936000198 440.252245000098 3 475.857702000067 432.993828000035 3 482.089610999916 444.30068500014 3 474.89568899991 432.706654000096 3 478.456371000037 430.672968000174 3 472.835409999825 425.344663999975 3 469.408571000211 431.490079999901 3 467.778229999822 430.893924999982 3 468.319089999888 423.78520500008 3 466.404827999882 430.545737999957 3 474.656531999819 436.512926999945 3 473.704237000085 434.295117999893 3 469.99489099998 422.807921000291 3 470.22989499988 431.990844999906 3 468.125308000017 427.692645000294 3 459.370287000202 434.228566000238 3 465.574583999813 427.847649999894 3 471.47603000002 424.857528999913 3 458.833753000014 423.569963000249 3 457.583232000005 431.679839000106 3 463.169875000138 428.454710999969 3 466.314158000052 427.3446630002 3 462.784620000049 433.005989999976 3 458.836068000179 426.326979000121 3 452.97005000012 423.382490000222 3 458.167239000089 418.44309400022 3 453.854613000061 421.438537000213 3 457.294453000184 421.76099300012 3 457.539230000228 421.425176999997 3 456.41126499977 425.433713000268 3 455.357557999901 424.094777000137 3 
450.902567999903 415.553141000215 3 458.018389000092 418.812316999771 3 455.648783999961 422.940563000273 3 440.518015000038 417.753004999831 3 438.012492000125 412.980132000055 3 447.59070800012 422.962691000197 3 438.595366000198 419.423899000045 3 451.007958999835 415.56221299991 3 447.837811000179 423.666684000287 3 439.242604000028 423.527970999945 3 440.404937000014 418.397346999962 3 440.028320999816 416.051675000228 3 447.287671000231 417.663943999913 3 438.35898800008 414.573799000122 3 426.627871000208 416.068214000203 3 426.516830000095 414.542373000178 3 428.430649999995 416.653237999883 3 436.461378000211 416.150005999953 3 440.34707200015 419.265943999868 3 427.584884000011 408.791635999922 3 439.871456000023 407.839774999768 3 429.444986999966 412.304796000011 3 430.937460999936 405.849741999991 3 428.969322999939 415.703420999926 3 423.408722999971 406.841905999929 3 419.644838000182 405.488836999983 3 427.421575000044 403.303555000108 3 428.579102999996 410.244613000192 3 428.30893300008 401.526786999777 3 424.206778999884 408.167326000053 3 424.488547000103 411.169675999787 3 420.753368000034 409.300112999976 3 421.141454000026 403.093392999843 3 422.575889000203 395.963992000092 3 415.008001000155 394.48915300006 3 423.418938000221 402.424097000156 3 420.429361999966 398.279748000205 3 427.444771000184 400.071800000034 3 414.952022999991 393.633016000036 3 421.983272000216 390.16661400022 3 413.9837460001 393.733527000062 3 426.244646999985 394.742223000154 3 418.182521000039 399.909256999847 3 420.358260999899 375.894729999825 3 427.869353999849 388.467025999911 3 428.307055000216 382.592672999948 3 419.557771999855 382.19579999987 3 428.50125099998 388.969442999922 3 418.244741000235 379.359017999843 3 417.434104999993 376.864184000064 3 419.457435000222 387.456309999805 3 428.179967000149 386.792590999976 3 417.430131000001 385.002572000027 3 422.263489999808 372.909155999776 3 423.511117999908 364.854987000115 3 426.30204299977 
373.600267999806 3 425.074638000224 369.74961300008 3 419.22969500022 367.148517999798 3 425.448702000082 367.063457999844 3 427.935936000198 376.258545000106 3 421.599621000234 378.26745699998 3 429.803634000011 379.078275000211 3 428.726489999797 366.422532000113 3 429.293459999841 370.441796000116 3 422.024081000127 365.716256999876 3 422.080856999848 360.452484999783 3 434.971530999988 367.278249000199 3 436.256130999885 362.242738000117 3 424.723292000126 369.823642999865 3 432.911468999926 362.537899000105 3 427.971251000185 357.049992000218 3 434.024201000109 357.875876000151 3 436.15306300018 369.883030999918 3 440.152916000225 351.768310000189 3 436.668070000131 360.135976000223 3 430.671628000215 355.613342000172 3 434.786071000155 355.149490999989 3 435.98977799993 360.07734999992 3 442.275245999917 353.242513000034 3 439.905375999864 348.219847000204 3 444.840218999889 354.630938000046 3 433.893869999796 358.638927000109 3 430.86381699983 354.284175999928 3 442.700879999902 346.138824999798 3 446.513323000167 354.673012999818 3 439.407722999807 357.137496000156 3 446.086984000169 352.812396999914 3 444.458507999778 354.214385000058 3 448.701793000102 351.475484000053 3 438.790421000216 347.653041000012 3 449.342588 357.475145000033 3 445.903801999986 355.878285000101 3 437.50270299986 349.007557000034 3 450.747810999863 344.582192999776 3 452.843367999885 344.103289999999 3 454.279118999839 344.040136000142 3 444.908001999836 347.374592000153 3 453.119206999894 343.011994999833 3 453.410854999907 336.931435000151 3 445.570268999785 341.626203999855 3 455.76273200009 344.204905999824 3 453.301330999937 346.165316000115 3 443.662442000117 343.893095999956 3 450.592554000206 341.27203800017 3 451.640319999773 344.317819000222 3 457.559148000088 338.631022999994 3 460.726425000001 345.278055999894 3 451.824684000108 341.867209000047 3 452.238888000138 331.685310000088 3 452.757952000014 338.266270000022 3 455.351141999941 333.413383999839 3 448.535099999979 
332.97164299991 3 455.475157999899 341.506680000108 3 460.855047000106 338.930375999771 3 457.078596000094 330.556702999864 3 453.813031000085 332.122059999965 3 454.459245000035 335.722775999922 3 463.812841000035 332.088324999902 3 454.220263000112 336.861219000071 3 453.876569999848 335.621797999833 3 464.133146000095 340.191471000202 3 463.772503000218 333.303650999907 3 465.428991000168 337.623219000176 3 468.293345999904 323.258880999871 3 461.778946999926 325.126213999931 3 459.180929000024 325.846413000021 3 462.312686999794 332.107357000001 3 470.991812000051 323.379896999802 3 465.851267999969 322.656903000083 3 460.669528000057 324.675607999787 3 466.893639000133 330.128490000032 3 468.453945999965 327.243917000014 3 470.58936699992 324.017206999939 3 472.794015999883 326.346493000165 3 462.414896000177 327.932380000129 3 472.940739999991 320.993077000137 3 466.85614300007 329.479348999914 3 470.819778999779 326.629040999804 3 468.017585999798 319.480458999984 3 474.199337000027 325.671914999839 3 473.76885400014 317.551202999894 3 461.43383899983 317.940142000094 3 468.314850000199 324.931464999914 3 423.049947000109 461.602417000104 4 415.882664999925 462.230109999888 4 422.901542000007 463.280180000234 4 424.773157000076 461.591086000204 4 425.97969100019 463.723321999889 4 417.142891000025 458.040789000224 4 426.111376000103 451.80352600012 4 414.850494000129 454.950883999933 4 424.364312000107 456.30208500009 4 414.640132000204 459.097361000255 4 421.098943999968 465.900079000276 4 415.243125999812 460.873907000292 4 426.502336000092 463.425848000217 4 419.553964000195 461.354702000041 4 420.256450999994 462.199783000164 4 426.286921999883 453.050778999925 4 418.101619999856 455.696752000134 4 418.435087999795 457.909536000341 4 426.986647000071 459.493770000059 4 422.758165999781 452.658585000318 4 429.253837999888 454.370699000079 4 426.566916000098 466.358785000164 4 426.117418999784 457.360111000016 4 423.799103000201 465.722519000061 4 
419.570911000017 454.25428700028 4 425.907724000048 465.343441000208 4 415.338547000196 462.592960000038 4 425.639667000156 463.009326000232 4 418.12036800012 464.431340000127 4 421.191784000024 464.121416999958 4 466.573479000013 388.56095399987 5 473.242182999849 388.505501999985 5 477.989756000228 380.362852999941 5 468.493040999863 382.565651000012 5 478.035753999837 383.804427000228 5 465.900979999918 382.578755999915 5 467.047054000199 376.6596789998 5 465.586457999889 376.740127999801 5 466.30907699978 384.941691999789 5 466.533137999941 388.636098000221 5 475.669772000052 382.141518999822 5 479.811947000213 376.880171000026 5 474.566833000164 384.886010999791 5 468.031305999961 380.62018099986 5 478.953275999986 386.168779000174 5 477.305238999892 376.998159999959 5 472.392535000108 376.989792999811 5 479.798206999898 378.12027399987 5 470.961451000068 377.676510000136 5 466.5645750002 389.006446000189 5 468.762294999789 378.853031999897 5 477.20013799984 386.128392999992 5 477.790004000068 389.397962999996 5 472.586430999916 381.000587000046 5 467.618739999831 378.265711000189 5 468.390726000071 390.262949999887 5 474.892611999996 383.943504999857 5 479.888063999824 382.734370000195 5 470.796721999999 382.858122000005 5 469.433087000158 382.949008000083 5 473.588192000054 379.197623000015 5 470.089219999965 388.71069200011 5 479.300497000106 386.932823999785 5 478.489159000106 378.533429000061 5 475.063924999908 388.783824999817 5 465.624900999945 377.243751999922 5 477.853277999908 386.701561999973 5 478.867014999967 376.112596000079 5 466.944978000131 389.898415999953 5 475.431175000034 381.42365900008 5 397.510631000157 318.779397999868 6 393.182192000095 311.378210000228 6 405.59302699985 313.85961999977 6 400.207744999789 314.678629000206 6 399.142376000062 319.893732000142 6 395.942474999931 312.430680999998 6 403.492874000221 311.391166999936 6 403.187113999855 312.641197999939 6 394.807862999849 315.297671999782 6 399.650669999886 318.326419000048 
6 398.619932999834 309.981873000041 6 406.447836000007 313.045142000075 6 394.85633599991 306.184570999816 6 406.718776999973 314.228234999813 6 396.452070999891 309.648914000019 6 402.186852000188 312.914503999986 6 398.009571999777 306.306824999861 6 394.72197900014 300.856654000003 6 401.503841000143 306.34066199977 6 394.375514999963 301.732228000183 6 408.25665100012 307.194850999862 6 404.499704999849 310.703956000041 6 398.461693999823 300.209472000133 6 397.59167600004 302.597422000021 6 411.33420599997 303.615460999776 6 409.380115000065 301.257648999803 6 411.549291000236 307.317329999991 6 400.192313999869 299.548905999865 6 408.531942000147 309.208424000069 6 406.719988000114 301.970383999869 6 410.583399999887 302.046819999814 6 404.96436699992 312.242717999965 6 401.784442000091 305.12842199998 6 410.006078999955 310.366615999956 6 401.597389000002 313.42785899993 6 414.056191000156 310.538738999981 6 407.827461000066 315.803555000108 6 414.649985000025 304.695712000132 6 408.976344999857 302.572995000053 6 402.320390000008 303.321086000185 6 388.082185000181 312.699907000177 6 389.509529999923 308.985661000013 6 384.867300000042 306.26948300004 6 390.55796500016 307.401546000037 6 398.799327000044 310.043293000199 6 387.063124000095 315.284988000058 6 385.934568000026 303.611659999937 6 387.964254000224 307.235400999896 6 388.047687000129 312.992616999894 6 389.964410000015 314.521449999884 6 386.692520000041 295.829528000206 6 394.290382000152 296.759353000205 6 385.995306999888 309.215948000085 6 397.548890999984 306.051862000022 6 398.91705300007 301.137899999972 6 398.024985000025 294.545830999967 6 390.556464000139 299.987350000069 6 395.581344999839 307.565301999915 6 387.356068000197 306.987933999859 6 388.168403000105 308.765101999976 6 397.505044000223 306.712673999835 6 395.604325999971 299.74879400013 6 402.714682000224 300.474165999796 6 406.542411999777 297.360909999814 6 397.990000999998 293.174902999774 6 403.867875999771 
305.508750000037 6 398.770523999818 304.696136999875 6 399.030168999918 301.416486999951 6 402.828209000174 300.463953999802 6 399.719417999964 296.280708000064 6 406.255197000224 297.83671199996 6 410.982671999838 293.501579999924 6 406.042142999824 304.310072000138 6 398.845921 301.682546999771 6 399.413953000214 302.254333000164 6 399.175451999996 304.443797000218 6 410.2477330002 296.46242700005 6 408.364271000028 296.781977000181 6 410.001900999807 296.184016000014 6 398.705327000003 298.27604399994 6 364.387544000056 443.955205000006 -1 379.862741999794 467.835111000109 -1 370.757472000085 513.521511000581 -1 416.86848400021 524.035082000308 -1 311.539015999995 523.082759000361 -1 284.837520000059 487.712935999967 -1 265.537488000002 517.137047000229 -1 265.406068000011 467.304047000129 -1 331.243737000041 503.141515000258 -1 393.481540999841 431.555767999962 -1 455.248877999838 465.751176000107 -1 433.485898999963 440.123194000218 -1 341.20297600003 430.338018999901 -1 361.197627000045 414.171761000063 -1 392.739544999786 411.681011000182 -1 362.739223000128 364.252307000104 -1 380.037742000073 395.467621999793 -1 400.507069000043 365.464221000206 -1 377.944128999952 336.997504000086 -1 416.090559999924 338.672298000194 -1 444.825889000203 307.228933999781 -1 493.477363000158 300.292936999816 -1 450.57735500019 285.058660999872 -1 487.091688999906 274.915289999917 -1 428.451733000111 271.475777999964 -1 353.462441999931 268.77300899988 -1 332.63990199985 300.980235999916 -1 294.568130999804 269.832191000227 -1 286.768153000157 309.40664099995 -1 254.155410999898 273.011171999853 -1 255.736754999962 383.043938999996 -1 255.578135000076 321.797772999853 -1 269.571206999943 291.691631999798 -1 318.953784000129 371.279631999787 -1 322.208068999927 419.043138000183 -1 302.684162000194 410.35325299995 -1 339.387234999798 367.536927999929 -1 297.637213999871 384.627384999767 -1 311.072525999974 393.566060000099 -1 377.536865999922 377.286367000081 -1 
417.072476999834 433.68673299998 -1 396.357729999814 452.929053000174 -1 399.783197000157 485.216498000082 -1 362.771583000198 492.148952000309 -1 308.225172999781 492.362728999928 -1 286.942470000125 516.371205000207 -1 264.795897000004 498.772009999957 -1 454.217315000016 522.212020000443 -1 485.410180000123 522.141029000282 -1 515.508918000385 520.376193000004 -1 515.950048999861 493.279626999982 -1 517.4182099998 456.858341000043 -1 517.196076000109 419.766627999954 -1 519.392636000179 382.963473000098 -1 517.66203899961 356.089052000083 -1 520.467551999725 314.412516000215 -1 518.004772000015 280.905362999998 -1 499.387498000171 337.548436000012 -1 478.931082000025 356.823363000061 -1 497.154240000062 415.408559999894 -1 497.773620999884 377.833358000033 -1 445.298427000176 391.178739000112 -1 470.157571999822 409.287219999824 -1 357.421777000185 314.294974999968 -1 384.430027999915 277.063980999868 -1 268.806625999976 337.208126999903 -1 258.773949999828 431.312564000022 -1
================================================
FILE: data_src/data_DBCV/read_data.R
================================================
library(dbscan)

x <- read.table("Work/data_DBCV/dataset_1.txt")
colnames(x) <- c("x", "y", "class")
cl <- x[, 3]
cl[cl < 0] <- 0
x[, 3] <- cl
plot(x[, 1:2], col = x[, 3] + 1L, asp = 1)
Dataset_1 <- x
save(Dataset_1, file="data/Dataset_1.rda", version = 2)

x <- read.table("Work/data_DBCV/dataset_2.txt")
colnames(x) <- c("x", "y", "class")
cl <- x[, 3]
cl[cl < 0] <- 0
x[, 3] <- cl
clplot(x[, 1:2], x[, 3])
Dataset_2 <- x
save(Dataset_2, file="data/Dataset_2.rda", version = 2)

x <- read.table("Work/data_DBCV/dataset_3.txt")
colnames(x) <- c("x", "y", "class")
cl <- x[, 3]
cl[cl < 0] <- 0
x[, 3] <- cl
clplot(x[, 1:2], x[, 3])
Dataset_3 <- x
save(Dataset_3, file="data/Dataset_3.rda", version = 2)

x <- read.table("Work/data_DBCV/dataset_4.txt")
colnames(x) <- c("x", "y", "class")
cl <- x[, 3]
cl[cl < 0] <- 0
x[, 3] <- cl
clplot(x[, 1:2], x[, 3])
Dataset_4 <- x
save(Dataset_4, file="data/Dataset_4.rda", version = 2)
================================================
FILE: data_src/data_DBCV/test_DBCV.R
================================================
# From: https://github.com/FelSiq/DBCV
#
# Dataset        Python (Scipy's Kruskal's)  Python (Translated MST algorithm)  MATLAB
# dataset_1.txt  0.8566                      0.8576                             0.8576
# dataset_2.txt  0.5405                      0.8103                             0.8103
# dataset_3.txt  0.6308                      0.6319                             0.6319
# dataset_4.txt  0.8456                      0.8688                             0.8688
#
# Original MATLAB implementation is at:
# https://github.com/pajaskowiak/dbcv/tree/main/data

res <- c()

data(Dataset_1)
x <- Dataset_1[, c("x", "y")]
class <- Dataset_1$class
#clplot(x, class)
(db <- dbcv(x, class, metric = "sqeuclidean"))
res["ds1"] <- db$score
#dsc  [0.00457826 0.00457826 0.0183068  0.0183068 ]
#dspc [0.85627898 0.85627898 0.85627898 0.85627898]
#vcs  [0.99465331 0.99465331 0.97862052 0.97862052]
#0.8575741400490697

data(Dataset_2)
x <- Dataset_2[, c("x", "y")]
class <- Dataset_2$class
#clplot(x, class)
(db <- dbcv(x, class, metric = "sqeuclidean"))
res["ds2"] <- db$score
#dsc  [19.06151967 15.6082     83.71522964 68.969     ]
#dspc [860.2538    501.4376    501.4376    860.2538   ]
#vcs  [0.97784198  0.9688731   0.83304956  0.91982715 ]
#0.8103343589093096

data(Dataset_3)
x <- Dataset_3[, c("x", "y")]
class <- Dataset_3$class
#clplot(x, class)
(db <- dbcv(x, class, metric = "sqeuclidean"))
res["ds3"] <- db$score

data(Dataset_4)
x <- Dataset_4[, c("x", "y")]
class <- Dataset_4$class
#clplot(x, class)
(db <- dbcv(x, class, metric = "sqeuclidean"))
res["ds4"] <- db$score

cbind(dbscan = round(res, 2), MATLAB = c(0.85, 0.81, 0.63, 0.87))
================================================
FILE: data_src/data_chameleon/read.R
================================================
# Source: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download

chameleon_ds4 <- read.table("t4.8k.dat")
chameleon_ds5 <- read.table("t5.8k.dat")
chameleon_ds7 <- read.table("t7.10k.dat")
chameleon_ds8 <- read.table("t8.8k.dat")

colnames(chameleon_ds4) <-
colnames(chameleon_ds5) <- colnames(chameleon_ds7) <- colnames(chameleon_ds8) <- c("x", "y")

plot(chameleon_ds4)
plot(chameleon_ds5)
plot(chameleon_ds7)
plot(chameleon_ds8)

save(chameleon_ds4, chameleon_ds5, chameleon_ds7, chameleon_ds8, file="Chameleon.rda")
================================================
FILE: dbscan.Rproj
================================================
Version: 1.0
ProjectId: 6c2ba941-cfaa-4faa-ba72-88eeef0391b8

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX

AutoAppendNewline: Yes
StripTrailingWhitespace: Yes

BuildType: Package
PackageUseDevtools: Yes
PackageCleanBeforeInstall: No
PackageInstallArgs: --no-multiarch --with-keep.source
PackageBuildArgs: --compact-vignettes=both
PackageCheckArgs: --as-cran
PackageRoxygenize: rd,collate,namespace
================================================
FILE: inst/CITATION
================================================
citation(auto = meta)

bibentry(bibtype = "Article",
  title   = "{dbscan}: Fast Density-Based Clustering with {R}",
  author  = c(person(given = "Michael", family = "Hahsler",
                     email = "mhahsler@lyle.smu.edu",
                     comment = c(ORCID = "0000-0003-2716-1405")),
              person(given = "Matthew", family = "Piekenbrock"),
              person(given = "Derek", family = "Doran",
                     email = "derek.doran@wright.edu")),
  journal = "Journal of Statistical Software",
  year    = "2019",
  volume  = "91",
  number  = "1",
  pages   = "1--30",
  doi     = "10.18637/jss.v091.i01",
  header  = "To cite dbscan in publications use:"
)
================================================
FILE: man/DBCV_datasets.Rd
================================================
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/DBCV_datasets.R
\docType{data}
\name{DBCV_datasets}
\alias{DBCV_datasets}
\alias{Dataset_1}
\alias{Dataset_2}
\alias{Dataset_3}
\alias{Dataset_4}
\title{DBCV Paper Datasets}
\format{
Four data frames
with the following 3 variables.
\describe{
\item{x}{a numeric vector}
\item{y}{a numeric vector}
\item{class}{an integer vector indicating the class label. 0 means noise.}
}
}
\source{
https://github.com/pajaskowiak/dbcv
}
\description{
The four synthetic 2D datasets used in Moulavi et al (2014).
}
\examples{
data("Dataset_1")
clplot(Dataset_1[, c("x", "y")], cl = Dataset_1$class)

data("Dataset_2")
clplot(Dataset_2[, c("x", "y")], cl = Dataset_2$class)

data("Dataset_3")
clplot(Dataset_3[, c("x", "y")], cl = Dataset_3$class)

data("Dataset_4")
clplot(Dataset_4[, c("x", "y")], cl = Dataset_4$class)
}
\references{
Davoud Moulavi, Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Arthur Zimek and Jörg Sander (2014).
Density-Based Clustering Validation. In \emph{Proceedings of the 2014 SIAM
International Conference on Data Mining,} pages 839-847.
\doi{10.1137/1.9781611973440.96}
}
\keyword{datasets}
================================================
FILE: man/DS3.Rd
================================================
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/DS3.R
\docType{data}
\name{DS3}
\alias{DS3}
\title{DS3: Spatial data with arbitrary shapes}
\format{
A data.frame with 8000 observations on the following 2 columns:
\describe{
\item{X}{a numeric vector}
\item{Y}{a numeric vector}
}
}
\source{
Obtained from \url{http://cs.joensuu.fi/sipu/datasets/}
}
\description{
Contains 8000 2-d points, with 6 "natural" looking shapes, all of which have
a sinusoid-like shape that intersects with each cluster. The data set was
originally used as a benchmark data set for the Chameleon clustering
algorithm (Karypis, Han and Kumar, 1999) to illustrate a data set containing
arbitrarily shaped spatial data surrounded by both noise and artifacts.
}
\examples{
data(DS3)
plot(DS3, pch = 20, cex = 0.25)
}
\references{
Karypis, George, Eui-Hong Han, and Vipin Kumar (1999). Chameleon:
Hierarchical clustering using dynamic modeling.
\emph{Computer} 32(8): 68-75.
}
\keyword{datasets}
================================================
FILE: man/NN.Rd
================================================
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/NN.R
\name{NN}
\alias{NN}
\alias{adjacencylist}
\alias{adjacencylist.NN}
\alias{sort.NN}
\alias{plot.NN}
\title{NN --- Nearest Neighbors Superclass}
\usage{
adjacencylist(x, ...)

\method{adjacencylist}{NN}(x, ...)

\method{sort}{NN}(x, decreasing = FALSE, ...)

\method{plot}{NN}(x, data, main = NULL, pch = 16, col = NULL, linecol = "gray", ...)
}
\arguments{
\item{x}{a \code{NN} object}

\item{...}{further parameters passed on to \code{\link[=plot]{plot()}}.}

\item{decreasing}{sort in decreasing order?}

\item{data}{the data that was used to create \code{x}}

\item{main}{title}

\item{pch}{plotting character.}

\item{col}{color used for the data points (nodes).}

\item{linecol}{color used for edges.}
}
\description{
NN is an abstract S3 superclass for the classes of the objects returned by
\code{\link[=kNN]{kNN()}}, \code{\link[=frNN]{frNN()}} and \code{\link[=sNN]{sNN()}}.
Methods for sorting, plotting and getting an adjacency list are defined.
}
\section{Subclasses}{
\link{kNN}, \link{frNN} and \link{sNN}
}
\examples{
data(iris)
x <- iris[, -5]

# finding kNN directly in data (using a kd-tree)
nn <- kNN(x, k = 5)
nn

# plot the kNN where NN are shown as lines connecting points.
plot(nn, x) # show the first few elements of the adjacency list head(adjacencylist(nn)) \dontrun{ # create a graph and find connected components (if igraph is installed) library("igraph") g <- graph_from_adj_list(adjacencylist(nn)) comp <- components(g) plot(x, col = comp$membership) # detect clusters (communities) with the label propagation algorithm cl <- membership(cluster_label_prop(g)) plot(x, col = cl) } } \seealso{ Other NN functions: \code{\link{comps}()}, \code{\link{frNN}()}, \code{\link{kNN}()}, \code{\link{kNNdist}()}, \code{\link{sNN}()} } \author{ Michael Hahsler } \concept{NN functions} \keyword{model} ================================================ FILE: man/comps.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/comps.R \name{comps} \alias{comps} \alias{components} \alias{comps.dist} \alias{comps.kNN} \alias{comps.sNN} \alias{comps.frNN} \title{Find Connected Components in a Nearest-neighbor Graph} \usage{ comps(x, ...) \method{comps}{dist}(x, eps, ...) \method{comps}{kNN}(x, mutual = FALSE, ...) \method{comps}{sNN}(x, ...) \method{comps}{frNN}(x, ...) } \arguments{ \item{x}{the \link{NN} object representing the graph or a \link{dist} object} \item{...}{further arguments are currently unused.} \item{eps}{threshold on the distance} \item{mutual}{for a pair of points, do both have to be in each other's neighborhood?} } \value{ an integer vector with component assignments. } \description{ Generic function and methods to find connected components in nearest neighbor graphs. } \details{ Note that for kNN graphs, one point may be in the kNN of the other but not vice versa. \code{mutual = TRUE} requires that both points are in each other's kNN.
} \examples{ set.seed(665544) n <- 100 x <- cbind( x=runif(10, 0, 5) + rnorm(n, sd = 0.4), y=runif(10, 0, 5) + rnorm(n, sd = 0.4) ) plot(x, pch = 16) # Connected components on a graph where each pair of points # with a distance less than or equal to eps is connected d <- dist(x) components <- comps(d, eps = .8) plot(x, col = components, pch = 16) # Connected components in a fixed radius nearest neighbor graph # Gives the same result as the threshold on the distances above frnn <- frNN(x, eps = .8) components <- comps(frnn) plot(frnn, data = x, col = components) # Connected components on a k nearest neighbors graph knn <- kNN(x, 3) components <- comps(knn, mutual = FALSE) plot(knn, data = x, col = components) components <- comps(knn, mutual = TRUE) plot(knn, data = x, col = components) # Connected components in a shared nearest neighbor graph snn <- sNN(x, k = 10, kt = 5) components <- comps(snn) plot(snn, data = x, col = components) } \seealso{ Other NN functions: \code{\link{NN}}, \code{\link{frNN}()}, \code{\link{kNN}()}, \code{\link{kNNdist}()}, \code{\link{sNN}()} } \author{ Michael Hahsler } \concept{NN functions} \keyword{model} ================================================ FILE: man/dbcv.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/dbcv.R \name{dbcv} \alias{dbcv} \alias{DBCV} \title{Density-Based Clustering Validation Index (DBCV)} \usage{ dbcv(x, cl, d, metric = "euclidean", sample = NULL) } \arguments{ \item{x}{a data matrix or a dist object.} \item{cl}{a clustering (e.g., an integer vector)} \item{d}{dimensionality of the original data if a dist object is provided.} \item{metric}{distance metric used.
The available metrics are the methods implemented by \code{dist()} plus \code{"sqeuclidean"} for the squared Euclidean distance used in the original DBCV implementation.} \item{sample}{sample size used for large datasets.} } \value{ A list with the DBCV \code{score} for the clustering, the density sparseness of cluster (\code{dsc}) values, the density separation of pairs of clusters (\code{dspc}) distances, and the validity indices of clusters (\code{c_c}). } \description{ Calculate the Density-Based Clustering Validation Index (DBCV) for a clustering. } \details{ DBCV (Moulavi et al, 2014) computes a score based on the density sparseness of each cluster and the density separation of each pair of clusters. The density sparseness of a cluster (DSC) is defined as the maximum edge weight of a minimal spanning tree for the internal points of the cluster using the mutual reachability distance based on the all-points-core-distance. Internal points are connected to more than one other point in the cluster. Since clusters of a size less than 3 cannot have internal points, they are ignored (considered noise) in this implementation. The density separation of a pair of clusters (DSPC) is defined as the minimum reachability distance between the internal nodes of the spanning trees of the two clusters. The validity index for a cluster is calculated using these measures and aggregated to a validity index for the whole clustering using a weighted average. The index is in the range \eqn{[-1,1]}. If the cluster density compactness is better than the density separation, a positive value is returned. The actual value depends on the separability of the data. In general, greater values of the measure indicate a better density-based clustering solution. Noise points are included in the calculation only in the weighted average, therefore clusterings with more noise points receive a lower index.
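The size-weighted aggregation described above can be sketched in plain R. This is only an illustration of the final averaging step, not the package's implementation; the cluster sizes and per-cluster validity values below are made up:

```r
# Final DBCV score: size-weighted average of per-cluster validity indices.
# Hypothetical values for illustration only.
n     <- 100                 # total number of points, noise included
sizes <- c(40, 35, 15)       # cluster sizes; the remaining 10 points are noise
v_c   <- c(0.8, 0.5, -0.2)   # validity index of each cluster, in [-1, 1]

# Noise contributes no validity term but inflates n, lowering the score.
score <- sum(sizes / n * v_c)
score   # 0.465
```

This makes explicit why a clustering that labels more points as noise gets a lower index: noise enlarges the denominator without adding any positive term.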
\strong{Performance note:} This implementation calculates a distance matrix and thus can only be used for small or sampled datasets. } \examples{ # Load a test dataset data(Dataset_1) x <- Dataset_1[, c("x", "y")] class <- Dataset_1$class clplot(x, class) # We use minPts = 3 and the knee at eps = .1 for dbscan kNNdistplot(x, minPts = 3) cl <- dbscan(x, eps = .1, minPts = 3) clplot(x, cl) dbcv(x, cl) # compare to the DBCV index on the original class labels and # with a random partitioning dbcv(x, class) dbcv(x, sample(1:4, replace = TRUE, size = nrow(x))) # find the best eps using dbcv eps_grid <- seq(.05, .2, by = .01) cls <- lapply(eps_grid, FUN = function(e) dbscan(x, eps = e, minPts = 3)) dbcvs <- sapply(cls, FUN = function(cl) dbcv(x, cl)$score) plot(eps_grid, dbcvs, type = "l") eps_opt <- eps_grid[which.max(dbcvs)] eps_opt cl <- dbscan(x, eps = eps_opt, minPts = 3) clplot(x, cl) } \references{ Davoud Moulavi and Pablo A. Jaskowiak and Ricardo J. G. B. Campello and Arthur Zimek and Jörg Sander (2014). Density-Based Clustering Validation. In \emph{Proceedings of the 2014 SIAM International Conference on Data Mining,} pages 839-847 \doi{10.1137/1.9781611973440.96} Pablo A. Jaskowiak (2022). MATLAB implementation of DBCV. \url{https://github.com/pajaskowiak/dbcv} } \author{ Matt Piekenbrock and Michael Hahsler } \concept{Evaluation Functions} ================================================ FILE: man/dbscan-package.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/AAA_dbscan-package.R \docType{package} \name{dbscan-package} \alias{dbscan-package} \title{dbscan: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms} \description{ A fast reimplementation of several density-based algorithms of the DBSCAN family.
Includes the clustering algorithms DBSCAN (density-based spatial clustering of applications with noise) and HDBSCAN (hierarchical DBSCAN), the ordering algorithm OPTICS (ordering points to identify the clustering structure), shared nearest neighbor clustering, and the outlier detection algorithms LOF (local outlier factor) and GLOSH (global-local outlier score from hierarchies). The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search. An R interface to fast kNN and fixed-radius NN search is also provided. Hahsler, Piekenbrock and Doran (2019) \doi{10.18637/jss.v091.i01}. } \section{Key functions}{ \itemize{ \item Clustering: \code{\link[=dbscan]{dbscan()}}, \code{\link[=hdbscan]{hdbscan()}}, \code{\link[=optics]{optics()}}, \code{\link[=jpclust]{jpclust()}}, \code{\link[=sNNclust]{sNNclust()}} \item Outliers: \code{\link[=lof]{lof()}}, \code{\link[=glosh]{glosh()}}, \code{\link[=pointdensity]{pointdensity()}} \item Nearest Neighbors: \code{\link[=kNN]{kNN()}}, \code{\link[=frNN]{frNN()}}, \code{\link[=sNN]{sNN()}} } } \references{ Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software, 91(1), 1-30. 
\doi{10.18637/jss.v091.i01} } \seealso{ Useful links: \itemize{ \item \url{https://github.com/mhahsler/dbscan} \item Report bugs at \url{https://github.com/mhahsler/dbscan/issues} } } \author{ \strong{Maintainer}: Michael Hahsler \email{mhahsler@lyle.smu.edu} (\href{https://orcid.org/0000-0003-2716-1405}{ORCID}) [copyright holder] Authors: \itemize{ \item Matthew Piekenbrock [copyright holder] } Other contributors: \itemize{ \item Sunil Arya [contributor, copyright holder] \item David Mount [contributor, copyright holder] \item Claudia Malzer [contributor] } } \keyword{internal} ================================================ FILE: man/dbscan.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/dbscan.R, R/predict.R \name{dbscan} \alias{dbscan} \alias{DBSCAN} \alias{print.dbscan_fast} \alias{is.corepoint} \alias{predict.dbscan_fast} \title{Density-based Spatial Clustering of Applications with Noise (DBSCAN)} \usage{ dbscan(x, eps, minPts = 5, weights = NULL, borderPoints = TRUE, ...) is.corepoint(x, eps, minPts = 5, ...) \method{predict}{dbscan_fast}(object, newdata, data, ...) } \arguments{ \item{x}{a data matrix, a data.frame, a \link{dist} object or a \link{frNN} object with fixed-radius nearest neighbors.} \item{eps}{size (radius) of the epsilon neighborhood. Can be omitted if \code{x} is a frNN object.} \item{minPts}{number of minimum points required in the eps neighborhood for core points (including the point itself).} \item{weights}{numeric; weights for the data points. Only needed to perform weighted clustering.} \item{borderPoints}{logical; should border points be assigned to clusters. The default is \code{TRUE} for regular DBSCAN. If \code{FALSE} then border points are considered noise (see DBSCAN* in Campello et al, 2013).} \item{...}{additional arguments are passed on to the fixed-radius nearest neighbor search algorithm. 
See \code{\link[=frNN]{frNN()}} for details on how to control the search strategy.} \item{object}{clustering object.} \item{newdata}{new data points for which the cluster membership should be predicted.} \item{data}{the data set used to create the clustering object.} } \value{ \code{dbscan()} returns an object of class \code{dbscan_fast} with the following components: \item{eps }{ value of the \code{eps} parameter.} \item{minPts }{ value of the \code{minPts} parameter.} \item{metric }{ used distance metric.} \item{cluster }{An integer vector with cluster assignments. Zero indicates noise points.} \code{is.corepoint()} returns a logical vector indicating for each data point if it is a core point. } \description{ Fast reimplementation of the DBSCAN (Density-based spatial clustering of applications with noise) clustering algorithm using a kd-tree. } \details{ The implementation is significantly faster and can work with larger data sets than \code{\link[fpc:dbscan]{fpc::dbscan()}} in \pkg{fpc}. Use \code{dbscan::dbscan()} (specifying the package) to call this implementation when you also load package \pkg{fpc}. \strong{The algorithm} This implementation of DBSCAN follows the original algorithm as described by Ester et al (1996). DBSCAN performs the following steps: \enumerate{ \item Estimate the density around each data point by counting the number of points in a user-specified eps-neighborhood and apply a user-specified minPts threshold to identify \itemize{ \item core points (points with at least minPts points in their neighborhood), \item border points (non-core points with a core point in their neighborhood) and \item noise points (all other points). } \item Core points form the backbone of clusters by joining them into a cluster if they are density-reachable from each other (i.e., there is a chain of core points where one falls inside the eps-neighborhood of the next). \item Border points are assigned to clusters.
The algorithm needs parameters \code{eps} (the radius of the epsilon neighborhood) and \code{minPts} (the density threshold). } Border points are arbitrarily assigned to clusters in the original algorithm. DBSCAN* (see Campello et al 2013) treats all border points as noise points. This is implemented with \code{borderPoints = FALSE}. \strong{Specifying the data} If \code{x} is a matrix or a data.frame, then fast fixed-radius nearest neighbor computation using a kd-tree is performed using Euclidean distance. See \code{\link[=frNN]{frNN()}} for more information on the parameters related to nearest neighbor search. \strong{Note} that only numerical values are allowed in \code{x}. Any precomputed distance matrix (dist object) can be specified as \code{x}. You may run into memory issues since distance matrices are large. A precomputed frNN object can be supplied as \code{x}. In this case \code{eps} does not need to be specified. This option is useful for large data sets, where a sparse distance matrix is available. See \code{\link[=frNN]{frNN()}} for how to create frNN objects. \strong{Setting parameters for DBSCAN} The parameters \code{minPts} and \code{eps} define the minimum density required in the area around core points which form the backbone of clusters. \code{minPts} is the number of points required in the neighborhood around the point defined by the parameter \code{eps} (i.e., the radius around the point). Both parameters depend on each other and changing one typically requires changing the other one as well. The parameters also depend on the size of the data set with larger datasets requiring a larger \code{minPts} or a smaller \code{eps}. \itemize{ \item \verb{minPts:} The original DBSCAN paper (Ester et al, 1996) suggests starting by setting \eqn{\text{minPts} \ge d + 1}, the data dimensionality plus one or higher with a minimum of 3.
Larger values are preferable since increasing the parameter suppresses more noise in the data by requiring more points to form clusters. Sander et al (1998) use two times the data dimensionality in their examples. Note that setting \eqn{\text{minPts} \le 2} is equivalent to hierarchical clustering with the single link metric and the dendrogram cut at height \code{eps}. \item \verb{eps:} A suitable neighborhood size parameter \code{eps} given a fixed value for \code{minPts} can be found visually by inspecting the \code{\link[=kNNdistplot]{kNNdistplot()}} of the data using \eqn{k = \text{minPts} - 1} (\code{minPts} includes the point itself, while the k-nearest neighbors distance does not). The k-nearest neighbor distance plot sorts all data points by their k-nearest neighbor distance. A sudden increase of the kNN distance (a knee) indicates that the points to the right are most likely outliers. Choose \code{eps} for DBSCAN where the knee is. } \strong{Predict cluster memberships} \code{\link[=predict]{predict()}} can be used to predict cluster memberships for new data points. A point is considered a member of a cluster if it is within the eps neighborhood of a core point of the cluster. Points which cannot be assigned to a cluster will be reported as noise points (i.e., cluster ID 0). \strong{Important note:} \code{predict()} currently can only use Euclidean distance to determine the neighborhood of core points. If \code{dbscan()} was called using distances other than Euclidean, then the neighborhood calculation will not be correct and only approximated by Euclidean distances. If the data contain factor columns (e.g., using Gower's distance), then the factors in \code{data} and \code{newdata} first need to be converted to numeric to use the Euclidean approximation. } \examples{ ## Example 1: use dbscan on the iris data set data(iris) iris <- as.matrix(iris[, 1:4]) ## Find suitable DBSCAN parameters: ## 1. We use minPts = dim + 1 = 5 for iris.
A larger value can also be used. ## 2. We inspect the k-NN distance plot for k = minPts - 1 = 4 kNNdistplot(iris, minPts = 5) ## Noise seems to start around a 4-NN distance of .7 abline(h=.7, col = "red", lty = 2) ## Cluster with the chosen parameters res <- dbscan(iris, eps = .7, minPts = 5) res pairs(iris, col = res$cluster + 1L) clplot(iris, res) ## Use a precomputed frNN object fr <- frNN(iris, eps = .7) dbscan(fr, minPts = 5) ## Example 2: use data from fpc set.seed(665544) n <- 100 x <- cbind( x = runif(10, 0, 10) + rnorm(n, sd = 0.2), y = runif(10, 0, 10) + rnorm(n, sd = 0.2) ) res <- dbscan(x, eps = .3, minPts = 3) res ## plot clusters and add noise (cluster 0) as crosses. plot(x, col = res$cluster) points(x[res$cluster == 0, ], pch = 3, col = "grey") clplot(x, res) hullplot(x, res) ## Predict cluster membership for new data points ## (Note: 0 means it is predicted as noise) newdata <- x[1:5,] + rnorm(10, 0, .3) hullplot(x, res) points(newdata, pch = 3, col = "red", lwd = 3) text(newdata, pos = 1) pred_label <- predict(res, newdata, data = x) pred_label points(newdata, col = pred_label + 1L, cex = 2, lwd = 2) ## Compare speed against fpc version (if microbenchmark is installed) ## Note: we use dbscan::dbscan to make sure that we do not run the ## implementation in fpc.
\dontrun{ if (requireNamespace("fpc", quietly = TRUE) && requireNamespace("microbenchmark", quietly = TRUE)) { t_dbscan <- microbenchmark::microbenchmark( dbscan::dbscan(x, .3, 3), times = 10, unit = "ms") t_dbscan_linear <- microbenchmark::microbenchmark( dbscan::dbscan(x, .3, 3, search = "linear"), times = 10, unit = "ms") t_dbscan_dist <- microbenchmark::microbenchmark( dbscan::dbscan(x, .3, 3, search = "dist"), times = 10, unit = "ms") t_fpc <- microbenchmark::microbenchmark( fpc::dbscan(x, .3, 3), times = 10, unit = "ms") r <- rbind(t_fpc, t_dbscan_dist, t_dbscan_linear, t_dbscan) r boxplot(r, names = c('fpc', 'dbscan (dist)', 'dbscan (linear)', 'dbscan (kdtree)'), main = "Runtime comparison in ms") ## speedup of the kd-tree-based version compared to the fpc implementation median(t_fpc$time) / median(t_dbscan$time) }} ## Example 3: manually create a frNN object for dbscan (dbscan only needs ids and eps) nn <- structure(list(id = list(c(2,3), c(1,3), c(1,2,3), c(3,5), c(4,5)), eps = 1), class = c("NN", "frNN")) nn dbscan(nn, minPts = 2) } \references{ Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R. \emph{Journal of Statistical Software,} 91(1), 1-30. \doi{10.18637/jss.v091.i01} Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. \emph{Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96),} 226-231. \url{https://dl.acm.org/doi/10.5555/3001460.3001507} Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013, \emph{Lecture Notes in Computer Science} 7819, p. 160. \doi{10.1007/978-3-642-37456-2_14} Sander, J., Ester, M., Kriegel, HP. et al. (1998). 
Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. \emph{Data Mining and Knowledge Discovery} 2, 169-194. \doi{10.1023/A:1009745219419} } \seealso{ Other clustering functions: \code{\link{extractFOSC}()}, \code{\link{hdbscan}()}, \code{\link{jpclust}()}, \code{\link{ncluster}()}, \code{\link{optics}()}, \code{\link{sNNclust}()} } \author{ Michael Hahsler } \concept{clustering functions} \keyword{clustering} \keyword{model} ================================================ FILE: man/dbscan_tidiers.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/broom-dbscan-tidiers.R \name{dbscan_tidiers} \alias{dbscan_tidiers} \alias{glance} \alias{tidy} \alias{augment} \alias{tidy.dbscan} \alias{tidy.hdbscan} \alias{tidy.general_clustering} \alias{augment.dbscan} \alias{augment.hdbscan} \alias{augment.general_clustering} \alias{glance.dbscan} \alias{glance.hdbscan} \alias{glance.general_clustering} \title{Turn a dbscan clustering object into a tidy tibble} \usage{ tidy(x, ...) \method{tidy}{dbscan}(x, ...) \method{tidy}{hdbscan}(x, ...) \method{tidy}{general_clustering}(x, ...) augment(x, ...) \method{augment}{dbscan}(x, data = NULL, newdata = NULL, ...) \method{augment}{hdbscan}(x, data = NULL, newdata = NULL, ...) \method{augment}{general_clustering}(x, data = NULL, newdata = NULL, ...) glance(x, ...) \method{glance}{dbscan}(x, ...) \method{glance}{hdbscan}(x, ...) \method{glance}{general_clustering}(x, ...)
} \arguments{ \item{x}{A \code{dbscan} object returned from \code{\link[=dbscan]{dbscan()}}.} \item{...}{further arguments are ignored without a warning.} \item{data}{The data used to create the clustering.} \item{newdata}{New data to predict cluster labels for.} } \description{ Provides \link[generics:tidy]{tidy()}, \link[generics:augment]{augment()}, and \link[generics:glance]{glance()} verbs for clusterings created with algorithms in package \code{dbscan} to work with \href{https://www.tidymodels.org/}{tidymodels}. } \examples{ \dontshow{if (requireNamespace("tibble", quietly = TRUE) && identical(Sys.getenv("NOT_CRAN"), "true")) withAutoprint(\{ # examplesIf} data(iris) x <- scale(iris[, 1:4]) ## dbscan db <- dbscan(x, eps = .9, minPts = 5) db # summarize model fit with tidiers tidy(db) glance(db) # augment for this model needs the original data augment(db, x) # to augment new data, the original data is also needed augment(db, x, newdata = x[1:5, ]) ## hdbscan hdb <- hdbscan(x, minPts = 5) # summarize model fit with tidiers tidy(hdb) glance(hdb) # augment for this model needs the original data augment(hdb, x) # to augment new data, the original data is also needed augment(hdb, x, newdata = x[1:5, ]) ## Jarvis-Patrick clustering cl <- jpclust(x, k = 20, kt = 15) # summarize model fit with tidiers tidy(cl) glance(cl) # augment for this model needs the original data augment(cl, x) ## Shared Nearest Neighbor clustering cl <- sNNclust(x, k = 20, eps = 0.8, minPts = 15) # summarize model fit with tidiers tidy(cl) glance(cl) # augment for this model needs the original data augment(cl, x) \dontshow{\}) # examplesIf} } \seealso{ \code{\link[generics:tidy]{generics::tidy()}}, \code{\link[generics:augment]{generics::augment()}}, \code{\link[generics:glance]{generics::glance()}}, \code{\link[=dbscan]{dbscan()}} } \concept{tidiers} ================================================ FILE: man/dendrogram.Rd ================================================ % Generated by
roxygen2: do not edit by hand % Please edit documentation in R/dendrogram.R \name{dendrogram} \alias{dendrogram} \alias{as.dendrogram} \alias{as.dendrogram.default} \alias{as.dendrogram.hclust} \alias{as.dendrogram.hdbscan} \alias{as.dendrogram.reachability} \title{Coercions to Dendrogram} \usage{ as.dendrogram(object, ...) \method{as.dendrogram}{default}(object, ...) \method{as.dendrogram}{hclust}(object, ...) \method{as.dendrogram}{hdbscan}(object, ...) \method{as.dendrogram}{reachability}(object, ...) } \arguments{ \item{object}{the object} \item{...}{further arguments} } \description{ Provides a new generic function to coerce objects to dendrograms with \code{\link[stats:dendrogram]{stats::as.dendrogram()}} as the default. Additional methods for \link{hclust}, \link{hdbscan} and \link{reachability} objects are provided. } \details{ Coercion methods for \link{hclust}, \link{hdbscan} and \link{reachability} objects to \link{dendrogram} are provided. The coercion from \code{hclust} is a faster C++ reimplementation of the coercion in package \code{stats}. The original implementation can be called using \code{\link[stats:dendrogram]{stats::as.dendrogram()}}. The coercion from \link{hdbscan} builds the non-simplified HDBSCAN hierarchy as a dendrogram object. } ================================================ FILE: man/extractFOSC.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/extractFOSC.R \name{extractFOSC} \alias{extractFOSC} \title{Framework for the Optimal Extraction of Clusters from Hierarchies} \usage{ extractFOSC( x, constraints, alpha = 0, minPts = 2L, prune_unstable = FALSE, validate_constraints = FALSE ) } \arguments{ \item{x}{a valid \link{hclust} object created via \code{\link[=hclust]{hclust()}} or \code{\link[=hdbscan]{hdbscan()}}.} \item{constraints}{Either a list or matrix of pairwise constraints.
If missing, an unsupervised measure of stability is used to make local cuts and extract the optimal clusters. See details.} \item{alpha}{numeric; weight between \eqn{[0, 1]} for mixed-objective semi-supervised extraction. Defaults to 0.} \item{minPts}{numeric; Defaults to 2. Only needed if class-less noise is a valid label in the model.} \item{prune_unstable}{logical; should significantly unstable subtrees be pruned? The default is \code{FALSE} for the original optimal extraction framework (see Campello et al, 2013). See details for what \code{TRUE} implies.} \item{validate_constraints}{logical; should constraints be checked for validity? See details for what is considered a valid constraint.} } \value{ A list with the elements: \item{cluster }{An integer vector with cluster assignments. Zero indicates noise points (if any).} \item{hc }{The original \link{hclust} object with additional list elements \code{"stability"}, \code{"constraint"}, and \code{"total"} for the \eqn{n - 1} cluster-wide objective scores from the extraction.} } \description{ Generic reimplementation of the \emph{Framework for Optimal Selection of Clusters} (FOSC; Campello et al, 2013) to extract clusterings from hierarchical clustering (i.e., \link{hclust} objects). Can be parameterized to perform unsupervised cluster extraction through a stability-based measure, or semisupervised cluster extraction through either a constraint-based extraction (with a stability-based tiebreaker) or a mixed (weighted) constraint and stability-based objective extraction. } \details{ Campello et al (2013) suggested a \emph{Framework for Optimal Selection of Clusters} (FOSC) as a framework to make local (non-horizontal) cuts to any cluster tree hierarchy. This function implements the original extraction algorithms as described by the framework for hclust objects.
Traditional cluster extraction methods from hierarchical representations (such as \link{hclust} objects) generally rely on global parameters or cutting values which are used to partition a cluster hierarchy into a set of disjoint, flat clusters. This is implemented in R in function \code{\link[stats:cutree]{stats::cutree()}}. Although such methods are widespread, they are inherently limited by their global parameter settings in that they cannot capture patterns within the cluster hierarchy at varying \emph{local} levels of granularity. Rather than partitioning a hierarchy based on the number of clusters one expects to find (\eqn{k}) or based on some linkage distance threshold (\eqn{H}), the FOSC proposes that the optimal clusters may exist at varying distance thresholds in the hierarchy. To enable this idea, FOSC requires one parameter (minPts) that represents \emph{the minimum number of points that constitute a valid cluster.} The first step of the FOSC algorithm is to traverse the given cluster hierarchy divisively, recording new clusters at each split if both branches contain at least minPts points. Branches where one or both sides contain fewer than minPts points inherit the parent cluster's identity. Note that using FOSC, due to the constraint that minPts must be greater than or equal to 2, it is possible that the optimal cluster solution chosen makes local cuts that render parent branches of sizes less than minPts as noise, which are denoted as 0 in the final solution.
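For contrast, the global horizontal cuts that FOSC generalizes can be made with base R's \code{stats::cutree()} alone. This sketch uses synthetic data and is not part of the package; it only shows the two global parameterizations (\eqn{k} and \eqn{H}) mentioned above:

```r
# Global (horizontal) cuts of a dendrogram with stats::cutree() -- the
# approach FOSC generalizes with local, non-horizontal cuts.
set.seed(42)
x  <- matrix(rnorm(40), ncol = 2)           # 20 random 2-d points
hc <- hclust(dist(x), method = "average")   # a plain hclust hierarchy

cl_k <- cutree(hc, k = 3)    # cut by the expected number of clusters (k)
cl_h <- cutree(hc, h = 1.0)  # cut at a fixed linkage height (H) instead
table(cl_k)
```

Every point receives a label from one single horizontal cut; FOSC instead selects clusters from different heights of the same tree.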
Traversing the original cluster tree using minPts creates a new, simplified cluster tree that is then post-processed recursively to extract clusters that maximize for each cluster \eqn{C_i}{Ci} the cost function \deqn{\max_{\delta_2, \dots, \delta_k} J = \sum\limits_{i=2}^{k} \delta_i S(C_i)}{ J = \sum \delta S(Ci) for all i clusters, } where \eqn{S(C_i)}{S(Ci)} is the stability-based measure as \deqn{ S(C_i) = \sum_{x_j \in C_i}(\frac{1}{h_{min} (x_j, C_i)} - \frac{1}{h_{max} (C_i)}) }{ S(Ci) = \sum (1/Hmin(Xj, Ci) - 1/Hmax(Ci)) for all Xj in Ci.} \eqn{\delta_i}{\delta} represents an indicator function, which constrains the solution space such that clusters must be disjoint (cannot assign more than 1 label to each cluster). The measure \eqn{S(C_i)}{S(Ci)} used by FOSC is an unsupervised validation measure based on the assumption that, if you vary the linkage/distance threshold across all possible values, more prominent clusters that survive over many threshold variations should be considered as stronger candidates of the optimal solution. For this reason, using this measure to detect clusters is referred to as an unsupervised, \emph{stability-based} extraction approach. In some cases it may be useful to enact \emph{instance-level} constraints that ensure the solution space conforms to linkage expectations known \emph{a priori}. This general idea of using preliminary expectations to augment the clustering solution will be referred to as \emph{semisupervised clustering}. If constraints are given in the call to \code{extractFOSC()}, the following alternative objective function is maximized: \deqn{J = \frac{1}{2n_c}\sum\limits_{j=1}^n \gamma (x_j)}{J = 1/(2 * nc) \sum \gamma(Xj)} \eqn{n_c}{nc} is the total number of constraints given and \eqn{\gamma(x_j)}{\gamma(Xj)} represents the number of constraints involving object \eqn{x_j}{Xj} that are satisfied. 
In the case of ties (such as solutions where no constraints were given), the unsupervised solution is used as a tiebreaker. See Campello et al (2013) for more details. As a third option, if one wishes to prioritize the degree at which the unsupervised and semisupervised solutions contribute to the overall optimal solution, the parameter \eqn{\alpha} can be set to enable the extraction of clusters that maximize the \code{mixed} objective function \deqn{J = \alpha S(C_i) + (1 - \alpha) \gamma(C_i)}{J = \alpha S(Ci) + (1 - \alpha) \gamma(Ci).} FOSC expects the pairwise constraints to be passed as either 1) an \eqn{n(n-1)/2} vector of integers representing the constraints, where 1 represents should-link, -1 represents should-not-link, and 0 represents no preference using the unsupervised solution (see below for examples), or 2) if only a few constraints are needed, a named list representing the (symmetric) adjacency list, where the names correspond to indices of the points in the original data, and the values correspond to integer vectors of constraints (positive indices for should-link, negative indices for should-not-link). Again, see the examples section for a demonstration of this. The parameters to the input function correspond to the concepts discussed above. The \code{minPts} parameter represents the minimum cluster size to extract. The optional \code{constraints} parameter contains the pairwise, instance-level constraints of the data. The optional \code{alpha} parameter controls whether the mixed objective function is used (if \code{alpha} is greater than 0). If the \code{validate_constraints} parameter is set to \code{TRUE}, the constraints are checked (and fixed) for symmetry (if point A has a should-link constraint with point B, point B should also have the same constraint). Asymmetric constraints are not supported.
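To make the two constraint encodings concrete, here is a small hand-built sketch. The point indices are hypothetical and no package functions are needed to build the encodings themselves:

```r
# Encode "points 1 and 2 should link, points 1 and 3 should not link"
# for a toy data set of n = 4 points.

# 1) Dense encoding: an integer vector of length n(n-1)/2 in the same
#    order as dist(): pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
n <- 4
con_vec <- integer(n * (n - 1) / 2)  # 0 = no preference everywhere
con_vec[1] <-  1L                    # pair (1,2): should-link
con_vec[2] <- -1L                    # pair (1,3): should-not-link

# 2) Sparse encoding: a named, symmetric adjacency list; positive
#    indices mean should-link, negative indices mean should-not-link.
con_list <- list("1" = c(2L, -3L),
                 "2" = 1L,
                 "3" = -1L)
```

Either object could then be supplied as the \code{constraints} argument; the sparse form is more convenient when only a handful of constraints are known.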
Unstable branch pruning was not discussed by Campello et al (2013); however, in some data sets the scores of specific subbranches may be significantly greater than those of their sibling and parent branches, and the sibling branches should then be considered noise if their scores are cumulatively lower than the parent's. This can happen in extremely nonhomogeneous data sets, where there exist locally very stable branches surrounded by unstable branches that contain more than \code{minPts} points. \code{prune_unstable = TRUE} will remove the unstable branches. } \examples{ data("moons") ## Regular HDBSCAN using stability-based extraction (unsupervised) cl <- hdbscan(moons, minPts = 5) cl$cluster ## Constraint-based extraction from the HDBSCAN hierarchy ## (w/ stability-based tiebreaker (semisupervised)) cl_con <- extractFOSC(cl$hc, minPts = 5, constraints = list("12" = c(49, -47))) cl_con$cluster ## Alternative formulation: Constraint-based extraction from the HDBSCAN hierarchy ## (w/ stability-based tiebreaker (semisupervised)) using distance thresholds dist_moons <- dist(moons) cl_con2 <- extractFOSC(cl$hc, minPts = 5, constraints = ifelse(dist_moons < 0.1, 1L, ifelse(dist_moons > 1, -1L, 0L))) cl_con2$cluster # same as the second example } \references{ Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Joerg Sander (2013). A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. \emph{Data Mining and Knowledge Discovery} 27(3): 344-371.
\doi{10.1007/s10618-013-0311-4} } \seealso{ \code{\link[=hclust]{hclust()}}, \code{\link[=hdbscan]{hdbscan()}}, \code{\link[stats:cutree]{stats::cutree()}} Other clustering functions: \code{\link{dbscan}()}, \code{\link{hdbscan}()}, \code{\link{jpclust}()}, \code{\link{ncluster}()}, \code{\link{optics}()}, \code{\link{sNNclust}()} } \author{ Matt Piekenbrock } \concept{clustering functions} \keyword{clustering} \keyword{model} ================================================ FILE: man/frNN.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/frNN.R \name{frNN} \alias{frNN} \alias{frnn} \alias{print.frnn} \alias{sort.frNN} \alias{adjacencylist.frNN} \alias{print.frNN} \title{Find the Fixed Radius Nearest Neighbors} \usage{ frNN( x, eps, query = NULL, sort = TRUE, search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0 ) \method{sort}{frNN}(x, decreasing = FALSE, ...) \method{adjacencylist}{frNN}(x, ...) \method{print}{frNN}(x, ...) } \arguments{ \item{x}{a data matrix, a dist object or a frNN object.} \item{eps}{neighbors radius.} \item{query}{a data matrix with the points to query. If query is not specified, the NN for all the points in \code{x} is returned. If query is specified then \code{x} needs to be a data matrix.} \item{sort}{sort the neighbors by distance? This is expensive and can be done later using \code{sort()}.} \item{search}{nearest neighbor search strategy (one of \code{"kdtree"}, \code{"linear"} or \code{"dist"}).} \item{bucketSize}{max size of the kd-tree leafs.} \item{splitRule}{rule to split the kd-tree. One of \code{"STD"}, \code{"MIDPT"}, \code{"FAIR"}, \code{"SL_MIDPT"}, \code{"SL_FAIR"} or \code{"SUGGEST"} (SL stands for sliding). \code{"SUGGEST"} uses ANNs best guess.} \item{approx}{use approximate nearest neighbors. All NN up to a distance of a factor of \code{1 + approx} eps may be used. 
Some actual NN may be omitted, leading to spurious clusters and noise points. However, the algorithm will enjoy a significant speedup.} \item{decreasing}{sort in decreasing order?} \item{...}{further arguments} } \value{ \code{frNN()} returns an object of class \link{frNN} (subclass of \link{NN}) containing a list with the following components: \item{id }{a list of integer vectors. Each vector contains the ids (row numbers) of the fixed radius nearest neighbors. } \item{dist }{a list with distances (same structure as \code{id}). } \item{eps }{ neighborhood radius \code{eps} that was used. } \item{metric }{ used distance metric. } \code{adjacencylist()} returns a list with one entry per data point in \code{x}. Each entry contains the ids of the nearest neighbors. } \description{ This function uses a kd-tree to find the fixed radius nearest neighbors (including distances) fast. } \details{ If \code{x} is specified as a data matrix, then Euclidean distances and fast nearest neighbor lookup using a kd-tree are used. To create an frNN object from scratch, you need to supply at least the element \code{id} (a list of integer vectors with the nearest neighbor ids for each point) and \code{eps} (see below). \strong{Self-matches:} Self-matches are not returned!
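Following the from-scratch description above, a minimal hand-built object might look as follows. This is only a sketch: the class attributes assumed here mirror the \link{NN} subclass structure, and the exact set of elements other functions expect may vary.

```r
# Illustrative sketch: a hand-built frNN object for 3 points, supplying the
# minimal elements described above (id and eps; dist is optional but useful).
nn <- structure(
  list(
    id   = list(c(2L, 3L), c(1L), c(1L)),   # neighbors within eps per point
    dist = list(c(0.2, 0.4), c(0.2), c(0.4)),
    eps  = 0.5
  ),
  class = c("frNN", "NN")
)
nn$id[[1]]  # neighbor ids of point 1
```

Such an object can then be supplied wherever a precomputed frNN object is accepted, e.g., as input \code{x} to \code{dbscan()}.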
} \examples{ data(iris) x <- iris[, -5] # Example 1: Find fixed radius nearest neighbors for each point nn <- frNN(x, eps = .5) nn # Number of neighbors hist(lengths(adjacencylist(nn)), xlab = "k", main="Number of Neighbors", sub = paste("Neighborhood size eps =", nn$eps)) # Explore neighbors of point i = 10 i <- 10 nn$id[[i]] nn$dist[[i]] plot(x, col = ifelse(seq_len(nrow(iris)) \%in\% nn$id[[i]], "red", "black")) # get an adjacency list head(adjacencylist(nn)) # plot the fixed radius neighbors (and then reduced to a radius of .3) plot(nn, x) plot(frNN(nn, eps = .3), x) ## Example 2: find fixed-radius NN for query points q <- x[c(1,100),] nn <- frNN(x, eps = .5, query = q) plot(nn, x, col = "grey") points(q, pch = 3, lwd = 2) } \references{ David M. Mount and Sunil Arya (2010). ANN: A Library for Approximate Nearest Neighbor Searching, \url{http://www.cs.umd.edu/~mount/ANN/}. } \seealso{ Other NN functions: \code{\link{NN}}, \code{\link{comps}()}, \code{\link{kNN}()}, \code{\link{kNNdist}()}, \code{\link{sNN}()} } \author{ Michael Hahsler } \concept{NN functions} \keyword{model} ================================================ FILE: man/glosh.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/GLOSH.R \name{glosh} \alias{glosh} \alias{GLOSH} \title{Global-Local Outlier Score from Hierarchies} \usage{ glosh(x, k = 4, ...) } \arguments{ \item{x}{an \link{hclust} object, data matrix, or \link{dist} object.} \item{k}{size of the neighborhood.} \item{...}{further arguments are passed on to \code{\link[=kNN]{kNN()}}.} } \value{ A numeric vector of length equal to the size of the original data set containing GLOSH values for all data points. } \description{ Calculate the Global-Local Outlier Score from Hierarchies (GLOSH) score for each data point using a kd-tree to speed up kNN search. 
} \details{ GLOSH compares the density of a point to the densities of any points associated with current and child clusters (if any). Points that have a substantially lower density than the density mode (cluster) they most associate with are considered outliers. GLOSH is computed from a hierarchy of clusters. Specifically, consider a point \emph{x} and a density or distance threshold \emph{lambda}. GLOSH is calculated by taking 1 minus the ratio of how long any of the child clusters of the cluster that \emph{x} belongs to "survives" changes in \emph{lambda} to the highest \emph{lambda} threshold of \emph{x}, above which \emph{x} becomes a noise point. Scores close to 1 indicate outliers. For more details on the motivation for this calculation, see Campello et al (2015). } \examples{ set.seed(665544) n <- 100 x <- cbind( x=runif(10, 0, 5) + rnorm(n, sd = 0.4), y=runif(10, 0, 5) + rnorm(n, sd = 0.4) ) ### calculate GLOSH score glosh <- glosh(x, k = 3) ### distribution of outlier scores summary(glosh) hist(glosh, breaks = 10) ### simple plot function; point size is proportional to GLOSH score plot_glosh <- function(x, glosh){ plot(x, pch = ".", main = "GLOSH (k = 3)") points(x, cex = glosh*3, pch = 1, col = "red") text(x[glosh > 0.80, ], labels = round(glosh, 3)[glosh > 0.80], pos = 3) } plot_glosh(x, glosh) ### GLOSH with any hierarchy x_dist <- dist(x) x_sl <- hclust(x_dist, method = "single") x_upgma <- hclust(x_dist, method = "average") x_ward <- hclust(x_dist, method = "ward.D2") ## Compare what different linkage criteria consider as outliers glosh_sl <- glosh(x_sl, k = 3) plot_glosh(x, glosh_sl) glosh_upgma <- glosh(x_upgma, k = 3) plot_glosh(x, glosh_upgma) glosh_ward <- glosh(x_ward, k = 3) plot_glosh(x, glosh_ward) ## GLOSH is automatically computed with HDBSCAN all(hdbscan(x, minPts = 3)$outlier_scores == glosh(x, k = 3)) } \references{ Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Joerg Sander.
Hierarchical density estimates for data clustering, visualization, and outlier detection. \emph{ACM Transactions on Knowledge Discovery from Data (TKDD)} 10, no. 1 (2015). \doi{10.1145/2733381} } \seealso{ Other Outlier Detection Functions: \code{\link{kNNdist}()}, \code{\link{lof}()}, \code{\link{pointdensity}()} } \author{ Matt Piekenbrock } \concept{Outlier Detection Functions} \keyword{model} ================================================ FILE: man/hdbscan.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/hdbscan.R, R/predict.R \name{hdbscan} \alias{hdbscan} \alias{HDBSCAN} \alias{print.hdbscan} \alias{plot.hdbscan} \alias{coredist} \alias{mrdist} \alias{predict.hdbscan} \title{Hierarchical DBSCAN (HDBSCAN)} \usage{ hdbscan( x, minPts, cluster_selection_epsilon = 0, gen_hdbscan_tree = FALSE, gen_simplified_tree = FALSE, verbose = FALSE ) \method{print}{hdbscan}(x, ...) \method{plot}{hdbscan}( x, scale = "suggest", gradient = c("yellow", "red"), show_flat = FALSE, main = "HDBSCAN*", ylab = "eps value", leaflab = "none", ... ) coredist(x, minPts) mrdist(x, minPts, coredist = NULL) \method{predict}{hdbscan}(object, newdata, data, ...) } \arguments{ \item{x}{a data matrix (Euclidean distances are used) or a \link{dist} object calculated with an arbitrary distance metric.} \item{minPts}{integer; minimum size of clusters. See Details.} \item{cluster_selection_epsilon}{double; a distance threshold below which the hierarchy is processed as if running DBSCAN* with \code{eps} set to this value (see Malzer & Baum, 2020, and Details).} \item{gen_hdbscan_tree}{logical; should the robust single linkage tree be explicitly computed (see cluster tree in Chaudhuri et al, 2010).} \item{gen_simplified_tree}{logical; should the simplified hierarchy be explicitly computed (see Campello et al, 2013).} \item{verbose}{report progress.} \item{...}{additional arguments are passed on.} \item{scale}{integer; used to scale the condensed tree based on the graphics device. A lower scale results in wider colored tree lines.
The default \code{'suggest'} sets scale to the number of clusters.} \item{gradient}{character vector; the colors to build the condensed tree coloring with.} \item{show_flat}{logical; whether to draw boxes indicating the most stable clusters.} \item{main}{Title of the plot.} \item{ylab}{the label for the y axis.} \item{leaflab}{a string specifying how leaves are labeled (see \code{\link[stats:dendrogram]{stats::plot.dendrogram()}}).} \item{coredist}{numeric vector with precomputed core distances (optional).} \item{object}{clustering object.} \item{newdata}{new data points for which the cluster membership should be predicted.} \item{data}{the data set used to create the clustering object.} } \value{ \code{hdbscan()} returns an object of class \code{hdbscan} with the following components: \item{cluster }{An integer vector with cluster assignments. Zero indicates noise points.} \item{minPts }{ value of the \code{minPts} parameter.} \item{cluster_scores }{The sum of the stability scores for each salient (flat) cluster. Corresponds to the cluster IDs given in the \code{"cluster"} element. } \item{membership_prob }{The probability or individual stability of a point within its cluster. Between 0 and 1.} \item{outlier_scores }{The GLOSH outlier score of each point. } \item{hc }{An \link{hclust} object of the HDBSCAN hierarchy. } \code{coredist()} returns a vector with the core distance for each data point. \code{mrdist()} returns a \link{dist} object containing pairwise mutual reachability distances. } \description{ Fast C++ implementation of HDBSCAN (Hierarchical DBSCAN) and related algorithms. } \details{ This fast implementation of HDBSCAN (Campello et al., 2013) computes the hierarchical cluster tree representing density estimates along with the stability-based flat cluster extraction.
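The quantities underlying the hierarchy can be sketched in base R from their definitions (core distance and mutual reachability distance, described later in this section). This is only an illustration; the package computes the hierarchy in C++ via a minimum spanning tree, to which single linkage over the mutual reachability distances corresponds:

```r
# Base-R sketch of the core distance and mutual reachability distance (mrd)
# used to build the HDBSCAN hierarchy (illustrative, not the C++ code path).
x <- as.matrix(iris[1:20, 1:4])
minPts <- 5

d <- as.matrix(dist(x))
# core distance: distance to the (minPts - 1)-th nearest neighbor
# (each row of d contains the point itself at distance 0)
core <- apply(d, 1, function(row) sort(row)[minPts])
# mutual reachability distance: mrd(a, b) = max(core(a), core(b), d(a, b))
mrd <- pmax(outer(core, core, pmax), d)
diag(mrd) <- 0
# single-linkage hierarchy over mrd corresponds to the MST-based hierarchy
hc <- hclust(as.dist(mrd), method = "single")
```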
HDBSCAN essentially computes the hierarchy of all DBSCAN* clusterings, and then uses a stability-based extraction method to find optimal cuts in the hierarchy, thus producing a flat solution. HDBSCAN performs the following steps: \enumerate{ \item Compute the mutual reachability distance mrd between points (based on distances and core distances). \item Use mrd as a distance measure to construct a minimum spanning tree. \item Prune the tree using stability. \item Extract the clusters. } Additional related algorithms are supported: the "Global-Local Outlier Score from Hierarchies" (GLOSH; see section 6 of Campello et al., 2015) is available in function \code{\link[=glosh]{glosh()}}, and clustering based on instance-level constraints (see section 5.3 of Campello et al., 2015) is also available. The algorithms only need the parameter \code{minPts}. Note that \code{minPts} not only acts as a minimum cluster size to detect, but also as a "smoothing" factor of the density estimates implicitly computed by HDBSCAN. When using the optional parameter \code{cluster_selection_epsilon}, a combination between DBSCAN* and HDBSCAN* can be achieved (see Malzer & Baum 2020). This means that part of the tree is affected by \code{cluster_selection_epsilon} as if running DBSCAN* with \code{eps} = \code{cluster_selection_epsilon}. The remaining part (on levels above the threshold) is still processed by HDBSCAN*'s stability-based selection algorithm and can therefore return clusters of variable densities. Note that there is not always a remaining part, especially if the parameter value is chosen too large, or if there aren't enough clusters of variable densities. In this case, the result will be equal to DBSCAN*. \code{cluster_selection_epsilon} is most useful in cases where HDBSCAN* produces too many small clusters that need to be merged, while still being able to extract clusters of variable densities at higher levels. \code{coredist()}: The core distance is defined for each point as the distance to its (\code{minPts - 1})-th nearest neighbor.
It is a density estimate equivalent to \code{kNNdist()} with \code{k = minPts - 1}. \code{mrdist()}: The mutual reachability distance is defined between two points as \code{mrd(a, b) = max(coredist(a), coredist(b), dist(a, b))}. This distance metric is used by HDBSCAN. It has the effect of increasing distances in low density areas. \code{predict()} assigns each new data point to the same cluster as the nearest point if it is not more than that point's core distance away. Otherwise the new point is classified as a noise point (i.e., cluster ID 0). } \examples{ ## cluster the moons data set with HDBSCAN data(moons) res <- hdbscan(moons, minPts = 5) res plot(res) clplot(moons, res) ## cluster the moons data set with HDBSCAN using Manhattan distances res <- hdbscan(dist(moons, method = "manhattan"), minPts = 5) plot(res) clplot(moons, res) ## Example for HDBSCAN(e) using cluster_selection_epsilon # data with clusters of various densities. X <- data.frame( x = c( 0.08, 0.46, 0.46, 2.95, 3.50, 1.49, 6.89, 6.87, 0.21, 0.15, 0.15, 0.39, 0.80, 0.80, 0.37, 3.63, 0.35, 0.30, 0.64, 0.59, 1.20, 1.22, 1.42, 0.95, 2.70, 6.36, 6.36, 6.36, 6.60, 0.04, 0.71, 0.57, 0.24, 0.24, 0.04, 0.04, 1.35, 0.82, 1.04, 0.62, 0.26, 5.98, 1.67, 1.67, 0.48, 0.15, 6.67, 6.67, 1.20, 0.21, 3.99, 0.12, 0.19, 0.15, 6.96, 0.26, 0.08, 0.30, 1.04, 1.04, 1.04, 0.62, 0.04, 0.04, 0.04, 0.82, 0.82, 1.29, 1.35, 0.46, 0.46, 0.04, 0.04, 5.98, 5.98, 6.87, 0.37, 6.47, 6.47, 6.47, 6.67, 0.30, 1.49, 3.21, 3.21, 0.75, 0.75, 0.46, 0.46, 0.46, 0.46, 3.63, 0.39, 3.65, 4.09, 4.01, 3.36, 1.43, 3.28, 5.94, 6.35, 6.87, 5.60, 5.99, 0.12, 0.00, 0.32, 0.39, 0.00, 1.63, 1.36, 5.67, 5.60, 5.79, 1.10, 2.99, 0.39, 0.18 ), y = c( 7.41, 8.01, 8.01, 5.44, 7.11, 7.13, 1.83, 1.83, 8.22, 8.08, 8.08, 7.20, 7.83, 7.83, 8.29, 5.99, 8.32, 8.22, 7.38, 7.69, 8.22, 7.31, 8.25, 8.39, 6.34, 0.16, 0.16, 0.16, 1.66, 7.55, 7.90, 8.18, 8.32, 8.32, 7.97, 7.97, 8.15, 8.43, 7.83, 8.32, 8.29, 1.03, 7.27, 7.27, 8.08, 7.27, 0.79, 0.79, 8.22, 7.73, 6.62, 7.62,
8.39, 8.36, 1.73, 8.29, 8.04, 8.22, 7.83, 7.83, 7.83, 8.32, 8.11, 7.69, 7.55, 7.20, 7.20, 8.01, 8.15, 7.55, 7.55, 7.97, 7.97, 1.03, 1.03, 1.24, 7.20, 0.47, 0.47, 0.47, 0.79, 8.22, 7.13, 6.48, 6.48, 7.10, 7.10, 8.01, 8.01, 8.01, 8.01, 5.99, 8.04, 5.22, 5.82, 5.14, 4.81, 7.62, 5.73, 0.55, 1.31, 0.05, 0.95, 1.59, 7.99, 7.48, 8.38, 7.12, 2.01, 1.40, 0.00, 9.69, 9.47, 9.25, 2.63, 6.89, 0.56, 3.11 ) ) ## HDBSCAN splits one cluster hdb <- hdbscan(X, minPts = 3) plot(hdb, show_flat = TRUE) hullplot(X, hdb, main = "HDBSCAN") ## DBSCAN* marks the least dense cluster as outliers db <- dbscan(X, eps = 1, minPts = 3, borderPoints = FALSE) hullplot(X, db, main = "DBSCAN*") ## HDBSCAN(e) mixes HDBSCAN AND DBSCAN* to find all clusters hdbe <- hdbscan(X, minPts = 3, cluster_selection_epsilon = 1) plot(hdbe, show_flat = TRUE) hullplot(X, hdbe, main = "HDBSCAN(e)") } \references{ Campello RJGB, Moulavi D, Sander J (2013). Density-Based Clustering Based on Hierarchical Density Estimates. Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013, \emph{Lecture Notes in Computer Science} 7819, p. 160. \doi{10.1007/978-3-642-37456-2_14} Campello RJGB, Moulavi D, Zimek A, Sander J (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. \emph{ACM Transactions on Knowledge Discovery from Data (TKDD),} 10(5):1-51. \doi{10.1145/2733381} Malzer, C., & Baum, M. (2020). A Hybrid Approach To Hierarchical Density-based Cluster Selection. In 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), pp. 223-228. 
\doi{10.1109/MFI49285.2020.9235263} } \seealso{ Other clustering functions: \code{\link{dbscan}()}, \code{\link{extractFOSC}()}, \code{\link{jpclust}()}, \code{\link{ncluster}()}, \code{\link{optics}()}, \code{\link{sNNclust}()} } \author{ Matt Piekenbrock Claudia Malzer (added cluster_selection_epsilon) } \concept{HDBSCAN functions} \concept{clustering functions} \keyword{clustering} \keyword{hierarchical} \keyword{model} ================================================ FILE: man/hullplot.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/hullplot.R \name{hullplot} \alias{hullplot} \alias{clplot} \title{Plot Clusters} \usage{ hullplot( x, cl, col = NULL, pch = NULL, cex = 0.5, hull_lwd = 1, hull_lty = 1, solid = TRUE, alpha = 0.2, main = "Convex Cluster Hulls", ... ) clplot(x, cl, col = NULL, pch = NULL, cex = 0.5, main = "Cluster Plot", ...) } \arguments{ \item{x}{a data matrix. If more than 2 columns are provided, then the data is plotted using the first two principal components.} \item{cl}{a clustering. Either a numeric cluster assignment vector or a clustering object (a list with an element named \code{cluster}).} \item{col}{colors used for clusters. Defaults to the standard palette. The first color (default is black) is used for noise/unassigned points (cluster id 0).} \item{pch}{a vector of plotting characters. By default \code{o} is used for points and \code{x} for noise points.} \item{cex}{expansion factor for symbols.} \item{hull_lwd, hull_lty}{line width and line type used for the convex hull.} \item{solid, alpha}{draw filled polygons instead of just lines for the convex hulls? alpha controls the level of alpha shading.} \item{main}{main title.} \item{...}{additional arguments passed on to plot.} } \description{ This function produces a two-dimensional scatter plot of data points and colors the data points according to a supplied clustering. 
Noise points are marked as \code{x}. \code{hullplot()} also adds convex hulls to clusters. } \examples{ set.seed(2) n <- 400 x <- cbind( x = runif(4, 0, 1) + rnorm(n, sd = 0.1), y = runif(4, 0, 1) + rnorm(n, sd = 0.1) ) cl <- rep(1:4, times = 100) ### original data with true clustering clplot(x, cl, main = "True clusters") hullplot(x, cl, main = "True clusters") ### use different symbols hullplot(x, cl, main = "True clusters", pch = cl) ### just the hulls hullplot(x, cl, main = "True clusters", pch = NA) ### a version suitable for b/w printing hullplot(x, cl, main = "True clusters", solid = FALSE, col = c("grey", "black"), pch = cl) ### run some clustering algorithms and plot the results db <- dbscan(x, eps = .07, minPts = 10) clplot(x, db, main = "DBSCAN") hullplot(x, db, main = "DBSCAN") op <- optics(x, eps = 10, minPts = 10) opDBSCAN <- extractDBSCAN(op, eps_cl = .07) hullplot(x, opDBSCAN, main = "OPTICS") opXi <- extractXi(op, xi = 0.05) hullplot(x, opXi, main = "OPTICSXi") # Extract minimal 'flat' clusters only opXi <- extractXi(op, xi = 0.05, minimum = TRUE) hullplot(x, opXi, main = "OPTICSXi") km <- kmeans(x, centers = 4) hullplot(x, km, main = "k-means") hc <- cutree(hclust(dist(x)), k = 4) hullplot(x, hc, main = "Hierarchical Clustering") } \author{ Michael Hahsler } \keyword{clustering} \keyword{plot} ================================================ FILE: man/jpclust.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/jpclust.R \name{jpclust} \alias{jpclust} \alias{print.general_clustering} \title{Jarvis-Patrick Clustering} \usage{ jpclust(x, k, kt, ...) } \arguments{ \item{x}{a data matrix/data.frame (Euclidean distance is used), a precomputed \link{dist} object or a kNN object created with \code{\link[=kNN]{kNN()}}.} \item{k}{Neighborhood size for nearest neighbor sparsification.
If \code{x} is a kNN object then \code{k} may be missing.} \item{kt}{threshold on the number of shared nearest neighbors (including the points themselves) to form clusters. Range: \eqn{[1, k]}} \item{...}{additional arguments are passed on to the k nearest neighbor search algorithm. See \code{\link[=kNN]{kNN()}} for details on how to control the search strategy.} } \value{ An object of class \code{general_clustering} with the following components: \item{cluster }{An integer vector with cluster assignments. Zero indicates noise points.} \item{type }{ name of used clustering algorithm.} \item{metric }{ the distance metric used for clustering.} \item{param }{ list of used clustering parameters. } } \description{ Fast C++ implementation of Jarvis-Patrick clustering, which first builds a shared nearest neighbor graph (k nearest neighbor sparsification) and then places two points in the same cluster if they are in each other's nearest neighbor list and share at least \code{kt} nearest neighbors. } \details{ Following the original paper, the shared nearest neighbor list is constructed as the k neighbors plus the point itself (as neighbor zero). Therefore, the threshold \code{kt} needs to be in the range \eqn{[1, k]}. Fast nearest neighbor search with \code{\link[=kNN]{kNN()}} is only used if \code{x} is a matrix. In this case Euclidean distance is used. } \examples{ data("DS3") # use a shared neighborhood of 20 points and require 12 shared neighbors cl <- jpclust(DS3, k = 20, kt = 12) cl clplot(DS3, cl) # Note: JP clustering does not consider noise and thus, # the sine wave points chain clusters together. # use a precomputed kNN object instead of the original data. nn <- kNN(DS3, k = 30) nn cl <- jpclust(nn, k = 20, kt = 12) cl # cluster with noise removed (use low pointdensity to identify noise) d <- pointdensity(DS3, eps = 25) hist(d, breaks = 20) DS3_noiseless <- DS3[d > 110,] cl <- jpclust(DS3_noiseless, k = 20, kt = 10) cl clplot(DS3_noiseless, cl) } \references{ R.
A. Jarvis and E. A. Patrick. 1973. Clustering Using a Similarity Measure Based on Shared Near Neighbors. \emph{IEEE Trans. Comput. 22,} 11 (November 1973), 1025-1034. \doi{10.1109/T-C.1973.223640} } \seealso{ Other clustering functions: \code{\link{dbscan}()}, \code{\link{extractFOSC}()}, \code{\link{hdbscan}()}, \code{\link{ncluster}()}, \code{\link{optics}()}, \code{\link{sNNclust}()} } \author{ Michael Hahsler } \concept{clustering functions} \keyword{clustering} \keyword{model} ================================================ FILE: man/kNN.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/kNN.R \name{kNN} \alias{kNN} \alias{knn} \alias{sort.kNN} \alias{adjacencylist.kNN} \alias{print.kNN} \title{Find the k Nearest Neighbors} \usage{ kNN( x, k, query = NULL, sort = TRUE, search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0 ) \method{sort}{kNN}(x, decreasing = FALSE, ...) \method{adjacencylist}{kNN}(x, ...) \method{print}{kNN}(x, ...) } \arguments{ \item{x}{a data matrix, a \link{dist} object or a \link{kNN} object.} \item{k}{number of neighbors to find.} \item{query}{a data matrix with the points to query. If query is not specified, the NN for all the points in \code{x} is returned. If query is specified then \code{x} needs to be a data matrix.} \item{sort}{sort the neighbors by distance? Note that some search methods already sort the results. Sorting is expensive and \code{sort = FALSE} may be much faster for some search methods. kNN objects can be sorted using \code{sort()}.} \item{search}{nearest neighbor search strategy (one of \code{"kdtree"}, \code{"linear"} or \code{"dist"}).} \item{bucketSize}{max size of the kd-tree leafs.} \item{splitRule}{rule to split the kd-tree. One of \code{"STD"}, \code{"MIDPT"}, \code{"FAIR"}, \code{"SL_MIDPT"}, \code{"SL_FAIR"} or \code{"SUGGEST"} (SL stands for sliding). 
\code{"SUGGEST"} uses ANN's best guess.} \item{approx}{use approximate nearest neighbors. All NN up to a distance of a factor of \code{1 + approx} eps may be used. Some actual NN may be omitted, leading to spurious clusters and noise points. However, the algorithm will enjoy a significant speedup.} \item{decreasing}{sort in decreasing order?} \item{...}{further arguments} } \value{ An object of class \code{kNN} (subclass of \link{NN}) containing a list with the following components: \item{dist }{a matrix with distances. } \item{id }{a matrix with \code{ids}. } \item{k }{number \code{k} used. } \item{metric }{ used distance metric. } } \description{ This function uses a kd-tree to find all k nearest neighbors in a data matrix (including distances) fast. } \details{ \strong{Ties:} If the kth and the (k+1)th nearest neighbor are tied, then the neighbor found first is returned and the other one is ignored. \strong{Self-matches:} If no query is specified, then self-matches are removed. Details on the search parameters: \itemize{ \item \code{search} controls whether a kd-tree or a linear search is used (both are implemented in the ANN library; see Mount and Arya, 2010). Note that these implementations cannot handle NAs. \code{search = "dist"} precomputes Euclidean distances first using R. NAs are handled, but the resulting distance matrix cannot contain NAs. To use other distance measures, a precomputed distance matrix can be provided as \code{x} (\code{search} is ignored). \item \code{bucketSize} and \code{splitRule} influence how the kd-tree is built. \code{approx} uses the approximate nearest neighbor search implemented in ANN. All nearest neighbors up to a distance of \code{eps / (1 + approx)} will be considered and all with a distance greater than \code{eps} will not be considered. The other points might be considered. Note that this results in some actual nearest neighbors being omitted, leading to spurious clusters and noise points.
However, the algorithm will enjoy a significant speedup. For more details see Mount and Arya (2010). } } \examples{ data(iris) x <- iris[, -5] # Example 1: finding kNN for all points in a data matrix (using a kd-tree) nn <- kNN(x, k = 5) nn # explore neighborhood of point 10 i <- 10 nn$id[i,] plot(x, col = ifelse(seq_len(nrow(iris)) \%in\% nn$id[i,], "red", "black")) # visualize the 5 nearest neighbors plot(nn, x) # visualize a reduced 2-NN graph plot(kNN(nn, k = 2), x) # Example 2: find kNN for query points q <- x[c(1,100),] nn <- kNN(x, k = 10, query = q) plot(nn, x, col = "grey") points(q, pch = 3, lwd = 2) # Example 3: find kNN using distances d <- dist(x, method = "manhattan") nn <- kNN(d, k = 1) plot(nn, x) } \references{ David M. Mount and Sunil Arya (2010). ANN: A Library for Approximate Nearest Neighbor Searching, \url{http://www.cs.umd.edu/~mount/ANN/}. } \seealso{ Other NN functions: \code{\link{NN}}, \code{\link{comps}()}, \code{\link{frNN}()}, \code{\link{kNNdist}()}, \code{\link{sNN}()} } \author{ Michael Hahsler } \concept{NN functions} \keyword{model} ================================================ FILE: man/kNNdist.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/kNNdist.R \name{kNNdist} \alias{kNNdist} \alias{kNNdistplot} \title{Calculate and Plot k-Nearest Neighbor Distances} \usage{ kNNdist(x, k, all = FALSE, ...) kNNdistplot(x, k, minPts, ...) } \arguments{ \item{x}{the data set as a matrix of points (Euclidean distance is used) or a precalculated \link{dist} object.} \item{k}{number of nearest neighbors used for the distance calculation. 
For \code{kNNdistplot()} also a range of values for \code{k} or \code{minPts} can be specified.} \item{all}{should a matrix with the distances to all k nearest neighbors be returned?} \item{...}{further arguments (e.g., kd-tree related parameters) are passed on to \code{\link[=kNN]{kNN()}}.} \item{minPts}{to use a k-NN plot to determine a suitable \code{eps} value for \code{\link[=dbscan]{dbscan()}}, the \code{minPts} used in dbscan can be specified; this sets \code{k = minPts - 1}.} } \value{ \code{kNNdist()} returns a numeric vector with, for each point, the distance to its kth nearest neighbor. If \code{all = TRUE} then a matrix with k columns containing the distances to all 1st, 2nd, ..., kth nearest neighbors is returned instead. } \description{ Fast calculation of the k-nearest neighbor distances for a dataset represented as a matrix of points. The kNN distance is defined as the distance from a point to its kth nearest neighbor. The kNN distance plot displays the kNN distance of all points sorted from smallest to largest. The plot can be used to help find suitable parameter values for \code{\link[=dbscan]{dbscan()}}. } \examples{ data(iris) iris <- as.matrix(iris[, 1:4]) ## Find the 4-NN distance for each observation (see ?kNN ## for different search strategies) kNNdist(iris, k = 4) ## Get a matrix with distances to the 1st, 2nd, ..., 4th NN. kNNdist(iris, k = 4, all = TRUE) ## Produce a k-NN distance plot to determine a suitable eps for ## DBSCAN with MinPts = 5. Use k = 4 (= MinPts - 1).
## The knee is visible around a distance of .7 kNNdistplot(iris, k = 4) ## Look at all k-NN distance plots for k from 1 to 20 ## Note that k-NN distances are increasing in k kNNdistplot(iris, k = 1:20) cl <- dbscan(iris, eps = .7, minPts = 5) pairs(iris, col = cl$cluster + 1L) ## Note: black points are noise points } \seealso{ Other Outlier Detection Functions: \code{\link{glosh}()}, \code{\link{lof}()}, \code{\link{pointdensity}()} Other NN functions: \code{\link{NN}}, \code{\link{comps}()}, \code{\link{frNN}()}, \code{\link{kNN}()}, \code{\link{sNN}()} } \author{ Michael Hahsler } \concept{NN functions} \concept{Outlier Detection Functions} \keyword{model} \keyword{plot} ================================================ FILE: man/lof.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/LOF.R \name{lof} \alias{lof} \alias{LOF} \title{Local Outlier Factor Score} \usage{ lof(x, minPts = 5, ...) } \arguments{ \item{x}{a data matrix or a \link{dist} object.} \item{minPts}{number of nearest neighbors used in defining the local neighborhood of a point (includes the point itself).} \item{...}{further arguments are passed on to \code{\link[=kNN]{kNN()}}. Note: \code{sort} cannot be specified here since \code{lof()} always uses \code{sort = TRUE}.} } \value{ A numeric vector of length \code{nrow(x)} containing LOF values for all data points. } \description{ Calculate the Local Outlier Factor (LOF) score for each data point using a kd-tree to speed up kNN search. } \details{ LOF compares the local reachability density (lrd) of a point to the lrd of its neighbors. A LOF score of approximately 1 indicates that the lrd around the point is comparable to the lrd of its neighbors and that the point is not an outlier. Points that have a substantially lower lrd than their neighbors are considered outliers and produce scores significantly larger than 1.
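The computation described above can be sketched in base R. This is an illustrative implementation, not the package's C++ code; note that \code{k} here counts neighbors excluding the point itself, whereas \code{lof()}'s \code{minPts} includes the point:

```r
# Illustrative base-R LOF sketch (not the dbscan implementation):
# k is the neighborhood size excluding the point itself.
lof_sketch <- function(x, k = 3) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  # ids of the k nearest neighbors of each point (position 1 is the point itself)
  nn <- t(apply(d, 1, function(row) order(row)[2:(k + 1)]))
  # k-distance: distance to the k-th nearest neighbor
  kdist <- vapply(seq_len(n), function(i) d[i, nn[i, k]], numeric(1))
  # local reachability density: inverse of the mean reachability distance,
  # where reach(p, o) = max(kdist(o), d(p, o))
  lrd <- vapply(seq_len(n), function(p)
    1 / mean(pmax(kdist[nn[p, ]], d[p, nn[p, ]])), numeric(1))
  # LOF: mean ratio of the neighbors' lrd to the point's own lrd
  vapply(seq_len(n), function(p) mean(lrd[nn[p, ]]) / lrd[p], numeric(1))
}

# a tight cluster of 5 points plus one far-away point
pts <- rbind(cbind(c(0, 0.1, 0, 0.1, 0.05), c(0, 0, 0.1, 0.1, 0.05)), c(5, 5))
lof_sketch(pts, k = 3)  # the 6th point gets a LOF far above 1
```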
If a data matrix is specified, then Euclidean distances and fast nearest neighbor search using a kd-tree are used. \strong{Note on duplicate points:} If there are more than \code{minPts} duplicates of a point in the data, then the local reachability distance will be 0, resulting in an undefined LOF score of 0/0. We set LOF in this case to 1 since there is already enough density from the points in the same location to make them not outliers. The original paper by Breunig et al (2000) assumes that the points are real duplicates and suggests removing the duplicates before computing LOF. If duplicate points are removed first, then this LOF implementation in \pkg{dbscan} behaves like the one described by Breunig et al. } \examples{ set.seed(665544) n <- 100 x <- cbind( x=runif(10, 0, 5) + rnorm(n, sd = 0.4), y=runif(10, 0, 5) + rnorm(n, sd = 0.4) ) ### calculate LOF score with a neighborhood of 3 points lof <- lof(x, minPts = 3) ### distribution of outlier factors summary(lof) hist(lof, breaks = 10, main = "LOF (minPts = 3)") ### plot sorted lof. Looks like outliers start around a LOF of 2. plot(sort(lof), type = "l", main = "LOF (minPts = 3)", xlab = "Points sorted by LOF", ylab = "LOF") ### point size is proportional to LOF and mark points with a LOF > 2 plot(x, pch = ".", main = "LOF (minPts = 3)", asp = 1) points(x, cex = (lof - 1) * 2, pch = 1, col = "red") text(x[lof > 2,], labels = round(lof, 1)[lof > 2], pos = 3) } \references{ Breunig, M., Kriegel, H., Ng, R., and Sander, J. (2000). LOF: identifying density-based local outliers. In \emph{ACM Int. Conf. on Management of Data,} pages 93-104.
\doi{10.1145/335191.335388} } \seealso{ Other Outlier Detection Functions: \code{\link{glosh}()}, \code{\link{kNNdist}()}, \code{\link{pointdensity}()} } \author{ Michael Hahsler } \concept{Outlier Detection Functions} \keyword{model} ================================================ FILE: man/moons.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/moons.R \docType{data} \name{moons} \alias{moons} \title{Moons Data} \format{ A data frame with 100 observations on the following 2 variables. \describe{ \item{X}{a numeric vector} \item{Y}{a numeric vector} } } \source{ See the HDBSCAN notebook in the online documentation: \url{http://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html} } \description{ Contains 100 2-d points, half of which are contained in two moons or "blobs" (25 points each blob), and the other half in asymmetric facing crescent shapes. The three shapes are all linearly separable. } \details{ This data was generated with the following Python commands using the SciKit-Learn library: \verb{> import sklearn.datasets as data} \verb{> moons = data.make_moons(n_samples=50, noise=0.05)} \verb{> blobs = data.make_blobs(n_samples=50, centers=[(-0.75,2.25), (1.0, 2.0)], cluster_std=0.25)} \verb{> test_data = np.vstack([moons, blobs])} } \examples{ data(moons) plot(moons, pch=20) } \references{ Pedregosa, Fabian, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel et al. Scikit-learn: Machine learning in Python. \emph{Journal of Machine Learning Research} 12, no. Oct (2011): 2825-2830.
} \keyword{datasets} ================================================ FILE: man/ncluster.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/ncluster.R \name{ncluster} \alias{ncluster} \alias{nnoise} \alias{nobs} \title{Number of Clusters, Noise Points, and Observations} \usage{ ncluster(object, ...) nnoise(object, ...) } \arguments{ \item{object}{a clustering result object containing a \code{cluster} element.} \item{...}{additional arguments are unused.} } \value{ returns the number of clusters or the number of noise points. } \description{ Extract the number of clusters or the number of noise points for a clustering. This function works with any clustering result that contains a list element named \code{cluster} with a clustering vector. In addition, \code{nobs} (see \code{\link[stats:nobs]{stats::nobs()}}) is also available to retrieve the number of clustered points. } \examples{ data(iris) iris <- as.matrix(iris[, 1:4]) res <- dbscan(iris, eps = .7, minPts = 5) res ncluster(res) nnoise(res) nobs(res) # the functions also work with kmeans and other clustering algorithms. cl <- kmeans(iris, centers = 3) ncluster(cl) nnoise(cl) nobs(cl) } \seealso{ Other clustering functions: \code{\link{dbscan}()}, \code{\link{extractFOSC}()}, \code{\link{hdbscan}()}, \code{\link{jpclust}()}, \code{\link{optics}()}, \code{\link{sNNclust}()} } \concept{clustering functions} ================================================ FILE: man/optics.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/optics.R, R/predict.R \name{optics} \alias{optics} \alias{OPTICS} \alias{print.optics} \alias{plot.optics} \alias{as.reachability.optics} \alias{as.dendrogram.optics} \alias{extractDBSCAN} \alias{extractXi} \alias{predict.optics} \title{Ordering Points to Identify the Clustering Structure (OPTICS)} \usage{ optics(x, eps = NULL, minPts = 5, ...)
\method{print}{optics}(x, ...) \method{plot}{optics}(x, cluster = TRUE, predecessor = FALSE, ...) \method{as.reachability}{optics}(object, ...) \method{as.dendrogram}{optics}(object, ...) extractDBSCAN(object, eps_cl) extractXi(object, xi, minimum = FALSE, correctPredecessors = TRUE) \method{predict}{optics}(object, newdata, data, ...) } \arguments{ \item{x}{a data matrix or a \link{dist} object.} \item{eps}{upper limit of the size of the epsilon neighborhood. Limiting the neighborhood size improves performance and has no or very little impact on the ordering as long as it is not set too low. If not specified, the largest minPts-distance in the data set is used which gives the same result as infinity.} \item{minPts}{the parameter is used to identify dense neighborhoods and the reachability distance is calculated as the distance to the minPts-th nearest neighbor. Controls the smoothness of the reachability distribution. Default is 5 points.} \item{...}{additional arguments are passed on to the fixed-radius nearest neighbor search algorithm. See \code{\link[=frNN]{frNN()}} for details on how to control the search strategy.} \item{cluster, predecessor}{plot clusters and predecessors.} \item{object}{clustering object.} \item{eps_cl}{Threshold to identify clusters (\code{eps_cl <= eps}).} \item{xi}{Steepness threshold to identify clusters hierarchically using the Xi method.} \item{minimum}{logical, representing whether or not to extract the minimal (non-overlapping) clusters in the Xi clustering algorithm.} \item{correctPredecessors}{logical, correct a common artifact by pruning the steep up area for points that have predecessors not in the cluster--found by the ELKI framework, see details below.} \item{newdata}{new data points for which the cluster membership should be predicted.} \item{data}{the data set used to create the clustering object.} } \value{ An object of class \code{optics} with components: \item{eps }{ value of \code{eps} parameter.
} \item{minPts }{ value of \code{minPts} parameter. } \item{order }{ optics order for the data points in \code{x}. } \item{reachdist }{ \link{reachability} distance for each data point in \code{x}. } \item{coredist }{ core distance for each data point in \code{x}. } For \code{extractDBSCAN()}, in addition the following components are available: \item{eps_cl }{ the value of the \code{eps_cl} parameter. } \item{cluster }{ assigned cluster labels in the order of the data points in \code{x}. } For \code{extractXi()}, in addition the following components are available: \item{xi}{ the value of the \code{xi} steepness threshold. } \item{cluster }{ assigned cluster labels in the order of the data points in \code{x}.} \item{clusters_xi }{ data.frame containing the start and end of each cluster found in the OPTICS ordering. } } \description{ Implementation of the OPTICS (Ordering points to identify the clustering structure) point ordering algorithm using a kd-tree. } \details{ \strong{The algorithm} This implementation of OPTICS follows the original algorithm as described by Ankerst et al (1999). OPTICS is an ordering algorithm with methods to extract a clustering from the ordering. While using concepts similar to DBSCAN, for OPTICS \code{eps} is only an upper limit for the neighborhood size used to reduce computational complexity. Note that \code{minPts} in OPTICS has a different effect than in DBSCAN. It is used to define dense neighborhoods, but since \code{eps} is typically set rather high, this does not affect the ordering much. However, it is also used to calculate the reachability distance and larger values will make the reachability distance plot smoother. OPTICS linearly orders the data points such that points which are spatially closest become neighbors in the ordering. The closest analog to this ordering is the dendrogram produced by single-link hierarchical clustering. The algorithm also calculates the reachability distance for each point.
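The ordering and the role of the reachability distance can be sketched compactly. The following is a Python toy version under simplifying assumptions (brute-force neighbor search, `eps` treated as infinite by default); `optics_order` is an illustrative name, not the package's kd-tree implementation.

```python
import math

def optics_order(points, min_pts, eps=float("inf")):
    """Minimal OPTICS sketch: returns the point ordering and the
    reachability distance of each point (inf = undefined)."""
    n = len(points)
    def d(i, j):
        return math.dist(points[i], points[j])
    def core_dist(i):
        # distance to the min_pts-th nearest point, counting the point itself
        ds = sorted(d(i, j) for j in range(n))
        return ds[min_pts - 1] if ds[min_pts - 1] <= eps else float("inf")
    reach = [float("inf")] * n
    seen = [False] * n
    order = []
    for start in range(n):
        if seen[start]:
            continue
        seeds = {start}
        while seeds:
            i = min(seeds, key=lambda s: reach[s])   # smallest reachability next
            seeds.remove(i)
            seen[i] = True
            order.append(i)
            cd = core_dist(i)
            if cd == float("inf"):                   # not a core point
                continue
            for j in range(n):                       # update unseen eps-neighbors
                if not seen[j] and d(i, j) <= eps:
                    reach[j] = min(reach[j], max(cd, d(i, j)))
                    seeds.add(j)
    return order, reach
```

On two well-separated 1-d clusters the ordering keeps each cluster contiguous, and the large reachability value at the jump between clusters is what appears as a peak in the reachability plot.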
\code{plot()} (see \link{reachability_plot}) produces a reachability plot, which shows the reachability distance of each point in the order computed by OPTICS. Valleys represent clusters (the deeper the valley, the more dense the cluster) and high points indicate points between clusters. \strong{Specifying the data} If \code{x} is specified as a data matrix, then Euclidean distances and fast nearest neighbor lookup using a kd-tree are used. See \code{\link[=kNN]{kNN()}} for details on the parameters for the kd-tree. \strong{Extracting a clustering} Several methods to extract a clustering from the order returned by OPTICS are implemented: \itemize{ \item \code{extractDBSCAN()} extracts a clustering from an OPTICS ordering that is similar to what DBSCAN would produce with an eps set to \code{eps_cl} (see Ankerst et al, 1999). The only difference to a DBSCAN clustering is that OPTICS is not able to assign some border points and reports them instead as noise. \item \code{extractXi()} extracts clusters hierarchically as specified in Ankerst et al (1999) based on the steepness of the reachability plot. One interpretation of the \code{xi} parameter is that it classifies clusters by change in relative cluster density. The algorithm used here was originally contributed by the ELKI framework and is explained in Schubert et al (2018), but contains a set of fixes. } \strong{Predict cluster memberships} \code{predict()} requires a clustering extracted with \code{extractDBSCAN()} and then uses the predict method for \code{dbscan()}.
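Conceptually, the DBSCAN-style extraction amounts to a single pass over the OPTICS order that cuts the reachability plot at eps_cl. A simplified Python rendering of that idea (function name and plain-list inputs are illustrative assumptions; the package's extractDBSCAN() works on its optics objects instead):

```python
def extract_dbscan_cut(order, reach, core, eps_cl):
    """Walk the OPTICS order and cut the reachability plot at eps_cl.
    order: point ids in OPTICS order; reach/core: reachability and core
    distance per point id. Returns cluster labels (0 = noise)."""
    labels = [0] * len(order)
    cid = 0
    for p in order:
        if reach[p] > eps_cl:          # not density-reachable at eps_cl
            if core[p] <= eps_cl:      # ...but dense itself: starts a new cluster
                cid += 1
                labels[p] = cid
            # otherwise noise: label stays 0
        else:                          # continues the current cluster
            labels[p] = cid
    return labels
```

Each peak in the reachability plot that rises above eps_cl separates one valley (cluster) from the next.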
} \examples{ set.seed(2) n <- 400 x <- cbind( x = runif(4, 0, 1) + rnorm(n, sd = 0.1), y = runif(4, 0, 1) + rnorm(n, sd = 0.1) ) plot(x, col=rep(1:4, times = 100)) ### run OPTICS (Note: we use the default eps calculation) res <- optics(x, minPts = 10) res ### get order res$order ### plot produces a reachability plot plot(res) ### plot the order of points in the reachability plot plot(x, col = "grey") polygon(x[res$order, ]) ### extract a DBSCAN clustering by cutting the reachability plot at eps_cl res <- extractDBSCAN(res, eps_cl = .065) res plot(res) ## black is noise hullplot(x, res) ### re-cut at a higher eps threshold res <- extractDBSCAN(res, eps_cl = .07) res plot(res) hullplot(x, res) ### extract hierarchical clustering of varying density using the Xi method res <- extractXi(res, xi = 0.01) res plot(res) hullplot(x, res) # Xi cluster structure res$clusters_xi ### use OPTICS on a precomputed distance matrix d <- dist(x) res <- optics(d, minPts = 10) plot(res) } \references{ Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg Sander (1999). OPTICS: Ordering Points To Identify the Clustering Structure. \emph{ACM SIGMOD international conference on Management of data.} ACM Press. pp. 49--60. \doi{10.1145/304181.304187} Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R. \emph{Journal of Statistical Software}, 91(1), 1-30. \doi{10.18637/jss.v091.i01} Erich Schubert, Michael Gertz (2018). Improving the Cluster Structure Extracted from OPTICS Plots. In \emph{Lernen, Wissen, Daten, Analysen (LWDA 2018),} pp. 318-329. } \seealso{ Density \link{reachability}.
Other clustering functions: \code{\link{dbscan}()}, \code{\link{extractFOSC}()}, \code{\link{hdbscan}()}, \code{\link{jpclust}()}, \code{\link{ncluster}()}, \code{\link{sNNclust}()} } \author{ Michael Hahsler and Matthew Piekenbrock } \concept{clustering functions} \keyword{clustering} \keyword{model} ================================================ FILE: man/pointdensity.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/pointdensity.R \name{pointdensity} \alias{pointdensity} \alias{density} \title{Calculate Local Density at Each Data Point} \usage{ pointdensity( x, eps, type = "frequency", search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0 ) } \arguments{ \item{x}{a data matrix or a dist object.} \item{eps}{radius of the eps-neighborhood, i.e., bandwidth of the uniform kernel. For the Gaussian kde, this parameter specifies the standard deviation of the kernel.} \item{type}{\code{"frequency"}, \code{"density"}, or \code{"gaussian"}. Should the raw count of points inside the eps-neighborhood, the eps-neighborhood density estimate, or a Gaussian density estimate be returned?} \item{search, bucketSize, splitRule, approx}{algorithmic parameters for \code{\link[=frNN]{frNN()}}.} } \value{ A vector of the same length as the number of data points (rows) in \code{x} with the count or density values for each data point. } \description{ Calculate the local density at each data point, either as the number of points in the eps-neighborhood (as used in \code{dbscan()}) or with kernel density estimation (KDE) using a uniform kernel. The function uses a kd-tree for fast fixed-radius nearest neighbor search. } \details{ \code{dbscan()} estimates the density around a point as the number of points in the eps-neighborhood of the point (including the query point itself).
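The frequency-type count just described can be written down directly. A brute-force O(n^2) Python sketch (the function name is made up; pointdensity() computes the same counts with a kd-tree):

```python
import math

def frequency_density(points, eps):
    """For each point, count the points (including the point itself)
    inside its closed eps-neighborhood."""
    return [sum(1 for q in points if math.dist(p, q) <= eps) for p in points]
```

Points with a low count sit in sparse regions and are candidates for noise.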
Kernel density estimation (KDE) using a uniform kernel is just this point count in the eps-neighborhood divided by \eqn{(2\,eps\,n)}{(2 eps n)}, where \eqn{n} is the number of points in \code{x}. Alternatively, \code{type = "gaussian"} calculates a Gaussian kernel estimate where \code{eps} is used as the standard deviation. To speed up computation, a kd-tree is used to find all points within 3 times the standard deviation and these points are used for the estimate. Points with low local density often indicate noise (see e.g., Wishart (1969) and Hartigan (1975)). } \examples{ set.seed(665544) n <- 100 x <- cbind( x=runif(10, 0, 5) + rnorm(n, sd = 0.4), y=runif(10, 0, 5) + rnorm(n, sd = 0.4) ) plot(x) ### calculate density around points d <- pointdensity(x, eps = .5, type = "density") ### density distribution summary(d) hist(d, breaks = 10) ### plot with point size proportional to density plot(x, pch = 19, main = "Density (eps = .5)", cex = d*5) ### Wishart (1969) single link clustering after removing low-density noise # 1. remove noise with low density f <- pointdensity(x, eps = .5, type = "frequency") x_nonoise <- x[f >= 5,] # 2. use single-linkage on the non-noise points hc <- hclust(dist(x_nonoise), method = "single") plot(x, pch = 19, cex = .5) points(x_nonoise, pch = 19, col= cutree(hc, k = 4) + 1L) } \references{ Wishart, D. (1969), Mode Analysis: A Generalization of Nearest Neighbor which Reduces Chaining Effects, in \emph{Numerical Taxonomy,} Ed., A.J. Cole, Academic Press, 282-311. John A. Hartigan (1975), \emph{Clustering Algorithms,} John Wiley & Sons, Inc., New York, NY, USA. } \seealso{ \code{\link[=frNN]{frNN()}}, \code{\link[stats:density]{stats::density()}}.
Other Outlier Detection Functions: \code{\link{glosh}()}, \code{\link{kNNdist}()}, \code{\link{lof}()} } \author{ Michael Hahsler } \concept{Outlier Detection Functions} \keyword{model} ================================================ FILE: man/reachability.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/reachability.R \name{reachability} \alias{reachability} \alias{reachability_plot} \alias{print.reachability} \alias{plot.reachability} \alias{as.reachability} \alias{as.reachability.dendrogram} \title{Reachability Distances} \usage{ \method{print}{reachability}(x, ...) \method{plot}{reachability}( x, order_labels = FALSE, xlab = "Order", ylab = "Reachability dist.", main = "Reachability Plot", ... ) as.reachability(object, ...) \method{as.reachability}{dendrogram}(object, ...) } \arguments{ \item{x}{object of class \code{reachability}.} \item{...}{graphical parameters are passed on to \code{plot()}, or arguments for other methods.} \item{order_labels}{whether to plot text labels for each point's reachability distance.} \item{xlab}{x-axis label.} \item{ylab}{y-axis label.} \item{main}{Title of the plot.} \item{object}{any object that can be coerced to class \code{reachability}, such as an object of class \link{optics} or \link[stats:dendrogram]{stats::dendrogram}.} } \value{ An object of class \code{reachability} with components: \item{order }{order to use for the data points in \code{x}. } \item{reachdist }{reachability distance for each data point in \code{x}. } } \description{ Reachability distances can be plotted to show the hierarchical relationships between data points. The idea was originally introduced by Ankerst et al (1999) for \link{OPTICS}. Later, Sander et al (2003) showed that the visualization is useful for other hierarchical structures and introduced an algorithm to convert a \link{dendrogram} representation into a reachability plot.
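The dendrogram-to-reachability direction of this conversion can be sketched with a small recursion: plot the leaves left to right, and give each leaf the height of the merge that first joins it to the leaves already plotted. A Python sketch over a toy nested-tuple dendrogram representation (leaf, or `(left, right, height)`; this representation and the function name are assumptions of the sketch, not the package's as.reachability() machinery):

```python
def dendrogram_to_reachability(tree):
    """Convert a nested-tuple dendrogram into (order, reachdist) in the
    spirit of Sander et al. (2003). The first leaf gets inf (undefined);
    every later leaf gets the height of the merge separating it from the
    leaves to its left."""
    order, reach = [], []
    def walk(node, sep):               # sep: height joining node to plotted part
        if not isinstance(node, tuple):        # leaf
            order.append(node)
            reach.append(sep)
            return
        left, right, height = node
        walk(left, sep)                # left subtree keeps the outer separation
        walk(right, height)            # right subtree joins at this merge height
    walk(tree, float("inf"))
    return order, reach
```

Low merge heights become valleys (clusters) and high merge heights become the peaks separating them, mirroring an OPTICS reachability plot.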
} \details{ A reachability plot displays the points as vertical bars, where the height is the reachability distance between two consecutive points. The central idea behind reachability plots is that the ordering in which points are plotted identifies the underlying hierarchical density representation as mountains and valleys of high and low reachability distance. The original ordering algorithm OPTICS as described by Ankerst et al (1999) introduced the notion of reachability plots. OPTICS linearly orders the data points such that points which are spatially closest become neighbors in the ordering. Valleys represent clusters, which can be represented hierarchically. Although the ordering is crucial to the structure of the reachability plot, it is important to note that OPTICS, like DBSCAN, is not entirely deterministic and, just like for dendrograms, isomorphisms may exist. Sander et al (2003) showed that reachability plots essentially convey the same information as the more traditional dendrogram structure, and dendrograms can be converted into reachability plots. Different hierarchical representations, such as dendrograms or reachability plots, may be preferable depending on the context. In smaller datasets, cluster memberships may be more easily identifiable through a dendrogram representation, particularly if the user is already familiar with tree-like representations. For larger datasets, however, a reachability plot may be preferred for visualizing macro-level density relationships. A variety of cluster extraction methods have been proposed using reachability plots. Because these cluster extraction methods depend directly on the ordering OPTICS produces, they are part of the \code{\link[=optics]{optics()}} interface. Nonetheless, reachability plots can be created directly from other types of linkage trees, and vice versa. \emph{Note:} The reachability distance for the first point is by definition not defined (it has no preceding point).
Also, the reachability distances can be undefined when a point does not have enough neighbors in the epsilon neighborhood. We represent these undefined cases as \code{Inf} and show them in the plot as a dashed line. } \examples{ set.seed(2) n <- 20 x <- cbind( x = runif(4, 0, 1) + rnorm(n, sd = 0.1), y = runif(4, 0, 1) + rnorm(n, sd = 0.1) ) plot(x, xlim = range(x), ylim = c(min(x) - sd(x), max(x) + sd(x)), pch = 20) text(x = x, labels = seq_len(nrow(x)), pos = 3) ### run OPTICS res <- optics(x, eps = 10, minPts = 2) res ### plot produces a reachability plot. plot(res) ### Manually extract reachability components from OPTICS reach <- as.reachability(res) reach ### plot still produces a reachability plot; point ids ### (rows in the original data) can be displayed with order_labels = TRUE plot(reach, order_labels = TRUE) ### Reachability objects can be directly converted to dendrograms dend <- as.dendrogram(reach) dend plot(dend) ### A dendrogram can be converted back into a reachability object plot(as.reachability(dend)) } \references{ Ankerst, M., M. M. Breunig, H.-P. Kriegel, J. Sander (1999). OPTICS: Ordering Points To Identify the Clustering Structure. \emph{ACM SIGMOD international conference on Management of data.} ACM Press. pp. 49--60. Sander, J., X. Qin, Z. Lu, N. Niu, and A. Kovarsky (2003). Automatic extraction of clusters from hierarchical clustering representations. \emph{Pacific-Asia Conference on Knowledge Discovery and Data Mining.} Springer Berlin Heidelberg. } \seealso{ \code{\link[=optics]{optics()}}, \code{\link[=as.dendrogram]{as.dendrogram()}}, and \code{\link[stats:hclust]{stats::hclust()}}.
} \author{ Matthew Piekenbrock } \keyword{clustering} \keyword{hierarchical} \keyword{model} ================================================ FILE: man/sNN.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/sNN.R \name{sNN} \alias{sNN} \alias{snn} \alias{sort.sNN} \alias{print.sNN} \title{Find Shared Nearest Neighbors} \usage{ sNN( x, k, kt = NULL, jp = FALSE, sort = TRUE, search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0 ) \method{sort}{sNN}(x, decreasing = TRUE, ...) \method{print}{sNN}(x, ...) } \arguments{ \item{x}{a data matrix, a \link{dist} object or a \link{kNN} object.} \item{k}{number of neighbors to consider to calculate the shared nearest neighbors.} \item{kt}{minimum threshold on the number of shared nearest neighbors to build the shared nearest neighbor graph. Edges are only preserved if \code{kt} or more neighbors are shared.} \item{jp}{In regular sNN graphs, two points that are not neighbors can have shared neighbors. Jarvis and Patrick (1973) require the two points to be neighbors, otherwise the count is zeroed out. \code{TRUE} uses this behavior.} \item{sort}{sort by the number of shared nearest neighbors? Note that this is expensive and \code{sort = FALSE} is much faster. sNN objects can be sorted using \code{sort()}.} \item{search}{nearest neighbor search strategy (one of \code{"kdtree"}, \code{"linear"} or \code{"dist"}).} \item{bucketSize}{max size of the kd-tree leaves.} \item{splitRule}{rule to split the kd-tree. One of \code{"STD"}, \code{"MIDPT"}, \code{"FAIR"}, \code{"SL_MIDPT"}, \code{"SL_FAIR"} or \code{"SUGGEST"} (SL stands for sliding). \code{"SUGGEST"} uses ANN's best guess.} \item{approx}{use approximate nearest neighbors. All NN up to a distance of a factor of \verb{(1 + approx) eps} may be used. Some actual NN may be omitted leading to spurious clusters and noise points.
However, the algorithm will enjoy a significant speedup.} \item{decreasing}{logical; sort in decreasing order?} \item{...}{additional parameters are passed on.} } \value{ An object of class \code{sNN} (subclass of \link{kNN} and \link{NN}) containing a list with the following components: \item{id }{a matrix with ids. } \item{dist}{a matrix with the distances. } \item{shared }{a matrix with the number of shared nearest neighbors. } \item{k }{the value of \code{k} used. } \item{metric }{the used distance metric. } } \description{ Calculates the number of shared nearest neighbors and creates a shared nearest neighbors graph. } \details{ The number of shared nearest neighbors of two points p and q is the size of the intersection of the two points' kNN neighborhoods. Note that each point is considered to be part of its own kNN neighborhood. The range for the shared nearest neighbors is \eqn{[0, k]}. The result is an n-by-k matrix called \code{shared}. Each row is a point and the columns are the point's k nearest neighbors. The value is the count of the shared neighbors. The shared nearest neighbor graph connects a point with all its nearest neighbors if they have at least one shared neighbor. The number of shared neighbors can be used as an edge weight. Jarvis and Patrick (1973) use a slightly modified (see parameter \code{jp}) shared nearest neighbor graph for clustering. } \examples{ data(iris) x <- iris[, -5] # find kNN and add the number of shared nearest neighbors. k <- 5 nn <- sNN(x, k = k) nn # shared nearest neighbor distribution table(as.vector(nn$shared)) # explore number of shared points for the k-neighborhood of point 10 i <- 10 nn$shared[i,] plot(nn, x) # apply a threshold to create a sNN graph with edges # if more than 3 neighbors are shared. nn_3 <- sNN(nn, kt = 3) plot(nn_3, x) # get an adjacency list for the shared nearest neighbor graph adjacencylist(nn_3) } \references{ R. A. Jarvis and E. A. Patrick. 1973.
Clustering Using a Similarity Measure Based on Shared Near Neighbors. \emph{IEEE Trans. Comput.} 22, 11 (November 1973), 1025-1034. \doi{10.1109/T-C.1973.223640} } \seealso{ Other NN functions: \code{\link{NN}}, \code{\link{comps}()}, \code{\link{frNN}()}, \code{\link{kNN}()}, \code{\link{kNNdist}()} } \author{ Michael Hahsler } \concept{NN functions} \keyword{model} ================================================ FILE: man/sNNclust.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/sNNclust.R \name{sNNclust} \alias{sNNclust} \alias{snnclust} \title{Shared Nearest Neighbor Clustering} \usage{ sNNclust(x, k, eps, minPts, borderPoints = TRUE, ...) } \arguments{ \item{x}{a data matrix/data.frame (Euclidean distance is used), a precomputed \link{dist} object or a kNN object created with \code{\link[=kNN]{kNN()}}.} \item{k}{Neighborhood size for nearest neighbor sparsification to create the shared NN graph.} \item{eps}{Two objects are only reachable from each other if they share at least \code{eps} nearest neighbors. Note: this is different from the \code{eps} in DBSCAN!} \item{minPts}{minimum number of points that share at least \code{eps} nearest neighbors for a point to be considered a core point.} \item{borderPoints}{should border points be assigned to clusters like in \link{DBSCAN}?} \item{...}{additional arguments are passed on to the k nearest neighbor search algorithm. See \code{\link[=kNN]{kNN()}} for details on how to control the search strategy.} } \value{ An object of class \code{general_clustering} with the following components: \item{cluster }{An integer vector with cluster assignments. Zero indicates noise points.} \item{type }{ name of the used clustering algorithm.} \item{param }{ list of the used clustering parameters. } } \description{ Implements the shared nearest neighbor clustering algorithm by Ertoz, Steinbach and Kumar (2003).
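The shared nearest neighbor count that this clustering builds on can be sketched by brute force. This Python sketch follows one reading of the definition in sNN()'s details, where each point is part of its own size-k neighborhood so counts fall in [0, k]; the tie breaking and the function name are assumptions of the sketch, not the package's code:

```python
import math

def shared_nn_counts(points, k):
    """Shared-nearest-neighbor counts |N(p) & N(q)| for every pair, where
    N(p) is p itself plus its k-1 nearest other points (so |N(p)| = k).
    Brute force, O(n^2 log n); ties broken by index."""
    n = len(points)
    nbhd = []
    for i in range(n):
        by_dist = sorted(range(n),
                         key=lambda j: (math.dist(points[i], points[j]), j))
        nbhd.append(set(by_dist[:k]))   # by_dist[0] is i itself (distance 0)
    return [[len(nbhd[i] & nbhd[j]) for j in range(n)] for i in range(n)]
```

Points deep inside the same cluster share many neighbors, while points in different clusters share few or none, which is what the eps threshold of sNNclust() exploits.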
} \details{ \strong{Algorithm:} \enumerate{ \item Construct a shared nearest neighbor graph for a given k. The edge weights are the number of shared k nearest neighbors (in the range of \eqn{[0, k]}). \item Find each point's SNN density, i.e., the number of points which have a similarity of \code{eps} or greater. \item Find the core points, i.e., all points that have an SNN density greater than \code{MinPts}. \item Form clusters from the core points and assign border points (i.e., non-core points which share at least \code{eps} neighbors with a core point). } Note that steps 2-4 are equivalent to the DBSCAN algorithm (see \code{\link[=dbscan]{dbscan()}}) and that \code{eps} has a different meaning than for DBSCAN. Here it is a threshold on the number of shared neighbors (see \code{\link[=sNN]{sNN()}}) which defines a similarity. } \examples{ data("DS3") # Out of the k = 20 NN, 7 (eps) have to be shared to create a link in the sNN graph. # A point needs at least 16 (minPts) links in the sNN graph to be a core point. # Noise points have cluster id 0 and are shown in black. cl <- sNNclust(DS3, k = 20, eps = 7, minPts = 16) cl clplot(DS3, cl) } \references{ Levent Ertoz, Michael Steinbach, Vipin Kumar, Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data, \emph{SIAM International Conference on Data Mining,} 2003, 47-59.
\doi{10.1137/1.9781611972733.5} } \seealso{ Other clustering functions: \code{\link{dbscan}()}, \code{\link{extractFOSC}()}, \code{\link{hdbscan}()}, \code{\link{jpclust}()}, \code{\link{ncluster}()}, \code{\link{optics}()} } \author{ Michael Hahsler } \concept{clustering functions} \keyword{clustering} \keyword{model} ================================================ FILE: src/ANN/ANN.cpp ================================================ //---------------------------------------------------------------------- // File: ANN.cpp // Programmer: Sunil Arya and David Mount // Description: Methods for ANN.h and ANNx.h // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. 
//---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.0 04/01/05 // Added performance counting to annDist() // Modified 2/28/08 // Added cstdlib and std:: //---------------------------------------------------------------------- #include <cstdlib> #include "ANNx.h" // all ANN includes #include "ANNperf.h" // ANN performance //using namespace std; // make std:: accessible #include //---------------------------------------------------------------------- // Point methods //---------------------------------------------------------------------- //---------------------------------------------------------------------- // Distance utility. // (Note: In the nearest neighbor search, most distances are // computed using partial distance calculations, not this // procedure.) //---------------------------------------------------------------------- ANNdist annDist( // interpoint squared distance int dim, ANNpoint p, ANNpoint q) { int d; ANNcoord diff; ANNcoord dist; dist = 0; for (d = 0; d < dim; d++) { diff = p[d] - q[d]; dist = ANN_SUM(dist, ANN_POW(diff)); } ANN_FLOP(3*dim) // performance counts ANN_PTS(1) ANN_COORD(dim) return dist; } //---------------------------------------------------------------------- // annPrintPt() prints a point to a given output stream. //---------------------------------------------------------------------- void annPrintPt( // print a point ANNpoint pt, // the point int dim, // the dimension std::ostream &out) // output stream { for (int j = 0; j < dim; j++) { out << pt[j]; if (j < dim-1) out << " "; } } //---------------------------------------------------------------------- // Point allocation/deallocation: // // Because points (somewhat like strings in C) are stored // as pointers, creating and destroying // copies of points may require storage allocation. These // procedures do this.
// // annAllocPt() and annDeallocPt() allocate and deallocate // storage for a single point, and return a pointer to it. // // annAllocPts() allocates an array of points as well as a place // to store their coordinates, and initializes the points to // point to their respective coordinates. It allocates point // storage in a contiguous block large enough to store all the // points. It performs no initialization. // // annDeallocPts() should only be used on point arrays allocated // by annAllocPts since it assumes that points are allocated in // a block. // // annCopyPt() copies a point taking care to allocate storage // for the new point. // // annAssignRect() assigns the coordinates of one rectangle to // another. The two rectangles must have the same dimension // (and it is not possible to test this here). //---------------------------------------------------------------------- ANNpoint annAllocPt(int dim, ANNcoord c) // allocate 1 point { ANNpoint p = new ANNcoord[dim]; for (int i = 0; i < dim; i++) p[i] = c; return p; } ANNpointArray annAllocPts(int n, int dim) // allocate n pts in dim { ANNpointArray pa = new ANNpoint[n]; // allocate points ANNpoint p = new ANNcoord[n*dim]; // allocate space for coords for (int i = 0; i < n; i++) { pa[i] = &(p[i*dim]); } return pa; } void annDeallocPt(ANNpoint &p) // deallocate 1 point { delete [] p; p = NULL; } void annDeallocPts(ANNpointArray &pa) // deallocate points { delete [] pa[0]; // dealloc coordinate storage delete [] pa; // dealloc points pa = NULL; } ANNpoint annCopyPt(int dim, ANNpoint source) // copy point { ANNpoint p = new ANNcoord[dim]; for (int i = 0; i < dim; i++) p[i] = source[i]; return p; } // assign one rect to another void annAssignRect(int dim, ANNorthRect &dest, const ANNorthRect &source) { for (int i = 0; i < dim; i++) { dest.lo[i] = source.lo[i]; dest.hi[i] = source.hi[i]; } } // is point inside rectangle?
ANNbool ANNorthRect::inside(int dim, ANNpoint p) { for (int i = 0; i < dim; i++) { if (p[i] < lo[i] || p[i] > hi[i]) return ANNfalse; } return ANNtrue; } //---------------------------------------------------------------------- // Error handler //---------------------------------------------------------------------- void annError(const char *msg, ANNerr level) { if (level == ANNabort) { //cerr << "ANN: ERROR------->" << msg << "<-------------ERROR\n"; Rprintf("ANN Fatal ERROR: %s", msg); // std::exit(1); } else { //cerr << "ANN: WARNING----->" << msg << "<-------------WARNING\n"; Rprintf("ANN WARNING: %s", msg); } } //---------------------------------------------------------------------- // Limit on number of points visited // We have an option for terminating the search early if the // number of points visited exceeds some threshold. If the // threshold is 0 (its default) this means there is no limit // and the algorithm applies its normal termination condition. // This is for applications where there are real time constraints // on the running time of the algorithm. //---------------------------------------------------------------------- int ANNmaxPtsVisited = 0; // maximum number of pts visited int ANNptsVisited; // number of pts visited in search //---------------------------------------------------------------------- // Global function declarations //---------------------------------------------------------------------- void annMaxPtsVisit( // set limit on max. pts to visit in search int maxPts) // the limit { ANNmaxPtsVisited = maxPts; } ================================================ FILE: src/ANN/ANN.h ================================================ //---------------------------------------------------------------------- // File: ANN.h // Programmer: Sunil Arya and David Mount // Last modified: 05/03/05 (Release 1.1) // Description: Basic include file for approximate nearest // neighbor searching. 
//---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.0 04/01/05 // Added copyright and revision information // Added ANNcoordPrec for coordinate precision. // Added methods theDim, nPoints, maxPoints, thePoints to ANNpointSet. // Cleaned up C++ structure for modern compilers // Revision 1.1 05/03/05 // Added fixed-radius k-NN searching //---------------------------------------------------------------------- //---------------------------------------------------------------------- // ANN - approximate nearest neighbor searching // ANN is a library for approximate nearest neighbor searching, // based on the use of standard and priority search in kd-trees // and balanced box-decomposition (bbd) trees. Here are some // references to the main algorithmic techniques used here: // // kd-trees: // Friedman, Bentley, and Finkel, ``An algorithm for finding // best matches in logarithmic expected time,'' ACM // Transactions on Mathematical Software, 3(3):209-226, 1977. // // Priority search in kd-trees: // Arya and Mount, ``Algorithms for fast vector quantization,'' // Proc. of DCC '93: Data Compression Conference, eds. J. A. // Storer and M. Cohn, IEEE Press, 1993, 381-390. 
//
// Approximate nearest neighbor search and bbd-trees:
//		Arya, Mount, Netanyahu, Silverman, and Wu, ``An optimal
//		algorithm for approximate nearest neighbor searching,''
//		5th Ann. ACM-SIAM Symposium on Discrete Algorithms,
//		1994, 573-582.
//----------------------------------------------------------------------

#ifndef ANN_H
#define ANN_H

#ifdef Win32
//----------------------------------------------------------------------
// For Microsoft Visual C++, externally accessible symbols must be
// explicitly indicated with DLL_API, which is somewhat like "extern."
//
// The following ifdef block is the standard way of creating macros
// which make exporting from a DLL simpler.  All files within this DLL
// are compiled with the DLL_EXPORTS preprocessor symbol defined on the
// command line.  In contrast, projects that use (or import) the DLL
// objects do not define the DLL_EXPORTS symbol.  This way any other
// project whose source files include this file sees DLL_API functions
// as being imported from a DLL, whereas this DLL sees symbols defined
// with this macro as being exported.
//----------------------------------------------------------------------
#ifdef DLL_EXPORTS
	#define DLL_API __declspec(dllexport)
#else
	#define DLL_API __declspec(dllimport)
#endif
//----------------------------------------------------------------------
// DLL_API is ignored for all other systems
//----------------------------------------------------------------------
#else
	#define DLL_API
#endif

//----------------------------------------------------------------------
// basic includes
//----------------------------------------------------------------------

#include <cmath>			// math includes
#include <iostream>			// I/O streams
#include <vector>			// STL vectors (used by annkFRSearch2 below)

//----------------------------------------------------------------------
//	Limits
//		There are a number of places where we use the maximum double
//		value as default initializers (and others may be used,
//		depending on the data/distance representation).
//		These can usually be found in <limits.h> (as LONG_MAX, INT_MAX)
//		or in <float.h> (as DBL_MAX, FLT_MAX).
//
//		Not all systems have these files.  If you are using such a
//		system, you should set the preprocessor symbol ANN_NO_LIMITS_H
//		when compiling, and modify the statements below to generate the
//		appropriate value.  For practical purposes, this does not need
//		to be the maximum double value.  It is sufficient that it be at
//		least as large as the maximum squared distance between any two
//		points.
//----------------------------------------------------------------------

#ifdef ANN_NO_LIMITS_H					// limits.h unavailable
	#include <values.h>					// replacement for limits.h
	const double ANN_DBL_MAX = MAXDOUBLE;	// insert maximum double
#else
	#include <limits.h>
	#include <float.h>
	const double ANN_DBL_MAX = DBL_MAX;
#endif

#define ANNversion		"1.0"			// ANN version and information
#define ANNversionCmt	""
#define ANNcopyright	"David M. Mount and Sunil Arya"
#define ANNlatestRev	"Mar 1, 2005"

//----------------------------------------------------------------------
//	ANNbool
//	This is a simple boolean type.  Although ANSI C++ is supposed to
//	support the type bool, some compilers do not have it.
//----------------------------------------------------------------------

enum ANNbool {ANNfalse = 0, ANNtrue = 1};	// ANN boolean type (non ANSI C++)

//----------------------------------------------------------------------
//	ANNcoord, ANNdist
//		ANNcoord and ANNdist are the types used for representing
//		point coordinates and distances.  They can be modified by the
//		user, with some care.  It is assumed that they are both numeric
//		types, and that ANNdist is generally of an equal or higher type
//		from ANNcoord.  A variable of type ANNdist should be large
//		enough to store the sum of squared components of a variable
//		of type ANNcoord for the number of dimensions needed in the
//		application.
//		For example, the following combinations are legal:
//
//		ANNcoord		ANNdist
//		---------		-------------------------------
//		short			short, int, long, float, double
//		int				int, long, float, double
//		long			long, float, double
//		float			float, double
//		double			double
//
//		It is the user's responsibility to make sure that overflow does
//		not occur in distance calculation.
//----------------------------------------------------------------------

typedef double	ANNcoord;				// coordinate data type
typedef double	ANNdist;				// distance data type

//----------------------------------------------------------------------
//	ANNidx
//		ANNidx is a point index.  When the data structure is built, the
//		points are given as an array.  Nearest neighbor results are
//		returned as an integer index into this array.  To make it
//		clearer when this is happening, we define the integer type
//		ANNidx.  Indexing starts from 0.
//
//		For fixed-radius near neighbor searching, it is possible that
//		there are not k nearest neighbors within the search radius.  To
//		indicate this, the algorithm returns ANN_NULL_IDX as its result.
//		It should be distinguishable from any valid array index.
//----------------------------------------------------------------------

typedef int		ANNidx;					// point index
const ANNidx	ANN_NULL_IDX = -1;		// a NULL point index

//----------------------------------------------------------------------
//	Infinite distance:
//		The code assumes that there is an "infinite distance" which it
//		uses to initialize distances before performing nearest neighbor
//		searches.  It should be as large or larger than any legitimate
//		nearest neighbor distance.
//
//		On most systems, these should be found in the standard include
//		file <limits.h> or possibly <float.h>.  If you do not have
//		these files, some suggested values are listed below, assuming
//		64-bit long, 32-bit int and 16-bit short.
//
//		ANNdist ANN_DIST_INF	Values (see <limits.h> or <float.h>)
//		------- ------------	------------------------------------
//		double	DBL_MAX			1.79769313486231570e+308
//		float	FLT_MAX			3.40282346638528860e+38
//		long	LONG_MAX		0x7fffffffffffffff
//		int		INT_MAX			0x7fffffff
//		short	SHRT_MAX		0x7fff
//----------------------------------------------------------------------

const ANNdist	ANN_DIST_INF = ANN_DBL_MAX;

//----------------------------------------------------------------------
//	Significant digits for tree dumps:
//		When floating point coordinates are used, the routine that
//		dumps a tree needs to know roughly how many significant digits
//		there are in an ANNcoord, so it can output points to full
//		precision.  This is defined to be ANNcoordPrec.  On most
//		systems these values can be found in the standard include files
//		<limits.h> or <float.h>.  For integer types, the value is
//		essentially ignored.
//
//		ANNcoord ANNcoordPrec	Values (see <limits.h> or <float.h>)
//		-------- ------------	------------------------------------
//		double	 DBL_DIG		15
//		float	 FLT_DIG		6
//		long	 doesn't matter	19
//		int		 doesn't matter	10
//		short	 doesn't matter	5
//----------------------------------------------------------------------

#ifdef DBL_DIG							// number of sig. digits in ANNcoord
	const int	 ANNcoordPrec = DBL_DIG;
#else
	const int	 ANNcoordPrec = 15;		// default precision
#endif

//----------------------------------------------------------------------
// Self match?
//		In some applications, the nearest neighbor of a point is not
//		allowed to be the point itself.  This occurs, for example, when
//		computing all nearest neighbors in a set.  By setting the
//		parameter ANN_ALLOW_SELF_MATCH to ANNfalse, the nearest
//		neighbor is the closest point whose distance from the query
//		point is strictly positive.
//----------------------------------------------------------------------

const ANNbool ANN_ALLOW_SELF_MATCH = ANNtrue;
//const ANNbool ANN_ALLOW_SELF_MATCH = ANNfalse;

//----------------------------------------------------------------------
//	Norms and metrics:
//		ANN supports any Minkowski norm for defining distance.  In
//		particular, for any p >= 1, the L_p Minkowski norm defines the
//		length of a d-vector (v0, v1, ..., v(d-1)) to be
//
//				(|v0|^p + |v1|^p + ... + |v(d-1)|^p)^(1/p),
//
//		(where ^ denotes exponentiation, and |.| denotes absolute
//		value).  The distance between two points is defined to be the
//		norm of the vector joining them.  Some common distance metrics
//		include
//
//				Euclidean metric		p = 2
//				Manhattan metric		p = 1
//				Max metric				p = infinity
//
//		In the case of the max metric, the norm is computed by taking
//		the maxima of the absolute values of the components.  ANN is
//		highly "coordinate-based" and does not support general distance
//		functions (e.g. those obeying just the triangle inequality).
//		It also does not support distance functions based on
//		inner-products.
//
//		For the purpose of computing nearest neighbors, it is not
//		necessary to compute the final power (1/p).  Thus the only
//		component that is used by the program is |v(i)|^p.
//
//		ANN parameterizes the distance computation through the
//		following macros.  (Macros are used rather than procedures for
//		efficiency.)  Recall that the distance between two points is
//		given by the length of the vector joining them, and the length
//		or norm of a vector v is given by the formula:
//
//				|v| = ROOT(POW(v0) # POW(v1) # ...
//								 # POW(v(d-1)))
//
//		where ROOT and POW are unary functions and # is an associative
//		and commutative binary operator mapping the following types:
//
//			**	POW:	ANNcoord			--> ANNdist
//			**	#:		ANNdist x ANNdist	--> ANNdist
//			**	ROOT:	ANNdist (>0)		--> double
//
//		For early termination in distance calculation (partial distance
//		calculation) we assume that POW and # together are monotonically
//		increasing on sequences of arguments, meaning that for all
//		v0..vk and y:
//
//		POW(v0) #...# POW(vk) <= (POW(v0) #...# POW(vk)) # POW(y).
//
//	Incremental Distance Calculation:
//		The program uses an optimized method of computing distances for
//		kd-trees and bd-trees, called incremental distance calculation.
//		It is used when distances are to be updated when only a single
//		coordinate of a point has been changed.  In order to use this,
//		we assume that there is an incremental update function DIFF(x,y)
//		for #, such that if:
//
//					s = x0 # ... # xi # ... # xk
//
//		then if s' is equal to s but with xi replaced by y, that is,
//
//					s' = x0 # ... # y # ... # xk
//
//		then the length of s' can be computed by:
//
//					|s'| = |s| # DIFF(xi,y).
//
//		Thus, if # is + then DIFF(x,y) is (y-x).  For the L_infinity
//		norm we make use of the fact that in the program this function
//		is only invoked when y > x, and hence DIFF(x,y) = y.
//
//		Finally, for approximate nearest neighbor queries we assume
//		that POW and ROOT are related such that
//
//					v*ROOT(x) = ROOT(POW(v)*x)
//
//		Here are the values for the various Minkowski norms:
//
//		L_p:	p even:						p odd:
//				-------------------------	------------------------
//				POW(v)		= v^p			POW(v)		= |v|^p
//				ROOT(x)		= x^(1/p)		ROOT(x)		= x^(1/p)
//				#			= +				#			= +
//				DIFF(x,y)	= y - x			DIFF(x,y)	= y - x
//
//		L_inf:
//				POW(v)		= |v|
//				ROOT(x)		= x
//				#			= max
//				DIFF(x,y)	= y
//
//		By default the Euclidean norm is assumed.  To change the norm,
//		uncomment the appropriate set of macros below.
//---------------------------------------------------------------------- //---------------------------------------------------------------------- // Use the following for the Euclidean norm //---------------------------------------------------------------------- #define ANN_POW(v) ((v)*(v)) #define ANN_ROOT(x) sqrt(x) #define ANN_SUM(x,y) ((x) + (y)) #define ANN_DIFF(x,y) ((y) - (x)) //---------------------------------------------------------------------- // Use the following for the L_1 (Manhattan) norm //---------------------------------------------------------------------- // #define ANN_POW(v) fabs(v) // #define ANN_ROOT(x) (x) // #define ANN_SUM(x,y) ((x) + (y)) // #define ANN_DIFF(x,y) ((y) - (x)) //---------------------------------------------------------------------- // Use the following for a general L_p norm //---------------------------------------------------------------------- // #define ANN_POW(v) pow(fabs(v),p) // #define ANN_ROOT(x) pow(fabs(x),1/p) // #define ANN_SUM(x,y) ((x) + (y)) // #define ANN_DIFF(x,y) ((y) - (x)) //---------------------------------------------------------------------- // Use the following for the L_infinity (Max) norm //---------------------------------------------------------------------- // #define ANN_POW(v) fabs(v) // #define ANN_ROOT(x) (x) // #define ANN_SUM(x,y) ((x) > (y) ? (x) : (y)) // #define ANN_DIFF(x,y) (y) //---------------------------------------------------------------------- // Array types // The following array types are of basic interest. A point is // just a dimensionless array of coordinates, a point array is a // dimensionless array of points. A distance array is a // dimensionless array of distances and an index array is a // dimensionless array of point indices. The latter two are used // when returning the results of k-nearest neighbor queries. 
//----------------------------------------------------------------------

typedef ANNcoord* ANNpoint;			// a point
typedef ANNpoint* ANNpointArray;	// an array of points
typedef ANNdist*  ANNdistArray;		// an array of distances
typedef ANNidx*   ANNidxArray;		// an array of point indices

//----------------------------------------------------------------------
//	Basic point and array utilities:
//		The following procedures are useful supplements to ANN's
//		nearest neighbor capabilities.
//
//		annDist():
//			Computes the (squared) distance between a pair of points.
//			Note that this routine is not used internally by ANN for
//			computing distance calculations.  For reasons of efficiency
//			this is done using incremental distance calculation.  Thus,
//			this routine cannot be modified as a method of changing the
//			metric.
//
//		Because points (somewhat like strings in C) are stored as
//		pointers, creating and destroying copies of points may require
//		storage allocation.  These procedures do this.
//
//		annAllocPt() and annDeallocPt():
//			Allocate and deallocate storage for a single point, and
//			return a pointer to it.  The argument to AllocPt() is
//			used to initialize all components.
//
//		annAllocPts() and annDeallocPts():
//			Allocate and deallocate an array of points as well as a
//			place to store their coordinates, and initialize the
//			points to point to their respective coordinates.  Point
//			storage is allocated in a contiguous block large enough
//			to store all the points.  No initialization is performed.
//
//		annCopyPt():
//			Creates a copy of a given point, allocating space for
//			the new point.  It returns a pointer to the newly
//			allocated copy.
//----------------------------------------------------------------------

DLL_API ANNdist annDist(
	int				dim,			// dimension of space
	ANNpoint		p,				// points
	ANNpoint		q);

DLL_API ANNpoint annAllocPt(
	int				dim,			// dimension
	ANNcoord		c = 0);			// coordinate value (all equal)

DLL_API ANNpointArray annAllocPts(
	int				n,				// number of points
	int				dim);			// dimension

DLL_API void annDeallocPt(
	ANNpoint		&p);			// deallocate 1 point

DLL_API void annDeallocPts(
	ANNpointArray	&pa);			// point array

DLL_API ANNpoint annCopyPt(
	int				dim,			// dimension
	ANNpoint		source);		// point to copy

//----------------------------------------------------------------------
// Overall structure: ANN supports a number of different data structures
// for approximate and exact nearest neighbor searching.  These are:
//
//		ANNbruteForce	A simple brute-force search structure.
//		ANNkd_tree		A kd-tree search structure.
//		ANNbd_tree		A bd-tree search structure (a kd-tree with
//						shrink capabilities).
//
//		At a minimum, each of these data structures supports k-nearest
//		neighbor queries.  The nearest neighbor query, annkSearch,
//		returns an integer identifier and the distance to the nearest
//		neighbor(s), and annRangeSearch returns the nearest points that
//		lie within a given query ball.
//
//		Each structure is built by invoking the appropriate constructor
//		and passing it (at a minimum) the array of points, the total
//		number of points and the dimension of the space.  Each structure
//		is also assumed to support a destructor and member functions
//		that return basic information about the point set.
//
//		Note that the array of points is not copied by the data
//		structure (for reasons of space efficiency), and it is assumed
//		to be constant throughout the lifetime of the search structure.
//
//		The search algorithm, annkSearch, is given the query point (q),
//		the desired number of nearest neighbors to report (k), and the
//		error bound (eps) (whose default value is 0, implying exact
//		nearest neighbors).
//		It returns two arrays which are assumed to contain at least
//		k elements: one (nn_idx) contains the indices (within the
//		point array) of the nearest neighbors and the other (dd)
//		contains the squared distances to these nearest neighbors.
//
//		The search algorithm, annkFRSearch, is a fixed-radius kNN
//		search.  In addition to a query point, it is given a (squared)
//		radius bound.  (This is done for consistency, because the
//		search returns distances as squared quantities.)  It does two
//		things.  First, it computes the k nearest neighbors within the
//		radius bound, and second, it returns the total number of points
//		lying within the radius bound.  It is permitted to set k = 0,
//		in which case it effectively answers a range counting query.
//		If the error bound epsilon is positive, then the search is
//		approximate in the sense that it is free to ignore any point
//		that lies outside a ball of radius r/(1+epsilon), where r is
//		the given (unsquared) radius bound.
//
//		The generic object from which all the search structures are
//		derived is given below.  It is a virtual object, and is useless
//		by itself.
//----------------------------------------------------------------------

class DLL_API ANNpointSet {
public:
	virtual ~ANNpointSet() {}			// virtual destructor

	virtual void annkSearch(			// approx k near neighbor search
		ANNpoint		q,				// query point
		int				k,				// number of near neighbors to return
		ANNidxArray		nn_idx,			// nearest neighbor array (modified)
		ANNdistArray	dd,				// dist to near neighbors (modified)
		double			eps=0.0			// error bound
		) = 0;							// pure virtual (defined elsewhere)

	virtual int annkFRSearch(			// approx fixed-radius kNN search
		ANNpoint		q,				// query point
		ANNdist			sqRad,			// squared radius
		int				k = 0,			// number of near neighbors to return
		ANNidxArray		nn_idx = NULL,	// nearest neighbor array (modified)
		ANNdistArray	dd = NULL,		// dist to near neighbors (modified)
		double			eps=0.0			// error bound
		) = 0;							// pure virtual (defined elsewhere)

	virtual std::pair< std::vector<int>, std::vector<double> >
	annkFRSearch2(						// approx fixed-radius kNN search
		ANNpoint		q,				// query point
		ANNdist			sqRad,			// squared radius
		double			eps=0.0			// error bound
		) = 0;							// pure virtual (defined elsewhere)

	virtual int theDim() = 0;			// return dimension of space
	virtual int nPoints() = 0;			// return number of points
										// return pointer to points
	virtual ANNpointArray thePoints() = 0;
};

//----------------------------------------------------------------------
//	Brute-force nearest neighbor search:
//		The brute-force search structure is very simple but inefficient.
//		It has been provided primarily for the sake of comparison with
//		and validation of the more complex search structures.
//
//		Query processing is the same as described above, but the value
//		of epsilon is ignored, since all distance calculations are
//		performed exactly.
//
//		WARNING: This data structure is very slow, and should not be
//		used unless the number of points is very small.
//
//		Internal information:
//		---------------------
//		This data structure basically consists of the array of points
//		(each a pointer to an array of coordinates).
//		The search is performed by a simple linear scan of all the
//		points.
//----------------------------------------------------------------------

class DLL_API ANNbruteForce: public ANNpointSet {
	int				dim;				// dimension
	int				n_pts;				// number of points
	ANNpointArray	pts;				// point array
public:
	ANNbruteForce(						// constructor from point array
		ANNpointArray	pa,				// point array
		int				n,				// number of points
		int				dd);			// dimension

	~ANNbruteForce();					// destructor

	void annkSearch(					// approx k near neighbor search
		ANNpoint		q,				// query point
		int				k,				// number of near neighbors to return
		ANNidxArray		nn_idx,			// nearest neighbor array (modified)
		ANNdistArray	dd,				// dist to near neighbors (modified)
		double			eps=0.0);		// error bound

	int annkFRSearch(					// approx fixed-radius kNN search
		ANNpoint		q,				// query point
		ANNdist			sqRad,			// squared radius
		int				k = 0,			// number of near neighbors to return
		ANNidxArray		nn_idx = NULL,	// nearest neighbor array (modified)
		ANNdistArray	dd = NULL,		// dist to near neighbors (modified)
		double			eps=0.0);		// error bound

	std::pair< std::vector<int>, std::vector<double> >
	annkFRSearch2(						// approx fixed-radius kNN search
		ANNpoint		q,				// query point
		ANNdist			sqRad,			// squared radius
		double			eps=0.0);		// error bound

	int theDim()						// return dimension of space
		{ return dim; }
	int nPoints()						// return number of points
		{ return n_pts; }
	ANNpointArray thePoints()			// return pointer to points
		{ return pts; }
};

//----------------------------------------------------------------------
//	kd- and bd-tree splitting and shrinking rules
//		kd-trees support a collection of different splitting rules.
//		In addition to the standard kd-tree splitting rule proposed
//		by Friedman, Bentley, and Finkel, we have introduced a
//		number of other splitting rules, which seem to perform
//		as well or better (for the distributions we have tested).
//
//		The splitting methods given below allow the user to tailor
//		the data structure to the particular data set.
//		They are described in greater detail in the kd_split.cc source
//		file.  The method ANN_KD_SUGGEST is the method chosen (rather
//		subjectively) by the implementors as the one giving the
//		fastest performance, and is the default splitting method.
//
//		As with splitting rules, there are a number of different
//		shrinking rules.  The shrinking rule ANN_BD_NONE does no
//		shrinking (and hence produces a kd-tree).  The rule
//		ANN_BD_SUGGEST uses the implementors' favorite rule.
//----------------------------------------------------------------------

enum ANNsplitRule {
		ANN_KD_STD		= 0,	// the optimized kd-splitting rule
		ANN_KD_MIDPT	= 1,	// midpoint split
		ANN_KD_FAIR		= 2,	// fair split
		ANN_KD_SL_MIDPT	= 3,	// sliding midpoint splitting method
		ANN_KD_SL_FAIR	= 4,	// sliding fair split method
		ANN_KD_SUGGEST	= 5};	// the authors' suggestion for best
const int ANN_N_SPLIT_RULES = 6;	// number of split rules

enum ANNshrinkRule {
		ANN_BD_NONE		= 0,	// no shrinking at all (just kd-tree)
		ANN_BD_SIMPLE	= 1,	// simple splitting
		ANN_BD_CENTROID	= 2,	// centroid splitting
		ANN_BD_SUGGEST	= 3};	// the authors' suggested choice
const int ANN_N_SHRINK_RULES = 4;	// number of shrink rules

//----------------------------------------------------------------------
//	kd-tree:
//		The main search data structure supported by ANN is a kd-tree.
//		The main constructor is given a set of points and a choice of
//		splitting method to use in building the tree.
//
//		Construction:
//		-------------
//		The constructor is given the point array, number of points,
//		dimension, bucket size (default = 1), and the splitting rule
//		(default = ANN_KD_SUGGEST).  The point array is not copied, and
//		is assumed to be kept constant throughout the lifetime of the
//		search structure.  There is also a "load" constructor that
//		builds a tree from a file description that was created by the
//		Dump operation.
//
//		Search:
//		-------
//		There are two search methods:
//
//			Standard search (annkSearch()):
//				Searches nodes in tree-traversal order, always visiting
//				the closer child first.
//			Priority search (annkPriSearch()):
//				Searches nodes in order of increasing distance of the
//				associated cell from the query point.  For many
//				distributions the standard search seems to work just
//				fine, but priority search is safer for worst-case
//				performance.
//
//		Printing:
//		---------
//		There are two methods provided for printing the tree.  Print()
//		is used to produce a "human-readable" display of the tree, with
//		indentation, which is handy for debugging.  Dump() produces a
//		format that is suitable for reading by another program.  There
//		is a "load" constructor, which constructs a tree which is
//		assumed to have been saved by the Dump() procedure.
//
//		Performance and Structure Statistics:
//		-------------------------------------
//		The procedure getStats() collects statistics information on the
//		tree (its size, height, etc.)  See ANNperf.h for information on
//		the stats structure it returns.
//
//		Internal information:
//		---------------------
//		The data structure consists of three major chunks of storage.
//		The first (implicit) storage is the points themselves (pts),
//		which have been provided by the user as an argument to the
//		constructor, or are allocated dynamically if the tree is built
//		using the load constructor.  These should not be changed during
//		the lifetime of the search structure.  It is the user's
//		responsibility to delete these after the tree is destroyed.
//
//		The second is the tree itself (which is dynamically allocated in
//		the constructor) and is given as a pointer to its root node
//		(root).  These nodes are automatically deallocated when the tree
//		is deleted.  See the file src/kd_tree.h for further information
//		on the structure of the tree nodes.
//
//		Each leaf of the tree does not contain a pointer directly to a
//		point, but rather contains a pointer to a "bucket", which is an
//		array consisting of point indices.  The third major chunk of
//		storage is an array (pidx), which is a large array in which all
//		these bucket subarrays reside.  (The reason for storing them
//		separately is that buckets are typically small, but of varying
//		sizes.  This was done to avoid fragmentation.)  This array is
//		also deallocated when the tree is deleted.
//
//		In addition to this, the tree consists of a number of other
//		pieces of information which are used in searching and for
//		subsequent tree operations.  These consist of the following:
//
//		dim			Dimension of space
//		n_pts		Number of points currently in the tree
//		n_max		Maximum number of points that are allowed
//					in the tree
//		bkt_size	Maximum bucket size (no. of points per leaf)
//		bnd_box_lo	Bounding box low point
//		bnd_box_hi	Bounding box high point
//		splitRule	Splitting method used
//
//----------------------------------------------------------------------

//----------------------------------------------------------------------
// Some types and objects used by kd-tree functions
// See src/kd_tree.h and src/kd_tree.cpp for definitions
//----------------------------------------------------------------------
class ANNkdStats;				// stats on kd-tree
class ANNkd_node;				// generic node in a kd-tree
typedef ANNkd_node* ANNkd_ptr;	// pointer to a kd-tree node

class DLL_API ANNkd_tree: public ANNpointSet {
protected:
	int				dim;				// dimension of space
	int				n_pts;				// number of points in tree
	int				bkt_size;			// bucket size
	ANNpointArray	pts;				// the points
	ANNidxArray		pidx;				// point indices (to pts array)
	ANNkd_ptr		root;				// root of kd-tree
	ANNpoint		bnd_box_lo;			// bounding box low point
	ANNpoint		bnd_box_hi;			// bounding box high point

	void SkeletonTree(					// construct skeleton tree
		int				n,				// number of points
		int				dd,				// dimension
		int				bs,				// bucket size
		ANNpointArray	pa = NULL,		// point array (optional)
		ANNidxArray		pi = NULL);		// point indices (optional)

public:
	ANNkd_tree(							// build skeleton tree
		int				n = 0,			// number of points
		int				dd = 0,			// dimension
		int				bs = 1);		// bucket size

	ANNkd_tree(							// build from point array
		ANNpointArray	pa,				// point array
		int				n,				// number of points
		int				dd,				// dimension
		int				bs = 1,			// bucket size
		ANNsplitRule	split = ANN_KD_SUGGEST);	// splitting method

	ANNkd_tree(							// build from dump file
		std::istream&	in);			// input stream for dump file

	~ANNkd_tree();						// tree destructor

	void annkSearch(					// approx k near neighbor search
		ANNpoint		q,				// query point
		int				k,				// number of near neighbors to return
		ANNidxArray		nn_idx,			// nearest neighbor array (modified)
		ANNdistArray	dd,				// dist to near neighbors (modified)
		double			eps=0.0);		// error bound

	void annkPriSearch(					// priority k near neighbor search
		ANNpoint		q,				// query point
		int				k,				// number of near neighbors to return
		ANNidxArray		nn_idx,			// nearest neighbor array (modified)
		ANNdistArray	dd,				// dist to near neighbors (modified)
		double			eps=0.0);		// error bound

	int annkFRSearch(					// approx fixed-radius kNN search
		ANNpoint		q,				// the query point
		ANNdist			sqRad,			// squared radius of query ball
		int				k,				// number of neighbors to return
		ANNidxArray		nn_idx = NULL,	// nearest neighbor array (modified)
		ANNdistArray	dd = NULL,		// dist to near neighbors (modified)
		double			eps=0.0);		// error bound

	//MFH 7/15/2015
	std::pair< std::vector<int>, std::vector<double> >
	annkFRSearch2(						// approx fixed-radius kNN search
		ANNpoint		q,				// the query point
		ANNdist			sqRad,			// squared radius of query ball
		double			eps=0.0);		// error bound

	int theDim()						// return dimension of space
		{ return dim; }

	int nPoints()						// return number of points
		{ return n_pts; }

	ANNpointArray thePoints()			// return pointer to points
		{ return pts; }

	virtual void Print(					// print the tree (for debugging)
		ANNbool			with_pts,		// print points as well?
		std::ostream&	out);			// output stream

	virtual void Dump(					// dump entire tree
		ANNbool			with_pts,		// print points as well?
std::ostream& out); // output stream virtual void getStats( // compute tree statistics ANNkdStats& st); // the statistics (modified) }; //---------------------------------------------------------------------- // Box decomposition tree (bd-tree) // The bd-tree is inherited from a kd-tree. The main difference // in the bd-tree and the kd-tree is a new type of internal node // called a shrinking node (in the kd-tree there is only one type // of internal node, a splitting node). The shrinking node // makes it possible to generate balanced trees in which the // cells have bounded aspect ratio, by allowing the decomposition // to zoom in on regions of dense point concentration. Although // this is a nice idea in theory, few point distributions are so // densely clustered that this is really needed. //---------------------------------------------------------------------- class DLL_API ANNbd_tree: public ANNkd_tree { public: ANNbd_tree( // build skeleton tree int n, // number of points int dd, // dimension int bs = 1) // bucket size : ANNkd_tree(n, dd, bs) {} // build base kd-tree ANNbd_tree( // build from point array ANNpointArray pa, // point array int n, // number of points int dd, // dimension int bs = 1, // bucket size ANNsplitRule split = ANN_KD_SUGGEST, // splitting rule ANNshrinkRule shrink = ANN_BD_SUGGEST); // shrinking rule ANNbd_tree( // build from dump file std::istream& in); // input stream for dump file }; //---------------------------------------------------------------------- // Other functions // annMaxPtsVisit Sets a limit on the maximum number of points // to visit in the search. // annClose Can be called when all use of ANN is finished. // It clears up a minor memory leak. //---------------------------------------------------------------------- DLL_API void annMaxPtsVisit( // max. 
pts to visit in search int maxPts); // the limit DLL_API void annClose(); // called to end use of ANN #endif ================================================ FILE: src/ANN/ANNperf.h ================================================ //---------------------------------------------------------------------- // File: ANNperf.h // Programmer: Sunil Arya and David Mount // Last modified: 03/04/98 (Release 0.1) // Description: Include file for ANN performance stats // // Some of the code for statistics gathering has been adapted // from the SmplStat.h package in the g++ library. //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.0 04/01/05 // Added ANN_ prefix to avoid name conflicts. //---------------------------------------------------------------------- #ifndef ANNperf_H #define ANNperf_H //---------------------------------------------------------------------- // basic includes //---------------------------------------------------------------------- #include "ANN.h" // basic ANN includes //---------------------------------------------------------------------- // kd-tree stats object // This object is used for collecting information about a kd-tree // or bd-tree. 
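As an aside on the search interface declared in ANN.h above: `annkFRSearch` takes a *squared* radius and allows the output buffers `nn_idx`/`dd` to be NULL, so a first call with `k = 0` merely counts the points inside the ball, and a second call with `k` set to that count retrieves them. The following is a self-contained brute-force sketch of that two-pass contract; the `fixedRadiusSearch` helper and its plain-`std::vector` types are illustrative stand-ins, not ANN code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in mimicking annkFRSearch's contract: return the
// TOTAL number of points whose squared distance to q is <= sqRad, and
// fill at most k entries of nn_idx/dists when buffers are supplied.
static int fixedRadiusSearch(const std::vector<std::vector<double>>& pts,
                             const std::vector<double>& q, double sqRad,
                             int k, std::vector<int>* nn_idx = nullptr,
                             std::vector<double>* dists = nullptr) {
    int found = 0;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        double d2 = 0.0;
        for (std::size_t j = 0; j < q.size(); ++j) {
            double diff = pts[i][j] - q[j];
            d2 += diff * diff;              // squared Euclidean distance
        }
        if (d2 <= sqRad) {                  // inside the query ball
            if (found < k && nn_idx && dists) {
                nn_idx->push_back(static_cast<int>(i));
                dists->push_back(d2);       // record only the first k ...
            }
            ++found;                        // ... but always count all
        }
    }
    return found;
}
```

With ANN itself, the analogous pattern is to pass the squared radius (e.g. `eps * eps` for a radius-`eps` query) with `k = 0` to size the buffers, then call again with `k` set to the returned count.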
//----------------------------------------------------------------------

class ANNkdStats {			// stats on kd-tree
public:
	int		dim;			// dimension of space
	int		n_pts;			// no. of points
	int		bkt_size;		// bucket size
	int		n_lf;			// no. of leaves (including trivial)
	int		n_tl;			// no. of trivial leaves (no points)
	int		n_spl;			// no. of splitting nodes
	int		n_shr;			// no. of shrinking nodes (for bd-trees)
	int		depth;			// depth of tree
	float	sum_ar;			// sum of leaf aspect ratios
	float	avg_ar;			// average leaf aspect ratio

	void reset(int d=0, int n=0, int bs=0)	// reset stats
	{
		dim = d; n_pts = n; bkt_size = bs;
		n_lf = n_tl = n_spl = n_shr = depth = 0;
		sum_ar = avg_ar = 0.0;
	}

	ANNkdStats()			// basic constructor
	{ reset(); }

	void merge(const ANNkdStats &st);	// merge stats from child
};

//----------------------------------------------------------------------
//	ANNsampStat
//		A sample stat collects numeric (double) samples and returns some
//		simple statistics.  Its main functions are:
//
//		reset()		Reset to no samples.
//		+= x		Include sample x.
//		samples()	Return number of samples.
//		mean()		Return mean of samples.
//		stdDev()	Return standard deviation
//		min()		Return minimum of samples.
//		max()		Return maximum of samples.
//----------------------------------------------------------------------

class DLL_API ANNsampStat {
	int			n;				// number of samples
	double		sum;			// sum
	double		sum2;			// sum of squares
	double		minVal, maxVal;	// min and max
public:
	void reset()				// reset everything
	{
		n = 0;
		sum = sum2 = 0;
		minVal = ANN_DBL_MAX; maxVal = -ANN_DBL_MAX;
	}

	ANNsampStat() { reset(); }	// constructor

	void operator+=(double x)	// add sample
	{
		n++; sum += x; sum2 += x*x;
		if (x < minVal) minVal = x;
		if (x > maxVal) maxVal = x;
	}

	int samples() { return n; }			// number of samples

	double mean() { return sum/n; }		// mean

										// standard deviation
	double stdDev() { return std::sqrt((sum2 - (sum*sum)/n)/(n-1)); }

	double min() { return minVal; }		// minimum
	double max() { return maxVal; }		// maximum
};

//----------------------------------------------------------------------
//		Operation count updates
//----------------------------------------------------------------------
#ifdef ANN_PERF
#define ANN_FLOP(n)		{ann_Nfloat_ops += (n);}
#define ANN_LEAF(n)		{ann_Nvisit_lfs += (n);}
#define ANN_SPL(n)		{ann_Nvisit_spl += (n);}
#define ANN_SHR(n)		{ann_Nvisit_shr += (n);}
#define ANN_PTS(n)		{ann_Nvisit_pts += (n);}
#define ANN_COORD(n)	{ann_Ncoord_hts += (n);}
#else
#define ANN_FLOP(n)
#define ANN_LEAF(n)
#define ANN_SPL(n)
#define ANN_SHR(n)
#define ANN_PTS(n)
#define ANN_COORD(n)
#endif

//----------------------------------------------------------------------
//	Performance statistics
//	The following data and routines are used for computing performance
//	statistics for nearest neighbor searching.  Because these routines
//	can slow the code down, they can be activated and deactivated by
//	defining the ANN_PERF variable, by compiling with the option:
//	-DANN_PERF
//----------------------------------------------------------------------

//----------------------------------------------------------------------
//	Global counters for performance measurement
//
//	visit_lfs	The number of leaf nodes visited in the
//				tree.
//
//	visit_spl	The number of splitting nodes visited in the
//				tree.
//
//	visit_shr	The number of shrinking nodes visited in the
//				tree.
//
//	visit_pts	The number of points visited in all the
//				leaf nodes visited.  Equivalently, this
//				is the number of points for which distance
//				calculations are performed.
//
//	coord_hts	The number of times a coordinate of a
//				data point is accessed.  This is generally
//				less than visit_pts*d if partial distance
//				calculation is used.  This count is low
//				in the sense that if a coordinate is hit
//				many times in the same routine we may
//				count it only once.
//
//	float_ops	The number of floating point operations.
//				This includes all operations in the heap
//				as well as distance calculations to boxes.
//
//	average_err	The average error of each query (the
//				error of the reported point to the true
//				nearest neighbor).  For k nearest neighbors
//				the error is computed k times.
//
//	rank_err	The rank error of each query (the difference
//				in the rank of the reported point and its
//				true rank).
//
//	data_pts	The number of data points.  This is not
//				a counter, but used in stats computation.
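`ANNsampStat` above recovers the mean and the sample standard deviation purely from the running count `n`, sum, and sum of squares: mean = sum/n and stdDev = sqrt((sum2 - sum²/n)/(n-1)). A minimal self-contained sketch of that bookkeeping (the `SampStat` name and layout are made up for illustration; this is not the ANN class):

```cpp
#include <cassert>
#include <cmath>

// Illustrative re-implementation of the running-sum bookkeeping used by
// ANNsampStat: mean and sample standard deviation are recovered from
// n, sum, and sum of squares, without storing the individual samples.
struct SampStat {
    int n = 0;
    double sum = 0.0;
    double sum2 = 0.0;

    void add(double x) { ++n; sum += x; sum2 += x * x; }

    double mean() const { return sum / n; }

    double stdDev() const {                 // sample std dev (n-1 denominator)
        return std::sqrt((sum2 - (sum * sum) / n) / (n - 1));
    }
};
```

One caveat worth knowing as a general numerical fact: the sum-of-squares formula can suffer cancellation when the mean is large relative to the spread, which is acceptable here since these are coarse performance counters.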
//----------------------------------------------------------------------

extern int			ann_Ndata_pts;	// number of data points
extern int			ann_Nvisit_lfs;	// number of leaf nodes visited
extern int			ann_Nvisit_spl;	// number of splitting nodes visited
extern int			ann_Nvisit_shr;	// number of shrinking nodes visited
extern int			ann_Nvisit_pts;	// visited points for one query
extern int			ann_Ncoord_hts;	// coordinate hits for one query
extern int			ann_Nfloat_ops;	// floating ops for one query

extern ANNsampStat	ann_visit_lfs;	// stats on leaf nodes visits
extern ANNsampStat	ann_visit_spl;	// stats on splitting nodes visits
extern ANNsampStat	ann_visit_shr;	// stats on shrinking nodes visits
extern ANNsampStat	ann_visit_nds;	// stats on total nodes visits
extern ANNsampStat	ann_visit_pts;	// stats on points visited
extern ANNsampStat	ann_coord_hts;	// stats on coordinate hits
extern ANNsampStat	ann_float_ops;	// stats on floating ops

//----------------------------------------------------------------------
//	The following need to be part of the public interface, because
//	they are accessed outside the DLL in ann_test.cpp.
//----------------------------------------------------------------------

DLL_API extern ANNsampStat ann_average_err;	// average error
DLL_API extern ANNsampStat ann_rank_err;	// rank error

//----------------------------------------------------------------------
//	Declaration of externally accessible routines for statistics
//----------------------------------------------------------------------

DLL_API void annResetStats(int data_size);	// reset stats for a set of queries
DLL_API void annResetCounts();				// reset counts for one query
DLL_API void annUpdateStats();				// update stats with current counts
DLL_API void annPrintStats(ANNbool validate);	// print statistics for a run

#endif

================================================
FILE: src/ANN/ANNx.h
================================================
//----------------------------------------------------------------------
// File:			ANNx.h
// Programmer:		Sunil Arya and David Mount
// Last modified:	03/04/98 (Release 0.1)
// Description:		Internal include file for ANN
//
//	These declarations are of use in manipulating some of
//	the internal data objects appearing in ANN, but are not
//	needed for applications just using the nearest neighbor
//	search.
//
//	Typical users of ANN should not need to access this file.
//----------------------------------------------------------------------
// Copyright (c) 1997-2005 University of Maryland and Sunil Arya and
// David Mount.  All Rights Reserved.
//
// This software and related documentation is part of the Approximate
// Nearest Neighbor Library (ANN).  This software is provided under
// the provisions of the Lesser GNU Public License (LGPL).  See the
// file ../ReadMe.txt for further information.
//
// The University of Maryland (U.M.) and the authors make no
// representations about the suitability or fitness of this software for
// any purpose.  It is provided "as is" without express or implied
// warranty.
//----------------------------------------------------------------------
// History:
//	Revision 0.1  03/04/98
//		Initial release
//	Revision 1.0  04/01/05
//		Changed LO, HI, IN, OUT to ANN_LO, ANN_HI, etc.
//----------------------------------------------------------------------

#ifndef ANNx_H
#define ANNx_H

#include <iomanip>			// I/O manipulators
#include "ANN.h"			// ANN includes

//----------------------------------------------------------------------
//	Global constants and types
//----------------------------------------------------------------------

enum	{ANN_LO=0, ANN_HI=1};	// splitting indices
enum	{ANN_IN=0, ANN_OUT=1};	// shrinking indices

enum ANNerr {ANNwarn = 0, ANNabort = 1};	// what to do in case of error

//----------------------------------------------------------------------
//	Maximum number of points to visit
//	We have an option for terminating the search early if the
//	number of points visited exceeds some threshold.  If the
//	threshold is 0 (its default) this means there is no limit
//	and the algorithm applies its normal termination condition.
//----------------------------------------------------------------------

extern int	ANNmaxPtsVisited;	// maximum number of pts visited
extern int	ANNptsVisited;		// number of pts visited in search

//----------------------------------------------------------------------
//	Global function declarations
//----------------------------------------------------------------------

void annError(				// ANN error routine
	const char	*msg,		// error message
	ANNerr		level);		// level of error

void annPrintPt(			// print a point
	ANNpoint	pt,			// the point
	int			dim,		// the dimension
	std::ostream &out);		// output stream

//----------------------------------------------------------------------
//	Orthogonal (axis aligned) rectangle
//		Orthogonal rectangles are represented by two points, one
//		for the lower left corner (min coordinates) and the other
//		for the upper right corner (max coordinates).
//
//		The constructor initializes from either a pair of coordinates,
//		pair of points, or another rectangle.  Note that all constructors
//		allocate new point storage.  The destructor deallocates this
//		storage.
//
//		BEWARE: Orthogonal rectangles should be passed ONLY BY REFERENCE.
//		(C++'s default copy constructor will not allocate new point
//		storage, then on return the destructor free's storage, and then
//		you get into big trouble in the calling procedure.)
//----------------------------------------------------------------------

class ANNorthRect {
public:
	ANNpoint	lo;			// rectangle lower bounds
	ANNpoint	hi;			// rectangle upper bounds

	ANNorthRect(			// basic constructor
		int			dd,		// dimension of space
		ANNcoord	l=0,	// default is empty
		ANNcoord	h=0)
		{ lo = annAllocPt(dd, l); hi = annAllocPt(dd, h); }

	ANNorthRect(			// (almost a) copy constructor
		int			dd,		// dimension
		const ANNorthRect &r)	// rectangle to copy
		{ lo = annCopyPt(dd, r.lo); hi = annCopyPt(dd, r.hi); }

	ANNorthRect(			// construct from points
		int			dd,		// dimension
		ANNpoint	l,		// low point
		ANNpoint	h)		// high point
		{ lo = annCopyPt(dd, l); hi = annCopyPt(dd, h); }

	~ANNorthRect()			// destructor
		{ annDeallocPt(lo); annDeallocPt(hi); }

	ANNbool inside(int dim, ANNpoint p);	// is point p inside rectangle?
};

void annAssignRect(			// assign one rect to another
	int				dim,	// dimension (both must be same)
	ANNorthRect		&dest,	// destination (modified)
	const ANNorthRect &source);	// source

//----------------------------------------------------------------------
//	Orthogonal (axis aligned) halfspace
//	An orthogonal halfspace is represented by an integer cutting
//	dimension cd, coordinate cutting value, cv, and side, sd, which is
//	either +1 or -1.  Our convention is that point q lies in the (closed)
//	halfspace if (q[cd] - cv)*sd >= 0.
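The membership convention stated above, that point q lies in the closed halfspace iff (q[cd] - cv)*sd >= 0, can be exercised in a few lines of standalone code. Plain `std::vector<double>` stands in for `ANNpoint`, and the `HalfSpace` struct is illustrative only, not the ANN class:

```cpp
#include <cassert>
#include <vector>

// Sketch of the orthogonal-halfspace membership test described above.
struct HalfSpace {
    int cd;     // cutting dimension
    double cv;  // cutting value
    int sd;     // side: +1 keeps coordinates >= cv, -1 keeps <= cv

    // q lies in the CLOSED halfspace iff (q[cd] - cv) * sd >= 0,
    // so boundary points (q[cd] == cv) are always inside.
    bool in(const std::vector<double>& q) const {
        return (q[cd] - cv) * sd >= 0;
    }
};
```

The sign trick lets one struct represent both the upper and lower bound along a splitting dimension, which is how the class below uses it in `setLowerBound` (sd = +1) and `setUpperBound` (sd = -1).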
//----------------------------------------------------------------------

class ANNorthHalfSpace {
public:
	int			cd;			// cutting dimension
	ANNcoord	cv;			// cutting value
	int			sd;			// which side

	ANNorthHalfSpace()		// default constructor
		{ cd = 0; cv = 0; sd = 0; }

	ANNorthHalfSpace(		// basic constructor
		int			cdd,	// dimension of space
		ANNcoord	cvv,	// cutting value
		int			sdd)	// side
		{ cd = cdd; cv = cvv; sd = sdd; }

	ANNbool in(ANNpoint q) const	// is q inside halfspace?
		{ return (ANNbool) ((q[cd] - cv)*sd >= 0); }

	ANNbool out(ANNpoint q) const	// is q outside halfspace?
		{ return (ANNbool) ((q[cd] - cv)*sd < 0); }

	ANNdist dist(ANNpoint q) const	// (squared) distance from q
		{ return (ANNdist) ANN_POW(q[cd] - cv); }

	void setLowerBound(int d, ANNpoint p)	// set to lower bound at p[i]
		{ cd = d; cv = p[d]; sd = +1; }

	void setUpperBound(int d, ANNpoint p)	// set to upper bound at p[i]
		{ cd = d; cv = p[d]; sd = -1; }

	void project(ANNpoint &q)	// project q (modified) onto halfspace
		{ if (out(q)) q[cd] = cv; }
};

typedef ANNorthHalfSpace *ANNorthHSArray;	// array of halfspaces

#endif

================================================
FILE: src/ANN/Copyright.txt
================================================
ANN: Approximate Nearest Neighbors
Version: 1.1
Release Date: May 3, 2005
----------------------------------------------------------------------------
Copyright (c) 1997-2005 University of Maryland and Sunil Arya and David Mount
All Rights Reserved.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser Public License as published by
the Free Software Foundation; either version 2.1 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
Lesser Public License for more details.
A copy of the terms and conditions of the license can be found in
License.txt or online at

    http://www.gnu.org/copyleft/lesser.html

To obtain a copy, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Disclaimer
----------
The University of Maryland and the authors make no representations
about the suitability or fitness of this software for any purpose.
It is provided "as is" without express or implied warranty.
---------------------------------------------------------------------

Authors
-------
David Mount
Dept of Computer Science
University of Maryland, College Park, MD 20742 USA
mount@cs.umd.edu
http://www.cs.umd.edu/~mount/

Sunil Arya
Dept of Computer Science
Hong Kong University of Science and Technology
Clearwater Bay, HONG KONG
arya@cs.ust.hk
http://www.cs.ust.hk/faculty/arya/

================================================
FILE: src/ANN/License.txt
================================================
----------------------------------------------------------------------
The ANN Library (all versions) is provided under the terms and
conditions of the GNU Lesser General Public License, which is stated
below.  It can also be found at:

    http://www.gnu.org/copyleft/lesser.html
----------------------------------------------------------------------

                  GNU LESSER GENERAL PUBLIC LICENSE
                       Version 2.1, February 1999

 Copyright (C) 1991, 1999 Free Software Foundation, Inc.
 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

[This is the first released version of the Lesser GPL.  It also counts
 as the successor of the GNU Library Public License, version 2, hence
 the version number 2.1.]

                            Preamble

  The licenses for most software are designed to take away your
freedom to share and change it.
By contrast, the GNU General Public Licenses are intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This license, the Lesser General Public License, applies to some specially designated software packages--typically libraries--of the Free Software Foundation and other authors who decide to use it. You can use it too, but we suggest you first think carefully about whether this license or the ordinary General Public License is the better strategy to use in any particular case, based on the explanations below. When we speak of free software, we are referring to freedom of use, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish); that you receive source code or can get it if you want it; that you can change the software and use pieces of it in new free programs; and that you are informed that you can do these things. To protect your rights, we need to make restrictions that forbid distributors to deny you these rights or to ask you to surrender these rights. These restrictions translate to certain responsibilities for you if you distribute copies of the library or if you modify it. For example, if you distribute copies of the library, whether gratis or for a fee, you must give the recipients all the rights that we gave you. You must make sure that they, too, receive or can get the source code. If you link other code with the library, you must provide complete object files to the recipients, so that they can relink them with the library after making changes to the library and recompiling it. And you must show them these terms so they know their rights. We protect your rights with a two-step method: (1) we copyright the library, and (2) we offer you this license, which gives you legal permission to copy, distribute and/or modify the library. 
To protect each distributor, we want to make it very clear that there is no warranty for the free library. Also, if the library is modified by someone else and passed on, the recipients should know that what they have is not the original version, so that the original author's reputation will not be affected by problems that might be introduced by others. Finally, software patents pose a constant threat to the existence of any free program. We wish to make sure that a company cannot effectively restrict the users of a free program by obtaining a restrictive license from a patent holder. Therefore, we insist that any patent license obtained for a version of the library must be consistent with the full freedom of use specified in this license. Most GNU software, including some libraries, is covered by the ordinary GNU General Public License. This license, the GNU Lesser General Public License, applies to certain designated libraries, and is quite different from the ordinary General Public License. We use this license for certain libraries in order to permit linking those libraries into non-free programs. When a program is linked with a library, whether statically or using a shared library, the combination of the two is legally speaking a combined work, a derivative of the original library. The ordinary General Public License therefore permits such linking only if the entire combination fits its criteria of freedom. The Lesser General Public License permits more lax criteria for linking other code with the library. We call this license the "Lesser" General Public License because it does Less to protect the user's freedom than the ordinary General Public License. It also provides other free software developers Less of an advantage over competing non-free programs. These disadvantages are the reason we use the ordinary General Public License for many libraries. However, the Lesser license provides advantages in certain special circumstances. 
  For example, on rare occasions, there may be a special need to
encourage the widest possible use of a certain library, so that it
becomes a de-facto standard.  To achieve this, non-free programs must
be allowed to use the library.  A more frequent case is that a free
library does the same job as widely used non-free libraries.  In this
case, there is little to gain by limiting the free library to free
software only, so we use the Lesser General Public License.

  In other cases, permission to use a particular library in non-free
programs enables a greater number of people to use a large body of
free software.  For example, permission to use the GNU C Library in
non-free programs enables many more people to use the whole GNU
operating system, as well as its variant, the GNU/Linux operating
system.

  Although the Lesser General Public License is Less protective of the
users' freedom, it does ensure that the user of a program that is
linked with the Library has the freedom and the wherewithal to run
that program using a modified version of the Library.

  The precise terms and conditions for copying, distribution and
modification follow.  Pay close attention to the difference between a
"work based on the library" and a "work that uses the library".  The
former contains code derived from the library, whereas the latter must
be combined with the library in order to run.

  TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

  0. This License Agreement applies to any software library or other
program which contains a notice placed by the copyright holder or
other authorized party saying it may be distributed under the terms of
this Lesser General Public License (also called "this License").  Each
licensee is addressed as "you".

  A "library" means a collection of software functions and/or data
prepared so as to be conveniently linked with application programs
(which use some of those functions and data) to form executables.
The "Library", below, refers to any such software library or work which has been distributed under these terms. A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".) "Source code" for a work means the preferred form of the work for making modifications to it. For a library, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the library. Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running a program using the Library is not restricted, and output from such a program is covered only if its contents constitute a work based on the Library (independent of the use of the Library in a tool for writing it). Whether that is true depends on what the Library does and what the program that uses the Library does. 1. You may copy and distribute verbatim copies of the Library's complete source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and distribute a copy of this License along with the Library. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. 
You may modify your copy or copies of the Library or any portion of it, thus forming a work based on the Library, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) The modified work must itself be a software library. b) You must cause the files modified to carry prominent notices stating that you changed the files and the date of any change. c) You must cause the whole of the work to be licensed at no charge to all third parties under the terms of this License. d) If a facility in the modified Library refers to a function or a table of data to be supplied by an application program that uses the facility, other than as an argument passed when the facility is invoked, then you must make a good faith effort to ensure that, in the event an application does not supply such function or table, the facility still operates, and performs whatever part of its purpose remains meaningful. (For example, a function in a library to compute square roots has a purpose that is entirely well-defined independent of the application. Therefore, Subsection 2d requires that any application-supplied function or table used by this function must be optional: if the application does not supply it, the square root function must still compute square roots.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Library, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Library, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. 
Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Library. In addition, mere aggregation of another work not based on the Library with the Library (or with a work based on the Library) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may opt to apply the terms of the ordinary GNU General Public License instead of this License to a given copy of the Library. To do this, you must alter all the notices that refer to this License, so that they refer to the ordinary GNU General Public License, version 2, instead of to this License. (If a newer version than version 2 of the ordinary GNU General Public License has appeared, then you can specify that version instead if you wish.) Do not make any other change in these notices. Once this change is made in a given copy, it is irreversible for that copy, so the ordinary GNU General Public License applies to all subsequent copies and derivative works made from that copy. This option is useful when you wish to copy part of the code of the Library into a program that is not a library. 4. You may copy and distribute the Library (or a portion or derivative of it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange. If distribution of object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place satisfies the requirement to distribute the source code, even though third parties are not compelled to copy the source along with the object code. 5. 
A program that contains no derivative of any portion of the Library, but is designed to work with the Library by being compiled or linked with it, is called a "work that uses the Library". Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the scope of this License. However, linking a "work that uses the Library" with the Library creates an executable that is a derivative of the Library (because it contains portions of the Library), rather than a "work that uses the library". The executable is therefore covered by this License. Section 6 states terms for distribution of such executables. When a "work that uses the Library" uses material from a header file that is part of the Library, the object code for the work may be a derivative work of the Library even though the source code is not. Whether this is true is especially significant if the work can be linked without the Library, or if the work is itself a library. The threshold for this to be true is not precisely defined by law. If such an object file uses only numerical parameters, data structure layouts and accessors, and small macros and small inline functions (ten lines or less in length), then the use of the object file is unrestricted, regardless of whether it is legally a derivative work. (Executables containing this object code plus portions of the Library will still fall under Section 6.) Otherwise, if the work is a derivative of the Library, you may distribute the object code for the work under the terms of Section 6. Any executables containing that work also fall under Section 6, whether or not they are linked directly with the Library itself. 6. 
As an exception to the Sections above, you may also combine or link a "work that uses the Library" with the Library to produce a work containing portions of the Library, and distribute that work under terms of your choice, provided that the terms permit modification of the work for the customer's own use and reverse engineering for debugging such modifications. You must give prominent notice with each copy of the work that the Library is used in it and that the Library and its use are covered by this License. You must supply a copy of this License. If the work during execution displays copyright notices, you must include the copyright notice for the Library among them, as well as a reference directing the user to the copy of this License. Also, you must do one of these things: a) Accompany the work with the complete corresponding machine-readable source code for the Library including whatever changes were used in the work (which must be distributed under Sections 1 and 2 above); and, if the work is an executable linked with the Library, with the complete machine-readable "work that uses the Library", as object code and/or source code, so that the user can modify the Library and then relink to produce a modified executable containing the modified Library. (It is understood that the user who changes the contents of definitions files in the Library will not necessarily be able to recompile the application to use the modified definitions.) b) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (1) uses at run time a copy of the library already present on the user's computer system, rather than copying library functions into the executable, and (2) will operate properly with a modified version of the library, if the user installs one, as long as the modified version is interface-compatible with the version that the work was made with. 
c) Accompany the work with a written offer, valid for at least three years, to give the same user the materials specified in Subsection 6a, above, for a charge no more than the cost of performing this distribution. d) If distribution of the work is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place. e) Verify that the user has already received a copy of these materials or that you have already sent this user a copy. For an executable, the required form of the "work that uses the Library" must include any data and utility programs needed for reproducing the executable from it. However, as a special exception, the materials to be distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. It may happen that this requirement contradicts the license restrictions of other proprietary libraries that do not normally accompany the operating system. Such a contradiction means you cannot use both them and the Library together in an executable that you distribute. 7. You may place library facilities that are a work based on the Library side-by-side in a single library together with other library facilities not covered by this License, and distribute such a combined library, provided that the separate distribution of the work based on the Library and of the other library facilities is otherwise permitted, and provided that you do these two things: a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities. This must be distributed under the terms of the Sections above. 
b) Give prominent notice with the combined library of the fact that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. 8. You may not copy, modify, sublicense, link with, or distribute the Library except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, link with, or distribute the Library is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 9. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Library or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Library (or any work based on the Library), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Library or works based on it. 10. Each time you redistribute the Library (or any work based on the Library), the recipient automatically receives a license from the original licensor to copy, distribute, link with or modify the Library subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties with this License. 11. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. 
If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Library at all. For example, if a patent license would not permit royalty-free redistribution of the Library by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Library. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply, and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 12. If the distribution and/or use of the Library is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Library under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 13. 
The Free Software Foundation may publish revised and/or new versions of the Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Library specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Library does not specify a license version number, you may choose any version ever published by the Free Software Foundation. 14. If you wish to incorporate parts of the Library into other free programs whose distribution conditions are incompatible with these, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ================================================ FILE: src/ANN/ReadMe.txt ================================================ ANN: Approximate Nearest Neighbors Version: 1.1 Release date: May 3, 2005 ---------------------------------------------------------------------------- Copyright (c) 1997-2005 University of Maryland and Sunil Arya and David Mount. All Rights Reserved. See Copyright.txt and License.txt for complete information on terms and conditions of use and distribution of this software. ---------------------------------------------------------------------------- Authors ------- David Mount Dept of Computer Science University of Maryland, College Park, MD 20742 USA mount@cs.umd.edu http://www.cs.umd.edu/~mount/ Sunil Arya Dept of Computer Science Hong Kong University of Science and Technology Clearwater Bay, HONG KONG arya@cs.ust.hk http://www.cs.ust.hk/faculty/arya/ Introduction ------------ ANN is a library written in the C++ programming language to support both exact and approximate nearest neighbor searching in spaces of various dimensions. It was implemented by David M. Mount of the University of Maryland, and Sunil Arya of the Hong Kong University of Science and Technology. ANN (pronounced like the name ``Ann'') stands for Approximate Nearest Neighbors.
ANN is also a testbed containing programs and procedures for generating data sets, collecting and analyzing statistics on the performance of nearest neighbor algorithms and data structures, and visualizing the geometric structure of these data structures. The ANN source code and documentation is available from the following web page: http://www.cs.umd.edu/~mount/ANN For more information on ANN and its use, see the ``ANN Programming Manual,'' which is provided with the software distribution. ---------------------------------------------------------------------------- History Version 0.1 03/04/98 Preliminary release Version 0.2 06/24/98 Changes for SGI compiler. Version 1.0 04/01/05 Fixed a number of small bugs Added dump/load operations Added annClose to eliminate minor memory leak Improved compatibility with current C++ compilers Added compilation for Microsoft Visual Studio.NET Added compilation for Linux 2.x Version 1.1 05/03/05 Added make target for Mac OS X Added fixed-radius range searching and counting Added instructions on compiling/using ANN on Windows platforms Fixed minor output bug in ann2fig ================================================ FILE: src/ANN/bd_fix_rad_search.cpp ================================================ //---------------------------------------------------------------------- // File: bd_fix_rad_search.cpp // Programmer: David Mount // Description: Standard bd-tree search // Last modified: 05/03/05 (Version 1.1) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) 
and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 1.1 05/03/05 // Initial release //---------------------------------------------------------------------- #include "bd_tree.h" // bd-tree declarations #include "kd_fix_rad_search.h" // kd-tree FR search declarations //---------------------------------------------------------------------- // Approximate searching for bd-trees. // See the file kd_FR_search.cpp for general information on the // approximate nearest neighbor search algorithm. Here we // include the extensions for shrinking nodes. //---------------------------------------------------------------------- //---------------------------------------------------------------------- // bd_shrink::ann_FR_search - search a shrinking node //---------------------------------------------------------------------- void ANNbd_shrink::ann_FR_search(ANNdist box_dist) { // check dist calc term cond. if (ANNmaxPtsVisited != 0 && ANNptsVisited > ANNmaxPtsVisited) return; ANNdist inner_dist = 0; // distance to inner box for (int i = 0; i < n_bnds; i++) { // is query point in the box? if (bnds[i].out(ANNkdFRQ)) { // outside this bounding side? 
// add to inner distance inner_dist = (ANNdist) ANN_SUM(inner_dist, bnds[i].dist(ANNkdFRQ)); } } if (inner_dist <= box_dist) { // if inner box is closer child[ANN_IN]->ann_FR_search(inner_dist);// search inner child first child[ANN_OUT]->ann_FR_search(box_dist);// ...then outer child } else { // if outer box is closer child[ANN_OUT]->ann_FR_search(box_dist);// search outer child first child[ANN_IN]->ann_FR_search(inner_dist);// ...then inner child } ANN_FLOP(3*n_bnds) // increment floating ops ANN_SHR(1) // one more shrinking node } ================================================ FILE: src/ANN/bd_pr_search.cpp ================================================ //---------------------------------------------------------------------- // File: bd_pr_search.cpp // Programmer: David Mount // Description: Priority search for bd-trees // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- //History: // Revision 0.1 03/04/98 // Initial release //---------------------------------------------------------------------- #include "bd_tree.h" // bd-tree declarations #include "kd_pr_search.h" // kd priority search declarations //---------------------------------------------------------------------- // Approximate priority searching for bd-trees.
// See the file kd_pr_search.cc for general information on the // approximate nearest neighbor priority search algorithm. Here // we include the extensions for shrinking nodes. //---------------------------------------------------------------------- //---------------------------------------------------------------------- // bd_shrink::ann_pri_search - search a shrinking node //---------------------------------------------------------------------- void ANNbd_shrink::ann_pri_search(ANNdist box_dist) { ANNdist inner_dist = 0; // distance to inner box for (int i = 0; i < n_bnds; i++) { // is query point in the box? if (bnds[i].out(ANNprQ)) { // outside this bounding side? // add to inner distance inner_dist = (ANNdist) ANN_SUM(inner_dist, bnds[i].dist(ANNprQ)); } } if (inner_dist <= box_dist) { // if inner box is closer if (child[ANN_OUT] != KD_TRIVIAL) // enqueue outer if not trivial ANNprBoxPQ->insert(box_dist,child[ANN_OUT]); // continue with inner child child[ANN_IN]->ann_pri_search(inner_dist); } else { // if outer box is closer if (child[ANN_IN] != KD_TRIVIAL) // enqueue inner if not trivial ANNprBoxPQ->insert(inner_dist,child[ANN_IN]); // continue with outer child child[ANN_OUT]->ann_pri_search(box_dist); } ANN_FLOP(3*n_bnds) // increment floating ops ANN_SHR(1) // one more shrinking node } ================================================ FILE: src/ANN/bd_search.cpp ================================================ //---------------------------------------------------------------------- // File: bd_search.cpp // Programmer: David Mount // Description: Standard bd-tree search // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN).
This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release //---------------------------------------------------------------------- #include "bd_tree.h" // bd-tree declarations #include "kd_search.h" // kd-tree search declarations //---------------------------------------------------------------------- // Approximate searching for bd-trees. // See the file kd_search.cpp for general information on the // approximate nearest neighbor search algorithm. Here we // include the extensions for shrinking nodes. //---------------------------------------------------------------------- //---------------------------------------------------------------------- // bd_shrink::ann_search - search a shrinking node //---------------------------------------------------------------------- void ANNbd_shrink::ann_search(ANNdist box_dist) { // check dist calc term cond. if (ANNmaxPtsVisited != 0 && ANNptsVisited > ANNmaxPtsVisited) return; ANNdist inner_dist = 0; // distance to inner box for (int i = 0; i < n_bnds; i++) { // is query point in the box? if (bnds[i].out(ANNkdQ)) { // outside this bounding side? 
// add to inner distance inner_dist = (ANNdist) ANN_SUM(inner_dist, bnds[i].dist(ANNkdQ)); } } if (inner_dist <= box_dist) { // if inner box is closer child[ANN_IN]->ann_search(inner_dist); // search inner child first child[ANN_OUT]->ann_search(box_dist); // ...then outer child } else { // if outer box is closer child[ANN_OUT]->ann_search(box_dist); // search outer child first child[ANN_IN]->ann_search(inner_dist); // ...then inner child } ANN_FLOP(3*n_bnds) // increment floating ops ANN_SHR(1) // one more shrinking node } ================================================ FILE: src/ANN/bd_tree.cpp ================================================ //---------------------------------------------------------------------- // File: bd_tree.cpp // Programmer: David Mount // Description: Basic methods for bd-trees. // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.0 04/01/05 // Fixed centroid shrink threshold condition to depend on the // dimension. // Moved dump routine to kd_dump.cpp.
//---------------------------------------------------------------------- #include "bd_tree.h" // bd-tree declarations #include "kd_util.h" // kd-tree utilities #include "kd_split.h" // kd-tree splitting rules #include "ANNperf.h" // performance evaluation //---------------------------------------------------------------------- // Printing a bd-tree // These routines print a bd-tree. See the analogous procedure // in kd_tree.cpp for more information. //---------------------------------------------------------------------- void ANNbd_shrink::print( // print shrinking node int level, // depth of node in tree ostream &out) // output stream { child[ANN_OUT]->print(level+1, out); // print out-child out << " "; for (int i = 0; i < level; i++) // print indentation out << ".."; out << "Shrink"; for (int j = 0; j < n_bnds; j++) { // print sides, 2 per line if (j % 2 == 0) { out << "\n"; // newline and indentation for (int i = 0; i < level+2; i++) out << " "; } out << " ([" << bnds[j].cd << "]" << (bnds[j].sd > 0 ? ">=" : "< ") << bnds[j].cv << ")"; } out << "\n"; child[ANN_IN]->print(level+1, out); // print in-child } //---------------------------------------------------------------------- // kd_tree statistics utility (for performance evaluation) // This routine computes various statistics information for // shrinking nodes. See file kd_tree.cpp for more information. 
//---------------------------------------------------------------------- void ANNbd_shrink::getStats( // get subtree statistics int dim, // dimension of space ANNkdStats &st, // stats (modified) ANNorthRect &bnd_box) // bounding box { ANNkdStats ch_stats; // stats for children ANNorthRect inner_box(dim); // inner box of shrink annBnds2Box(bnd_box, // enclosing box dim, // dimension n_bnds, // number of bounds bnds, // bounds array inner_box); // inner box (modified) // get stats for inner child ch_stats.reset(); // reset child[ANN_IN]->getStats(dim, ch_stats, inner_box); st.merge(ch_stats); // merge them // get stats for outer child ch_stats.reset(); // reset child[ANN_OUT]->getStats(dim, ch_stats, bnd_box); st.merge(ch_stats); // merge them st.depth++; // increment depth st.n_shr++; // increment number of shrinks } //---------------------------------------------------------------------- // bd-tree constructor // This is the main constructor for bd-trees given a set of points. // It first builds a skeleton kd-tree as a basis, then computes the // bounding box of the data points, and then invokes rbd_tree() to // actually build the tree, passing it the appropriate splitting // and shrinking information. 
//---------------------------------------------------------------------- ANNkd_ptr rbd_tree( // recursive construction of bd-tree ANNpointArray pa, // point array ANNidxArray pidx, // point indices to store in subtree int n, // number of points int dim, // dimension of space int bsp, // bucket space ANNorthRect &bnd_box, // bounding box for current node ANNkd_splitter splitter, // splitting routine ANNshrinkRule shrink); // shrinking rule ANNbd_tree::ANNbd_tree( // construct from point array ANNpointArray pa, // point array (with at least n pts) int n, // number of points int dd, // dimension int bs, // bucket size ANNsplitRule split, // splitting rule ANNshrinkRule shrink) // shrinking rule : ANNkd_tree(n, dd, bs) // build skeleton base tree { pts = pa; // where the points are if (n == 0) return; // no points--no sweat ANNorthRect bnd_box(dd); // bounding box for points // construct bounding rectangle annEnclRect(pa, pidx, n, dd, bnd_box); // copy to tree structure bnd_box_lo = annCopyPt(dd, bnd_box.lo); bnd_box_hi = annCopyPt(dd, bnd_box.hi); switch (split) { // build by rule case ANN_KD_STD: // standard kd-splitting rule root = rbd_tree(pa, pidx, n, dd, bs, bnd_box, kd_split, shrink); break; case ANN_KD_MIDPT: // midpoint split root = rbd_tree(pa, pidx, n, dd, bs, bnd_box, midpt_split, shrink); break; case ANN_KD_SUGGEST: // best (in our opinion) case ANN_KD_SL_MIDPT: // sliding midpoint split root = rbd_tree(pa, pidx, n, dd, bs, bnd_box, sl_midpt_split, shrink); break; case ANN_KD_FAIR: // fair split root = rbd_tree(pa, pidx, n, dd, bs, bnd_box, fair_split, shrink); break; case ANN_KD_SL_FAIR: // sliding fair split root = rbd_tree(pa, pidx, n, dd, bs, bnd_box, sl_fair_split, shrink); break; default: annError("Illegal splitting method", ANNabort); } } //---------------------------------------------------------------------- // Shrinking rules //---------------------------------------------------------------------- enum ANNdecomp {SPLIT, SHRINK}; // decomposition 
methods //---------------------------------------------------------------------- // trySimpleShrink - Attempt a simple shrink // // We compute the tight bounding box of the points, and compute // the 2*dim ``gaps'' between the sides of the tight box and the // bounding box. If any of the gaps is large enough relative to // the longest side of the tight bounding box, then we shrink // all sides whose gaps are large enough. (The reason for // comparing against the tight bounding box, is that after // shrinking the longest box size will decrease, and if we use // the standard bounding box, we may decide to shrink twice in // a row. Since the tight box is fixed, we cannot shrink twice // consecutively.) //---------------------------------------------------------------------- const float BD_GAP_THRESH = 0.5; // gap threshold (must be < 1) const int BD_CT_THRESH = 2; // min number of shrink sides ANNdecomp trySimpleShrink( // try a simple shrink ANNpointArray pa, // point array ANNidxArray pidx, // point indices to store in subtree int n, // number of points int dim, // dimension of space const ANNorthRect &bnd_box, // current bounding box ANNorthRect &inner_box) // inner box if shrinking (returned) { int i; // compute tight bounding box annEnclRect(pa, pidx, n, dim, inner_box); ANNcoord max_length = 0; // find longest box side for (i = 0; i < dim; i++) { ANNcoord length = inner_box.hi[i] - inner_box.lo[i]; if (length > max_length) { max_length = length; } } int shrink_ct = 0; // number of sides we shrunk for (i = 0; i < dim; i++) { // select which sides to shrink // gap between boxes ANNcoord gap_hi = bnd_box.hi[i] - inner_box.hi[i]; // big enough gap to shrink? 
if (gap_hi < max_length*BD_GAP_THRESH) inner_box.hi[i] = bnd_box.hi[i]; // no - expand else shrink_ct++; // yes - shrink this side // repeat for low side ANNcoord gap_lo = inner_box.lo[i] - bnd_box.lo[i]; if (gap_lo < max_length*BD_GAP_THRESH) inner_box.lo[i] = bnd_box.lo[i]; // no - expand else shrink_ct++; // yes - shrink this side } if (shrink_ct >= BD_CT_THRESH) // did we shrink enough sides? return SHRINK; else return SPLIT; } //---------------------------------------------------------------------- // tryCentroidShrink - Attempt a centroid shrink // // We repeatedly apply the splitting rule, always to the larger subset // of points, until the number of points decreases by the constant // fraction BD_FRACTION. If this takes more than dim*BD_MAX_SPLIT_FAC // splits for this to happen, then we shrink to the final inner box. // Otherwise we split. //---------------------------------------------------------------------- const float BD_MAX_SPLIT_FAC = 0.5; // maximum number of splits allowed const float BD_FRACTION = 0.5; // ...to reduce points by this fraction // ...This must be < 1.
ANNdecomp tryCentroidShrink( // try a centroid shrink ANNpointArray pa, // point array ANNidxArray pidx, // point indices to store in subtree int n, // number of points int dim, // dimension of space const ANNorthRect &bnd_box, // current bounding box ANNkd_splitter splitter, // splitting procedure ANNorthRect &inner_box) // inner box if shrinking (returned) { int n_sub = n; // number of points in subset int n_goal = (int) (n*BD_FRACTION); // number of points in goal int n_splits = 0; // number of splits needed // initialize inner box to bounding box annAssignRect(dim, inner_box, bnd_box); while (n_sub > n_goal) { // keep splitting until goal reached int cd; // cut dim from splitter (ignored) ANNcoord cv; // cut value from splitter (ignored) int n_lo; // number of points on low side // invoke splitting procedure (*splitter)(pa, pidx, inner_box, n_sub, dim, cd, cv, n_lo); n_splits++; // increment split count if (n_lo >= n_sub/2) { // most points on low side inner_box.hi[cd] = cv; // collapse high side n_sub = n_lo; // recurse on lower points } else { // most points on high side inner_box.lo[cd] = cv; // collapse low side pidx += n_lo; // recurse on higher points n_sub -= n_lo; } } if (n_splits > dim*BD_MAX_SPLIT_FAC)// took too many splits return SHRINK; // shrink to final subset else return SPLIT; } //---------------------------------------------------------------------- // selectDecomp - select which decomposition to use //---------------------------------------------------------------------- ANNdecomp selectDecomp( // select decomposition method ANNpointArray pa, // point array ANNidxArray pidx, // point indices to store in subtree int n, // number of points int dim, // dimension of space const ANNorthRect &bnd_box, // current bounding box ANNkd_splitter splitter, // splitting procedure ANNshrinkRule shrink, // shrinking rule ANNorthRect &inner_box) // inner box if shrinking (returned) { ANNdecomp decomp = SPLIT; // decomposition switch (shrink) { // check
shrinking rule case ANN_BD_NONE: // no shrinking allowed decomp = SPLIT; break; case ANN_BD_SUGGEST: // author's suggestion case ANN_BD_SIMPLE: // simple shrink decomp = trySimpleShrink( pa, pidx, // points and indices n, dim, // number of points and dimension bnd_box, // current bounding box inner_box); // inner box if shrinking (returned) break; case ANN_BD_CENTROID: // centroid shrink decomp = tryCentroidShrink( pa, pidx, // points and indices n, dim, // number of points and dimension bnd_box, // current bounding box splitter, // splitting procedure inner_box); // inner box if shrinking (returned) break; default: annError("Illegal shrinking rule", ANNabort); } return decomp; } //---------------------------------------------------------------------- // rbd_tree - recursive procedure to build a bd-tree // // This is analogous to rkd_tree, but for bd-trees. See the // procedure rkd_tree() in kd_split.cpp for more information. // // If the number of points falls below the bucket size, then a // leaf node is created for the points. Otherwise we invoke the // procedure selectDecomp() which determines whether we are to // split or shrink. If splitting is chosen, then we essentially // do exactly as rkd_tree() would, and invoke the specified // splitting procedure to the points. Otherwise, the selection // procedure returns a bounding box, from which we extract the // appropriate shrinking bounds, and create a shrinking node. // Finally the points are subdivided, and the procedure is // invoked recursively on the two subsets to form the children. 
//---------------------------------------------------------------------- ANNkd_ptr rbd_tree( // recursive construction of bd-tree ANNpointArray pa, // point array ANNidxArray pidx, // point indices to store in subtree int n, // number of points int dim, // dimension of space int bsp, // bucket space ANNorthRect &bnd_box, // bounding box for current node ANNkd_splitter splitter, // splitting routine ANNshrinkRule shrink) // shrinking rule { ANNdecomp decomp; // decomposition method ANNorthRect inner_box(dim); // inner box (if shrinking) if (n <= bsp) { // n small, make a leaf node if (n == 0) // empty leaf node return KD_TRIVIAL; // return (canonical) empty leaf else // construct the node and return return new ANNkd_leaf(n, pidx); } decomp = selectDecomp( // select decomposition method pa, pidx, // points and indices n, dim, // number of points and dimension bnd_box, // current bounding box splitter, shrink, // splitting/shrinking methods inner_box); // inner box if shrinking (returned) if (decomp == SPLIT) { // split selected int cd; // cutting dimension ANNcoord cv; // cutting value int n_lo; // number on low side of cut // invoke splitting procedure (*splitter)(pa, pidx, bnd_box, n, dim, cd, cv, n_lo); ANNcoord lv = bnd_box.lo[cd]; // save bounds for cutting dimension ANNcoord hv = bnd_box.hi[cd]; bnd_box.hi[cd] = cv; // modify bounds for left subtree ANNkd_ptr lo = rbd_tree( // build left subtree pa, pidx, n_lo, // ...from pidx[0..n_lo-1] dim, bsp, bnd_box, splitter, shrink); bnd_box.hi[cd] = hv; // restore bounds bnd_box.lo[cd] = cv; // modify bounds for right subtree ANNkd_ptr hi = rbd_tree( // build right subtree pa, pidx + n_lo, n-n_lo,// ...from pidx[n_lo..n-1] dim, bsp, bnd_box, splitter, shrink); bnd_box.lo[cd] = lv; // restore bounds // create the splitting node return new ANNkd_split(cd, cv, lv, hv, lo, hi); } else { // shrink selected int n_in; // number of points in box int n_bnds; // number of bounding sides annBoxSplit( // split points around inner 
box pa, // points to split pidx, // point indices n, // number of points dim, // dimension inner_box, // inner box n_in); // number of points inside (returned) ANNkd_ptr in = rbd_tree( // build inner subtree pidx[0..n_in-1] pa, pidx, n_in, dim, bsp, inner_box, splitter, shrink); ANNkd_ptr out = rbd_tree( // build outer subtree pidx[n_in..n] pa, pidx+n_in, n - n_in, dim, bsp, bnd_box, splitter, shrink); ANNorthHSArray bnds = NULL; // bounds (alloc in Box2Bnds and // ...freed in bd_shrink destroyer) annBox2Bnds( // convert inner box to bounds inner_box, // inner box bnd_box, // enclosing box dim, // dimension n_bnds, // number of bounds (returned) bnds); // bounds array (modified) // return shrinking node return new ANNbd_shrink(n_bnds, bnds, in, out); } } ================================================ FILE: src/ANN/bd_tree.h ================================================ //---------------------------------------------------------------------- // File: bd_tree.h // Programmer: David Mount // Description: Declarations for standard bd-tree routines // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. 
//---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.0 04/01/05 // Changed IN, OUT to ANN_IN, ANN_OUT //---------------------------------------------------------------------- #ifndef ANN_bd_tree_H #define ANN_bd_tree_H #include "ANNx.h" // all ANN includes #include "kd_tree.h" // kd-tree includes //---------------------------------------------------------------------- // bd-tree shrinking node. // The main addition in the bd-tree is the shrinking node, which // is declared here. // // Shrinking nodes are defined by list of orthogonal halfspaces. // These halfspaces define a (possibly unbounded) orthogonal // rectangle. There are two children, in and out. Points that // lie within this rectangle are stored in the in-child, and the // other points are stored in the out-child. // // We use a list of orthogonal halfspaces rather than an // orthogonal rectangle object because typically the number of // sides of the shrinking box will be much smaller than the // worst case bound of 2*dim. // // BEWARE: Note that constructor just copies the pointer to the // bounding array, but the destructor deallocates it. This is // rather poor practice, but happens to be convenient. The list // is allocated in the bd-tree building procedure rbd_tree() just // prior to construction, and is used for no other purposes. // // WARNING: In the near neighbor searching code it is assumed that // the list of bounding halfspaces is irredundant, meaning that there // are no two distinct halfspaces in the list with the same outward // pointing normals. 
//---------------------------------------------------------------------- class ANNbd_shrink : public ANNkd_node // splitting node of a kd-tree { int n_bnds; // number of bounding halfspaces ANNorthHSArray bnds; // list of bounding halfspaces ANNkd_ptr child[2]; // in and out children public: ANNbd_shrink( // constructor int nb, // number of bounding halfspaces ANNorthHSArray bds, // list of bounding halfspaces ANNkd_ptr ic=NULL, ANNkd_ptr oc=NULL) // children { n_bnds = nb; // cutting dimension bnds = bds; // assign bounds child[ANN_IN] = ic; // set children child[ANN_OUT] = oc; } ~ANNbd_shrink() // destructor { if (child[ANN_IN]!= NULL && child[ANN_IN]!= KD_TRIVIAL) delete child[ANN_IN]; if (child[ANN_OUT]!= NULL&& child[ANN_OUT]!= KD_TRIVIAL) delete child[ANN_OUT]; if (bnds != NULL) delete [] bnds; // delete bounds } virtual void getStats( // get tree statistics int dim, // dimension of space ANNkdStats &st, // statistics ANNorthRect &bnd_box); // bounding box virtual void print(int level, ostream &out);// print node virtual void dump(ostream &out); // dump node virtual void ann_search(ANNdist); // standard search virtual void ann_pri_search(ANNdist); // priority search virtual void ann_FR_search(ANNdist); // fixed-radius search }; #endif ================================================ FILE: src/ANN/brute.cpp ================================================ //---------------------------------------------------------------------- // File: brute.cpp // Programmer: Sunil Arya and David Mount // Description: Brute-force nearest neighbors // Last modified: 05/03/05 (Version 1.1) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). 
See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.1 05/03/05 // Added fixed-radius kNN search //---------------------------------------------------------------------- #include "ANNx.h" // all ANN includes #include "pr_queue_k.h" // k element priority queue //---------------------------------------------------------------------- // Brute-force search simply stores a pointer to the list of // data points and searches linearly for the nearest neighbor. // The k nearest neighbors are stored in a k-element priority // queue (which is implemented in a pretty dumb way as well). // // If ANN_ALLOW_SELF_MATCH is ANNfalse then data points at distance // zero are not considered. // // Note that the error bound eps is passed in, but it is ignored. // These routines compute exact nearest neighbors (which is needed // for validation purposes in ann_test.cpp). //---------------------------------------------------------------------- ANNbruteForce::ANNbruteForce( // constructor from point array ANNpointArray pa, // point array int n, // number of points int dd) // dimension { dim = dd; n_pts = n; pts = pa; } ANNbruteForce::~ANNbruteForce() { } // destructor (empty) void ANNbruteForce::annkSearch( // approx k near neighbor search ANNpoint q, // query point int k, // number of near neighbors to return ANNidxArray nn_idx, // nearest neighbor indices (returned) ANNdistArray dd, // dist to near neighbors (returned) double eps) // error bound (ignored) { ANNmin_k mk(k); // construct a k-limited priority queue int i; if (k > n_pts) { // too many near neighbors? 
annError("Requesting more near neighbors than data points", ANNabort); } // run every point through queue for (i = 0; i < n_pts; i++) { // compute distance to point ANNdist sqDist = annDist(dim, pts[i], q); if (ANN_ALLOW_SELF_MATCH || sqDist != 0) mk.insert(sqDist, i); } for (i = 0; i < k; i++) { // extract the k closest points dd[i] = mk.ith_smallest_key(i); nn_idx[i] = mk.ith_smallest_info(i); } } int ANNbruteForce::annkFRSearch( // approx fixed-radius kNN search ANNpoint q, // query point ANNdist sqRad, // squared radius int k, // number of near neighbors to return ANNidxArray nn_idx, // nearest neighbor array (returned) ANNdistArray dd, // dist to near neighbors (returned) double eps) // error bound { ANNmin_k mk(k); // construct a k-limited priority queue int i; int pts_in_range = 0; // number of points in query range // run every point through queue for (i = 0; i < n_pts; i++) { // compute distance to point ANNdist sqDist = annDist(dim, pts[i], q); if (sqDist <= sqRad && // within radius bound (ANN_ALLOW_SELF_MATCH || sqDist != 0)) { // ...and no self match mk.insert(sqDist, i); pts_in_range++; } } for (i = 0; i < k; i++) { // extract the k closest points if (dd != NULL) dd[i] = mk.ith_smallest_key(i); if (nn_idx != NULL) nn_idx[i] = mk.ith_smallest_info(i); } return pts_in_range; } // MFH: version that returns all points std::pair< std::vector<int>, std::vector<ANNdist> > ANNbruteForce::annkFRSearch2( // approx fixed-radius kNN search ANNpoint q, // query point ANNdist sqRad, // squared radius double eps) // error bound { std::vector<int> points; std::vector<ANNdist> dists; int i; int pts_in_range = 0; // number of points in query range // run every point through queue for (i = 0; i < n_pts; i++) { // compute distance to point ANNdist sqDist = annDist(dim, pts[i], q); if (sqDist <= sqRad && // within radius bound (ANN_ALLOW_SELF_MATCH || sqDist != 0)) { // ...and no self match points.push_back(i); dists.push_back(sqDist); pts_in_range++; } } return std::make_pair(points, dists); }
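Stripped of ANN's typedefs and error handling, the fixed-radius logic shared by both `annkFRSearch` variants reduces to a linear scan that keeps every point whose squared distance to the query is at most the squared radius. A minimal standalone sketch (the name `frSearchBrute` is hypothetical; plain `double`/`int` stand in for `ANNdist`/`ANNidx`, and the `ANN_ALLOW_SELF_MATCH` check is omitted):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Brute-force fixed-radius search: return indices and squared distances
// of all points within squared radius sqRad of the query point q.
static std::pair<std::vector<int>, std::vector<double>>
frSearchBrute(const std::vector<std::vector<double>>& pts,
              const std::vector<double>& q, double sqRad) {
    std::vector<int> idx;
    std::vector<double> dists;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        double d = 0.0;
        for (std::size_t j = 0; j < q.size(); ++j) {
            double t = pts[i][j] - q[j];
            d += t * t;                      // squared Euclidean distance
        }
        if (d <= sqRad) {                    // keep every in-range point
            idx.push_back(static_cast<int>(i));
            dists.push_back(d);
        }
    }
    return std::make_pair(idx, dists);
}
```

As in `annkFRSearch2`, squared distances are compared against a squared radius, so no square roots are ever taken; unlike the kd-tree search, this visits all n points.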
================================================ FILE: src/ANN/kd_dump.cpp ================================================ //---------------------------------------------------------------------- // File: kd_dump.cc // Programmer: David Mount // Description: Dump and Load for kd- and bd-trees // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.0 04/01/05 // Moved dump out of kd_tree.cc into this file. // Added kd-tree load constructor. // Revision 2/29/08 // added cstdlib and std:: along with cstdlib. and sting.h //---------------------------------------------------------------------- // This file contains routines for dumping kd-trees and bd-trees and // reloading them. (It is an abuse of policy to include both kd- and // bd-tree routines in the same file, sorry. There should be no problem // in deleting the bd- versions of the routines if they are not // desired.) 
//---------------------------------------------------------------------- #include <cstdio> #include <cstring> #include <cstdlib> //using namespace std; // make std:: available #include "kd_tree.h" // kd-tree declarations #include "bd_tree.h" // bd-tree declarations //---------------------------------------------------------------------- // Constants //---------------------------------------------------------------------- const int STRING_LEN = 500; // maximum string length // const double EPSILON = 1E-5; // small number for float comparison enum ANNtreeType {KD_TREE, BD_TREE}; // tree types (used in loading) //---------------------------------------------------------------------- // Procedure declarations //---------------------------------------------------------------------- static ANNkd_ptr annReadDump( // read dump file istream &in, // input stream ANNtreeType tree_type, // type of tree expected ANNpointArray &the_pts, // new points (if applic) ANNidxArray &the_pidx, // point indices (returned) int &the_dim, // dimension (returned) int &the_n_pts, // number of points (returned) int &the_bkt_size, // bucket size (returned) ANNpoint &the_bnd_box_lo, // low bounding point ANNpoint &the_bnd_box_hi); // high bounding point static ANNkd_ptr annReadTree( // read tree-part of dump file istream &in, // input stream ANNtreeType tree_type, // type of tree expected ANNidxArray the_pidx, // point indices (modified) int &next_idx); // next index (modified) //---------------------------------------------------------------------- // ANN kd- and bd-tree Dump Format // The dump file begins with a header containing the version of // ANN, an optional section containing the points, followed by // a description of the tree. The tree is printed in preorder. // // Format: // #ANN <version> [END_OF_LINE] // points <dim> <n_pts> (point coordinates: this is optional) // 0 <xxx> <xxx> ... <xxx> (point indices and coordinates) // 1 <xxx> <xxx> ... <xxx> // ... // tree <dim> <n_pts> <bkt_size> // <xxx> <xxx> ... <xxx> (lower end of bounding box) // <xxx> <xxx> ... <xxx> 
(upper end of bounding box) // If the tree is null, then a single line "null" is // output. Otherwise the nodes of the tree are printed // one per line in preorder. Leaves and splitting nodes // have the following formats: // Leaf node: // leaf <n_pts> <bkt[0]> <bkt[1]> ... <bkt[n-1]> // Splitting nodes: // split <cut_dim> <cut_val> <lo_bound> <hi_bound> // // For bd-trees: // // Shrinking nodes: // shrink <n_bnds> // <cut_dim> <cut_val> <side> // <cut_dim> <cut_val> <side> // ... (repeated n_bnds times) //---------------------------------------------------------------------- void ANNkd_tree::Dump( // dump entire tree ANNbool with_pts, // print points as well? ostream &out) // output stream { out << "#ANN " << ANNversion << "\n"; out.precision(ANNcoordPrec); // use full precision in dumping if (with_pts) { // print point coordinates out << "points " << dim << " " << n_pts << "\n"; for (int i = 0; i < n_pts; i++) { out << i << " "; annPrintPt(pts[i], dim, out); out << "\n"; } } out << "tree " // print tree elements << dim << " " << n_pts << " " << bkt_size << "\n"; annPrintPt(bnd_box_lo, dim, out); // print lower bound out << "\n"; annPrintPt(bnd_box_hi, dim, out); // print upper bound out << "\n"; if (root == NULL) // empty tree? 
out << "null\n"; else { root->dump(out); // invoke printing at root } out.precision(0); // restore default precision } void ANNkd_split::dump( // dump a splitting node ostream &out) // output stream { out << "split " << cut_dim << " " << cut_val << " "; out << cd_bnds[ANN_LO] << " " << cd_bnds[ANN_HI] << "\n"; child[ANN_LO]->dump(out); // print low child child[ANN_HI]->dump(out); // print high child } void ANNkd_leaf::dump( // dump a leaf node ostream &out) // output stream { if (this == KD_TRIVIAL) { // canonical trivial leaf node out << "leaf 0\n"; // leaf no points } else{ out << "leaf " << n_pts; for (int j = 0; j < n_pts; j++) { out << " " << bkt[j]; } out << "\n"; } } void ANNbd_shrink::dump( // dump a shrinking node ostream &out) // output stream { out << "shrink " << n_bnds << "\n"; for (int j = 0; j < n_bnds; j++) { out << bnds[j].cd << " " << bnds[j].cv << " " << bnds[j].sd << "\n"; } child[ANN_IN]->dump(out); // print in-child child[ANN_OUT]->dump(out); // print out-child } //---------------------------------------------------------------------- // Load kd-tree from dump file // This rebuilds a kd-tree which was dumped to a file. The dump // file contains all the basic tree information according to a // preorder traversal. We assume that the dump file also contains // point data. (This is to guarantee the consistency of the tree.) // If not, then an error is generated. // // Indirectly, this procedure allocates space for points, point // indices, all nodes in the tree, and the bounding box for the // tree. When the tree is destroyed, all but the points are // deallocated. // // This routine calls annReadDump to do all the work. 
//---------------------------------------------------------------------- ANNkd_tree::ANNkd_tree( // build from dump file istream &in) // input stream for dump file { int the_dim; // local dimension int the_n_pts; // local number of points int the_bkt_size; // local number of points ANNpoint the_bnd_box_lo; // low bounding point ANNpoint the_bnd_box_hi; // high bounding point ANNpointArray the_pts; // point storage ANNidxArray the_pidx; // point index storage ANNkd_ptr the_root; // root of the tree the_root = annReadDump( // read the dump file in, // input stream KD_TREE, // expecting a kd-tree the_pts, // point array (returned) the_pidx, // point indices (returned) the_dim, the_n_pts, the_bkt_size, // basic tree info (returned) the_bnd_box_lo, the_bnd_box_hi); // bounding box info (returned) // create a skeletal tree SkeletonTree(the_n_pts, the_dim, the_bkt_size, the_pts, the_pidx); bnd_box_lo = the_bnd_box_lo; bnd_box_hi = the_bnd_box_hi; root = the_root; // set the root } ANNbd_tree::ANNbd_tree( // build bd-tree from dump file istream &in) : ANNkd_tree() // input stream for dump file { int the_dim; // local dimension int the_n_pts; // local number of points int the_bkt_size; // local number of points ANNpoint the_bnd_box_lo; // low bounding point ANNpoint the_bnd_box_hi; // high bounding point ANNpointArray the_pts; // point storage ANNidxArray the_pidx; // point index storage ANNkd_ptr the_root; // root of the tree the_root = annReadDump( // read the dump file in, // input stream BD_TREE, // expecting a bd-tree the_pts, // point array (returned) the_pidx, // point indices (returned) the_dim, the_n_pts, the_bkt_size, // basic tree info (returned) the_bnd_box_lo, the_bnd_box_hi); // bounding box info (returned) // create a skeletal tree SkeletonTree(the_n_pts, the_dim, the_bkt_size, the_pts, the_pidx); bnd_box_lo = the_bnd_box_lo; bnd_box_hi = the_bnd_box_hi; root = the_root; // set the root } 
//---------------------------------------------------------------------- // annReadDump - read a dump file // // This procedure reads a dump file, constructs a kd-tree // and returns all the essential information needed to actually // construct the tree. Because this procedure is used for // constructing both kd-trees and bd-trees, the second argument // is used to indicate which type of tree we are expecting. //---------------------------------------------------------------------- static ANNkd_ptr annReadDump( istream &in, // input stream ANNtreeType tree_type, // type of tree expected ANNpointArray &the_pts, // new points (returned) ANNidxArray &the_pidx, // point indices (returned) int &the_dim, // dimension (returned) int &the_n_pts, // number of points (returned) int &the_bkt_size, // bucket size (returned) ANNpoint &the_bnd_box_lo, // low bounding point (ret'd) ANNpoint &the_bnd_box_hi) // high bounding point (ret'd) { int j; char str[STRING_LEN]; // storage for string char version[STRING_LEN]; // ANN version number ANNkd_ptr the_root = NULL; //------------------------------------------------------------------ // Input file header //------------------------------------------------------------------ in >> str; // input header if (strcmp(str, "#ANN") != 0) { // incorrect header annError("Incorrect header for dump file", ANNabort); } in.getline(version, STRING_LEN); // get version (ignore) //------------------------------------------------------------------ // Input the points // An array the_pts is allocated and points are read from // the dump file. 
//------------------------------------------------------------------ in >> str; // get major heading if (strcmp(str, "points") == 0) { // points section in >> the_dim; // input dimension in >> the_n_pts; // number of points // allocate point storage the_pts = annAllocPts(the_n_pts, the_dim); for (int i = 0; i < the_n_pts; i++) { // input point coordinates ANNidx idx; // point index in >> idx; // input point index if (idx < 0 || idx >= the_n_pts) { annError("Point index is out of range", ANNabort); } for (j = 0; j < the_dim; j++) { in >> the_pts[idx][j]; // read point coordinates } } in >> str; // get next major heading } else { // no points were input annError("Points must be supplied in the dump file", ANNabort); } //------------------------------------------------------------------ // Input the tree // After the basic header information, we invoke annReadTree // to do all the heavy work. We create our own array of // point indices (so we can pass them to annReadTree()) // but we do not deallocate them. They will be deallocated // when the tree is destroyed. //------------------------------------------------------------------ if (strcmp(str, "tree") == 0) { // tree section in >> the_dim; // read dimension in >> the_n_pts; // number of points in >> the_bkt_size; // bucket size the_bnd_box_lo = annAllocPt(the_dim); // allocate bounding box pts the_bnd_box_hi = annAllocPt(the_dim); for (j = 0; j < the_dim; j++) { // read bounding box low in >> the_bnd_box_lo[j]; } for (j = 0; j < the_dim; j++) { // read bounding box low in >> the_bnd_box_hi[j]; } the_pidx = new ANNidx[the_n_pts]; // allocate point index array int next_idx = 0; // number of indices filled // read the tree and indices the_root = annReadTree(in, tree_type, the_pidx, next_idx); if (next_idx != the_n_pts) { // didn't see all the points? annError("Didn't see as many points as expected", ANNwarn); } } else { annError("Illegal dump format. 
Expecting section heading", ANNabort); } return the_root; } //---------------------------------------------------------------------- // annReadTree - input tree and return pointer // // annReadTree reads in a node of the tree, makes any recursive // calls as needed to input the children of this node (if internal). // It returns a pointer to the node that was created. An array // of point indices is given along with a pointer to the next // available location in the array. As leaves are read, their // point indices are stored here, and the point buckets point // to the first entry in the array. // // Recall that these are the formats. The tree is given in // preorder. // // Leaf node: // leaf <n_pts> <bkt[0]> <bkt[1]> ... <bkt[n-1]> // Splitting nodes: // split <cut_dim> <cut_val> <lo_bound> <hi_bound> // // For bd-trees: // // Shrinking nodes: // shrink <n_bnds> // <cut_dim> <cut_val> <side> // <cut_dim> <cut_val> <side> // ... (repeated n_bnds times) //---------------------------------------------------------------------- static ANNkd_ptr annReadTree( istream &in, // input stream ANNtreeType tree_type, // type of tree expected ANNidxArray the_pidx, // point indices (modified) int &next_idx) // next index (modified) { char tag[STRING_LEN]; // tag (leaf, split, shrink) int n_pts; // number of points in leaf int cd; // cut dimension ANNcoord cv; // cut value ANNcoord lb; // low bound ANNcoord hb; // high bound int n_bnds; // number of bounding sides int sd; // which side in >> tag; // input node tag if (strcmp(tag, "null") == 0) { // null tree return NULL; } //------------------------------------------------------------------ // Read a leaf //------------------------------------------------------------------ if (strcmp(tag, "leaf") == 0) { // leaf node in >> n_pts; // input number of points int old_idx = next_idx; // save next_idx if (n_pts == 0) { // trivial leaf return KD_TRIVIAL; } else { for (int i = 0; i < n_pts; i++) { // input point indices in >> the_pidx[next_idx++]; // store in array of indices } } return new ANNkd_leaf(n_pts, &the_pidx[old_idx]); } 
//------------------------------------------------------------------ // Read a splitting node //------------------------------------------------------------------ else if (strcmp(tag, "split") == 0) { // splitting node in >> cd >> cv >> lb >> hb; // read low and high subtrees ANNkd_ptr lc = annReadTree(in, tree_type, the_pidx, next_idx); ANNkd_ptr hc = annReadTree(in, tree_type, the_pidx, next_idx); // create new node and return return new ANNkd_split(cd, cv, lb, hb, lc, hc); } //------------------------------------------------------------------ // Read a shrinking node (bd-tree only) //------------------------------------------------------------------ else if (strcmp(tag, "shrink") == 0) { // shrinking node if (tree_type != BD_TREE) { annError("Shrinking node not allowed in kd-tree", ANNabort); } in >> n_bnds; // number of bounding sides // allocate bounds array ANNorthHSArray bds = new ANNorthHalfSpace[n_bnds]; for (int i = 0; i < n_bnds; i++) { in >> cd >> cv >> sd; // input bounding halfspace // copy to array bds[i] = ANNorthHalfSpace(cd, cv, sd); } // read inner and outer subtrees ANNkd_ptr ic = annReadTree(in, tree_type, the_pidx, next_idx); ANNkd_ptr oc = annReadTree(in, tree_type, the_pidx, next_idx); // create new node and return return new ANNbd_shrink(n_bnds, bds, ic, oc); } else { annError("Illegal node type in dump file", ANNabort); //std::exit(0); // R objects... 
this approach to keep the compiler happy return NULL; // to keep the compiler happy } } ================================================ FILE: src/ANN/kd_fix_rad_search.cpp ================================================ //---------------------------------------------------------------------- // File: kd_fix_rad_search.cpp // Programmer: Sunil Arya and David Mount // Description: Standard kd-tree fixed-radius kNN search // Last modified: 05/03/05 (Version 1.1) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 1.1 05/03/05 // Initial release //---------------------------------------------------------------------- // MFH: the code was changed to return all fixed radius neighbors using // a std::vector called closest. #include "kd_fix_rad_search.h" // kd fixed-radius search decls #include <vector> //---------------------------------------------------------------------- // Approximate fixed-radius k nearest neighbor search // The squared radius is provided, and this procedure finds the // k nearest neighbors within the radius, and returns the total // number of points lying within the radius. // // The method used for searching the kd-tree is a variation of the // nearest neighbor search used in kd_search.cpp, except that the // radius of the search ball is known. 
We refer the reader to that // file for the explanation of the recursive search procedure. //---------------------------------------------------------------------- //---------------------------------------------------------------------- // To keep argument lists short, a number of global variables // are maintained which are common to all the recursive calls. // These are given below. //---------------------------------------------------------------------- int ANNkdFRDim; // dimension of space ANNpoint ANNkdFRQ; // query point ANNdist ANNkdFRSqRad; // squared radius search bound double ANNkdFRMaxErr; // max tolerable squared error ANNpointArray ANNkdFRPts; // the points ANNmin_k* ANNkdFRPointMK; // set of k closest points std::vector<int> closest; // MFH: set of all closest points std::vector<ANNdist> dists; // MFH: distances to all closest points int ANNkdFRPtsVisited; // total points visited int ANNkdFRPtsInRange; // number of points in the range //---------------------------------------------------------------------- // annkFRSearch - fixed radius search for k nearest neighbors //---------------------------------------------------------------------- // defunct: we use ANNkd_tree::annkFRSearch2 which stores all neighbors in the new structures // closest and dists. 
int ANNkd_tree::annkFRSearch( ANNpoint q, // the query point ANNdist sqRad, // squared radius search bound int k, // number of near neighbors to return ANNidxArray nn_idx, // nearest neighbor indices (returned) ANNdistArray dd, // dist to near neighbors (returned) double eps) // the error bound { ANNkdFRDim = dim; // copy arguments to static equivs ANNkdFRQ = q; ANNkdFRSqRad = sqRad; ANNkdFRPts = pts; ANNkdFRPtsVisited = 0; // initialize count of points visited ANNkdFRPtsInRange = 0; // ...and points in the range ANNkdFRMaxErr = ANN_POW(1.0 + eps); ANN_FLOP(2) // increment floating op count ANNkdFRPointMK = new ANNmin_k(k); // create set for closest k points // search starting at the root root->ann_FR_search(annBoxDistance(q, bnd_box_lo, bnd_box_hi, dim)); for (int i = 0; i < k; i++) { // extract the k-th closest points if (dd != NULL) dd[i] = ANNkdFRPointMK->ith_smallest_key(i); if (nn_idx != NULL) nn_idx[i] = ANNkdFRPointMK->ith_smallest_info(i); } delete ANNkdFRPointMK; // deallocate closest point set return ANNkdFRPtsInRange; // return final point count } // MFH this function returns all closest points std::pair< std::vector<int>, std::vector<ANNdist> > ANNkd_tree::annkFRSearch2( ANNpoint q, // the query point ANNdist sqRad, // squared radius search bound double eps) // the error bound { ANNkdFRDim = dim; // copy arguments to static equivs ANNkdFRQ = q; ANNkdFRSqRad = sqRad; ANNkdFRPts = pts; ANNkdFRPtsVisited = 0; // initialize count of points visited ANNkdFRPtsInRange = 0; // ...and points in the range ANNkdFRMaxErr = ANN_POW(1.0 + eps); ANN_FLOP(2) // increment floating op count //ANNkdFRPointMK = new ANNmin_k(k); // create set for closest k points closest.clear(); dists.clear(); // search starting at the root root->ann_FR_search(annBoxDistance(q, bnd_box_lo, bnd_box_hi, dim)); return std::make_pair(closest, dists); // return final point count } //---------------------------------------------------------------------- // kd_split::ann_FR_search - search a splitting node // Note: 
This routine is similar in structure to the standard kNN // search. It visits the subtree that is closer to the query point // first. For fixed-radius search, there is no benefit in visiting // one subtree before the other, but we maintain the same basic // code structure for the sake of uniformity. //---------------------------------------------------------------------- void ANNkd_split::ann_FR_search(ANNdist box_dist) { // check dist calc term condition if (ANNmaxPtsVisited != 0 && ANNkdFRPtsVisited > ANNmaxPtsVisited) return; // distance to cutting plane ANNcoord cut_diff = ANNkdFRQ[cut_dim] - cut_val; if (cut_diff < 0) { // left of cutting plane child[ANN_LO]->ann_FR_search(box_dist);// visit closer child first ANNcoord box_diff = cd_bnds[ANN_LO] - ANNkdFRQ[cut_dim]; if (box_diff < 0) // within bounds - ignore box_diff = 0; // distance to further box box_dist = (ANNdist) ANN_SUM(box_dist, ANN_DIFF(ANN_POW(box_diff), ANN_POW(cut_diff))); // visit further child if in range if (box_dist * ANNkdFRMaxErr <= ANNkdFRSqRad) child[ANN_HI]->ann_FR_search(box_dist); } else { // right of cutting plane child[ANN_HI]->ann_FR_search(box_dist);// visit closer child first ANNcoord box_diff = ANNkdFRQ[cut_dim] - cd_bnds[ANN_HI]; if (box_diff < 0) // within bounds - ignore box_diff = 0; // distance to further box box_dist = (ANNdist) ANN_SUM(box_dist, ANN_DIFF(ANN_POW(box_diff), ANN_POW(cut_diff))); // visit further child if close enough if (box_dist * ANNkdFRMaxErr <= ANNkdFRSqRad) child[ANN_LO]->ann_FR_search(box_dist); } ANN_FLOP(13) // increment floating ops ANN_SPL(1) // one more splitting node visited } //---------------------------------------------------------------------- // kd_leaf::ann_FR_search - search points in a leaf node // Note: The unreadability of this code is the result of // some fine tuning to replace indexing by pointer operations. 
//---------------------------------------------------------------------- void ANNkd_leaf::ann_FR_search(ANNdist box_dist) { ANNdist dist; // distance to data point ANNcoord* pp; // data coordinate pointer ANNcoord* qq; // query coordinate pointer ANNcoord t; int d; for (int i = 0; i < n_pts; i++) { // check points in bucket pp = ANNkdFRPts[bkt[i]]; // first coord of next data point qq = ANNkdFRQ; // first coord of query point dist = 0; for(d = 0; d < ANNkdFRDim; d++) { ANN_COORD(1) // one more coordinate hit ANN_FLOP(5) // increment floating ops t = *(qq++) - *(pp++); // compute length and adv coordinate // exceeds dist to k-th smallest? if( (dist = ANN_SUM(dist, ANN_POW(t))) > ANNkdFRSqRad) { break; } } if (d >= ANNkdFRDim && // among the k best? (ANN_ALLOW_SELF_MATCH || dist!=0.0)) { // and no self-match problem // add it to the list //ANNkdFRPointMK->insert(dist, bkt[i]); // MFH closest.push_back(bkt[i]); dists.push_back(dist); ANNkdFRPtsInRange++; // increment point count } } ANN_LEAF(1) // one more leaf node visited ANN_PTS(n_pts) // increment points visited ANNkdFRPtsVisited += n_pts; // increment number of points visited } ================================================ FILE: src/ANN/kd_fix_rad_search.h ================================================ //---------------------------------------------------------------------- // File: kd_fix_rad_search.h // Programmer: Sunil Arya and David Mount // Description: Standard kd-tree fixed-radius kNN search // Last modified: ??/??/?? (Version 1.1) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) 
and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 1.1 ??/??/?? // Initial release //---------------------------------------------------------------------- #ifndef ANN_kd_fix_rad_search_H #define ANN_kd_fix_rad_search_H #include "kd_tree.h" // kd-tree declarations #include "kd_util.h" // kd-tree utilities #include "pr_queue_k.h" // k-element priority queue #include "ANNperf.h" // performance evaluation //---------------------------------------------------------------------- // Global variables // These are active for the life of each call to // annRangeSearch(). They are set to save the number of // variables that need to be passed among the various search // procedures. //---------------------------------------------------------------------- extern ANNpoint ANNkdFRQ; // query point (static copy) #endif ================================================ FILE: src/ANN/kd_pr_search.cpp ================================================ //---------------------------------------------------------------------- // File: kd_pr_search.cpp // Programmer: Sunil Arya and David Mount // Description: Priority search for kd-trees // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. 
It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release //---------------------------------------------------------------------- #include "kd_pr_search.h" // kd priority search declarations //---------------------------------------------------------------------- // Approximate nearest neighbor searching by priority search. // The kd-tree is searched for an approximate nearest neighbor. // The point is returned through one of the arguments, and the // distance returned is the SQUARED distance to this point. // // The method used for searching the kd-tree is called priority // search. (It is described in Arya and Mount, ``Algorithms for // fast vector quantization,'' Proc. of DCC '93: Data Compression // Conference, eds. J. A. Storer and M. Cohn, IEEE Press, 1993, // 381--390.) // // The cell of the kd-tree containing the query point is located, // and cells are visited in increasing order of distance from the // query point. This is done by placing each subtree which has // NOT been visited in a priority queue, according to the closest // distance of the corresponding enclosing rectangle from the // query point. The search stops when the distance to the nearest // remaining rectangle exceeds the distance to the nearest point // seen by a factor of more than 1/(1+eps). (Implying that any // point found subsequently in the search cannot be closer by more // than this factor.) // // The main entry point is annkPriSearch() which sets things up and // then calls the recursive routine ann_pri_search(). This is a // recursive routine which performs the processing for one node in // the kd-tree. There are two versions of this virtual procedure, // one for splitting nodes and one for leaves.
When a splitting node // is visited, we determine which child to continue the search on // (the closer one), and insert the other child into the priority // queue. When a leaf is visited, we compute the distances to the // points in the buckets, and update information on the closest // points. // // Some trickery is used to incrementally update the distance from // a kd-tree rectangle to the query point. This comes about from // the fact that with each successive split, only one component // (along the dimension that is split) of the squared distance to // the child rectangle is different from the squared distance to // the parent rectangle. //---------------------------------------------------------------------- //---------------------------------------------------------------------- // To keep argument lists short, a number of global variables // are maintained which are common to all the recursive calls. // These are given below. //---------------------------------------------------------------------- double ANNprEps; // the error bound int ANNprDim; // dimension of space ANNpoint ANNprQ; // query point double ANNprMaxErr; // max tolerable squared error ANNpointArray ANNprPts; // the points ANNpr_queue *ANNprBoxPQ; // priority queue for boxes ANNmin_k *ANNprPointMK; // set of k closest points //---------------------------------------------------------------------- // annkPriSearch - priority search for k nearest neighbors //---------------------------------------------------------------------- void ANNkd_tree::annkPriSearch( ANNpoint q, // query point int k, // number of near neighbors to return ANNidxArray nn_idx, // nearest neighbor indices (returned) ANNdistArray dd, // dist to near neighbors (returned) double eps) // error bound (ignored) { // max tolerable squared error ANNprMaxErr = ANN_POW(1.0 + eps); ANN_FLOP(2) // increment floating ops ANNprDim = dim; // copy arguments to static equivs ANNprQ = q; ANNprPts = pts; ANNptsVisited = 0; // initialize count
of points visited ANNprPointMK = new ANNmin_k(k); // create set for closest k points // distance to root box ANNdist box_dist = annBoxDistance(q, bnd_box_lo, bnd_box_hi, dim); ANNprBoxPQ = new ANNpr_queue(n_pts);// create priority queue for boxes ANNprBoxPQ->insert(box_dist, root); // insert root in priority queue while (ANNprBoxPQ->non_empty() && (!(ANNmaxPtsVisited != 0 && ANNptsVisited > ANNmaxPtsVisited))) { ANNkd_ptr np; // next box from prior queue // extract closest box from queue ANNprBoxPQ->extr_min(box_dist, (void *&) np); ANN_FLOP(2) // increment floating ops if (box_dist*ANNprMaxErr >= ANNprPointMK->max_key()) break; np->ann_pri_search(box_dist); // search this subtree. } for (int i = 0; i < k; i++) { // extract the k-th closest points dd[i] = ANNprPointMK->ith_smallest_key(i); nn_idx[i] = ANNprPointMK->ith_smallest_info(i); } delete ANNprPointMK; // deallocate closest point set delete ANNprBoxPQ; // deallocate priority queue } //---------------------------------------------------------------------- // kd_split::ann_pri_search - search a splitting node //---------------------------------------------------------------------- void ANNkd_split::ann_pri_search(ANNdist box_dist) { ANNdist new_dist; // distance to child visited later // distance to cutting plane ANNcoord cut_diff = ANNprQ[cut_dim] - cut_val; if (cut_diff < 0) { // left of cutting plane ANNcoord box_diff = cd_bnds[ANN_LO] - ANNprQ[cut_dim]; if (box_diff < 0) // within bounds - ignore box_diff = 0; // distance to further box new_dist = (ANNdist) ANN_SUM(box_dist, ANN_DIFF(ANN_POW(box_diff), ANN_POW(cut_diff))); if (child[ANN_HI] != KD_TRIVIAL)// enqueue if not trivial ANNprBoxPQ->insert(new_dist, child[ANN_HI]); // continue with closer child child[ANN_LO]->ann_pri_search(box_dist); } else { // right of cutting plane ANNcoord box_diff = ANNprQ[cut_dim] - cd_bnds[ANN_HI]; if (box_diff < 0) // within bounds - ignore box_diff = 0; // distance to further box new_dist = (ANNdist) ANN_SUM(box_dist, 
ANN_DIFF(ANN_POW(box_diff), ANN_POW(cut_diff))); if (child[ANN_LO] != KD_TRIVIAL)// enqueue if not trivial ANNprBoxPQ->insert(new_dist, child[ANN_LO]); // continue with closer child child[ANN_HI]->ann_pri_search(box_dist); } ANN_SPL(1) // one more splitting node visited ANN_FLOP(8) // increment floating ops } //---------------------------------------------------------------------- // kd_leaf::ann_pri_search - search points in a leaf node // // This is virtually identical to the ann_search for standard search. //---------------------------------------------------------------------- void ANNkd_leaf::ann_pri_search(ANNdist box_dist) { ANNdist dist; // distance to data point ANNcoord* pp; // data coordinate pointer ANNcoord* qq; // query coordinate pointer ANNdist min_dist; // distance to k-th closest point ANNcoord t; int d; min_dist = ANNprPointMK->max_key(); // k-th smallest distance so far for (int i = 0; i < n_pts; i++) { // check points in bucket pp = ANNprPts[bkt[i]]; // first coord of next data point qq = ANNprQ; // first coord of query point dist = 0; for(d = 0; d < ANNprDim; d++) { ANN_COORD(1) // one more coordinate hit ANN_FLOP(4) // increment floating ops t = *(qq++) - *(pp++); // compute length and adv coordinate // exceeds dist to k-th smallest? if( (dist = ANN_SUM(dist, ANN_POW(t))) > min_dist) { break; } } if (d >= ANNprDim && // among the k best? 
(ANN_ALLOW_SELF_MATCH || dist!=0)) { // and no self-match problem // add it to the list ANNprPointMK->insert(dist, bkt[i]); min_dist = ANNprPointMK->max_key(); } } ANN_LEAF(1) // one more leaf node visited ANN_PTS(n_pts) // increment points visited ANNptsVisited += n_pts; // increment number of points visited } ================================================ FILE: src/ANN/kd_pr_search.h ================================================ //---------------------------------------------------------------------- // File: kd_pr_search.h // Programmer: Sunil Arya and David Mount // Description: Priority kd-tree search // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release //---------------------------------------------------------------------- #ifndef ANN_kd_pr_search_H #define ANN_kd_pr_search_H #include "kd_tree.h" // kd-tree declarations #include "kd_util.h" // kd-tree utilities #include "pr_queue.h" // priority queue declarations #include "pr_queue_k.h" // k-element priority queue #include "ANNperf.h" // performance evaluation //---------------------------------------------------------------------- // Global variables // Active for the life of each call to Appx_Near_Neigh() or // Appx_k_Near_Neigh(). 
//---------------------------------------------------------------------- extern double ANNprEps; // the error bound extern int ANNprDim; // dimension of space extern ANNpoint ANNprQ; // query point extern double ANNprMaxErr; // max tolerable squared error extern ANNpointArray ANNprPts; // the points extern ANNpr_queue *ANNprBoxPQ; // priority queue for boxes extern ANNmin_k *ANNprPointMK; // set of k closest points #endif ================================================ FILE: src/ANN/kd_search.cpp ================================================ //---------------------------------------------------------------------- // File: kd_search.cpp // Programmer: Sunil Arya and David Mount // Description: Standard kd-tree search // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.0 04/01/05 // Changed names LO, HI to ANN_LO, ANN_HI //---------------------------------------------------------------------- #include "kd_search.h" // kd-search declarations //---------------------------------------------------------------------- // Approximate nearest neighbor searching by kd-tree search // The kd-tree is searched for an approximate nearest neighbor. 
// The point is returned through one of the arguments, and the // distance returned is the squared distance to this point. // // The method used for searching the kd-tree is an approximate // adaptation of the search algorithm described by Friedman, // Bentley, and Finkel, ``An algorithm for finding best matches // in logarithmic expected time,'' ACM Transactions on Mathematical // Software, 3(3):209-226, 1977. // // The algorithm operates recursively. When first encountering a // node of the kd-tree we first visit the child which is closest to // the query point. On return, we decide whether we want to visit // the other child. If the box containing the other child exceeds // 1/(1+eps) times the current best distance, then we skip it (since // any point found in this child cannot be closer to the query point // by more than this factor.) Otherwise, we visit it recursively. // The distance between a box and the query point is computed exactly // (not approximated as is often done in kd-trees), using incremental // distance updates, as described by Arya and Mount in ``Algorithms // for fast vector quantization,'' Proc. of DCC '93: Data Compression // Conference, eds. J. A. Storer and M. Cohn, IEEE Press, 1993, // 381-390. // // The main entry point is annkSearch() which sets things up and // then calls the recursive routine ann_search(). This is a recursive // routine which performs the processing for one node in the kd-tree. // There are two versions of this virtual procedure, one for splitting // nodes and one for leaves. When a splitting node is visited, we // determine which child to visit first (the closer one), and visit // the other child on return. When a leaf is visited, we compute // the distances to the points in the buckets, and update information // on the closest points. // // Some trickery is used to incrementally update the distance from // a kd-tree rectangle to the query point.
This comes about from // the fact that with each successive split, only one component // (along the dimension that is split) of the squared distance to // the child rectangle is different from the squared distance to // the parent rectangle. //---------------------------------------------------------------------- //---------------------------------------------------------------------- // To keep argument lists short, a number of global variables // are maintained which are common to all the recursive calls. // These are given below. //---------------------------------------------------------------------- int ANNkdDim; // dimension of space ANNpoint ANNkdQ; // query point double ANNkdMaxErr; // max tolerable squared error ANNpointArray ANNkdPts; // the points ANNmin_k *ANNkdPointMK; // set of k closest points //---------------------------------------------------------------------- // annkSearch - search for the k nearest neighbors //---------------------------------------------------------------------- void ANNkd_tree::annkSearch( ANNpoint q, // the query point int k, // number of near neighbors to return ANNidxArray nn_idx, // nearest neighbor indices (returned) ANNdistArray dd, // dist to near neighbors (returned) double eps) // the error bound { ANNkdDim = dim; // copy arguments to static equivs ANNkdQ = q; ANNkdPts = pts; ANNptsVisited = 0; // initialize count of points visited if (k > n_pts) { // too many near neighbors?
annError("Requesting more near neighbors than data points", ANNabort); } ANNkdMaxErr = ANN_POW(1.0 + eps); ANN_FLOP(2) // increment floating op count ANNkdPointMK = new ANNmin_k(k); // create set for closest k points // search starting at the root root->ann_search(annBoxDistance(q, bnd_box_lo, bnd_box_hi, dim)); for (int i = 0; i < k; i++) { // extract the k-th closest points dd[i] = ANNkdPointMK->ith_smallest_key(i); nn_idx[i] = ANNkdPointMK->ith_smallest_info(i); } delete ANNkdPointMK; // deallocate closest point set } //---------------------------------------------------------------------- // kd_split::ann_search - search a splitting node //---------------------------------------------------------------------- void ANNkd_split::ann_search(ANNdist box_dist) { // check dist calc term condition if (ANNmaxPtsVisited != 0 && ANNptsVisited > ANNmaxPtsVisited) return; // distance to cutting plane ANNcoord cut_diff = ANNkdQ[cut_dim] - cut_val; if (cut_diff < 0) { // left of cutting plane child[ANN_LO]->ann_search(box_dist);// visit closer child first ANNcoord box_diff = cd_bnds[ANN_LO] - ANNkdQ[cut_dim]; if (box_diff < 0) // within bounds - ignore box_diff = 0; // distance to further box box_dist = (ANNdist) ANN_SUM(box_dist, ANN_DIFF(ANN_POW(box_diff), ANN_POW(cut_diff))); // visit further child if close enough if (box_dist * ANNkdMaxErr < ANNkdPointMK->max_key()) child[ANN_HI]->ann_search(box_dist); } else { // right of cutting plane child[ANN_HI]->ann_search(box_dist);// visit closer child first ANNcoord box_diff = ANNkdQ[cut_dim] - cd_bnds[ANN_HI]; if (box_diff < 0) // within bounds - ignore box_diff = 0; // distance to further box box_dist = (ANNdist) ANN_SUM(box_dist, ANN_DIFF(ANN_POW(box_diff), ANN_POW(cut_diff))); // visit further child if close enough if (box_dist * ANNkdMaxErr < ANNkdPointMK->max_key()) child[ANN_LO]->ann_search(box_dist); } ANN_FLOP(10) // increment floating ops ANN_SPL(1) // one more splitting node visited } 
//---------------------------------------------------------------------- // kd_leaf::ann_search - search points in a leaf node // Note: The unreadability of this code is the result of // some fine tuning to replace indexing by pointer operations. //---------------------------------------------------------------------- void ANNkd_leaf::ann_search(ANNdist box_dist) { ANNdist dist; // distance to data point ANNcoord* pp; // data coordinate pointer ANNcoord* qq; // query coordinate pointer ANNdist min_dist; // distance to k-th closest point ANNcoord t; int d; min_dist = ANNkdPointMK->max_key(); // k-th smallest distance so far for (int i = 0; i < n_pts; i++) { // check points in bucket pp = ANNkdPts[bkt[i]]; // first coord of next data point qq = ANNkdQ; // first coord of query point dist = 0; for(d = 0; d < ANNkdDim; d++) { ANN_COORD(1) // one more coordinate hit ANN_FLOP(4) // increment floating ops t = *(qq++) - *(pp++); // compute length and adv coordinate // exceeds dist to k-th smallest? if( (dist = ANN_SUM(dist, ANN_POW(t))) > min_dist) { break; } } if (d >= ANNkdDim && // among the k best? (ANN_ALLOW_SELF_MATCH || dist!=0)) { // and no self-match problem // add it to the list ANNkdPointMK->insert(dist, bkt[i]); min_dist = ANNkdPointMK->max_key(); } } ANN_LEAF(1) // one more leaf node visited ANN_PTS(n_pts) // increment points visited ANNptsVisited += n_pts; // increment number of points visited } ================================================ FILE: src/ANN/kd_search.h ================================================ //---------------------------------------------------------------------- // File: kd_search.h // Programmer: Sunil Arya and David Mount // Description: Standard kd-tree search // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. 
// // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release //---------------------------------------------------------------------- #ifndef ANN_kd_search_H #define ANN_kd_search_H #include "kd_tree.h" // kd-tree declarations #include "kd_util.h" // kd-tree utilities #include "pr_queue_k.h" // k-element priority queue #include "ANNperf.h" // performance evaluation //---------------------------------------------------------------------- // More global variables // These are active for the life of each call to annkSearch(). They // are set to save the number of variables that need to be passed // among the various search procedures. 
//---------------------------------------------------------------------- extern int ANNkdDim; // dimension of space (static copy) extern ANNpoint ANNkdQ; // query point (static copy) extern double ANNkdMaxErr; // max tolerable squared error extern ANNpointArray ANNkdPts; // the points (static copy) extern ANNmin_k *ANNkdPointMK; // set of k closest points extern int ANNptsVisited; // number of points visited #endif ================================================ FILE: src/ANN/kd_split.cpp ================================================ //---------------------------------------------------------------------- // File: kd_split.cpp // Programmer: Sunil Arya and David Mount // Description: Methods for splitting kd-trees // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. 
//---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.0 04/01/05 //---------------------------------------------------------------------- #include "kd_tree.h" // kd-tree definitions #include "kd_util.h" // kd-tree utilities #include "kd_split.h" // splitting functions //---------------------------------------------------------------------- // Constants //---------------------------------------------------------------------- const double EPS = 0.001; // a small value const double FS_ASPECT_RATIO = 3.0; // maximum allowed aspect ratio // in fair split. Must be >= 2. //---------------------------------------------------------------------- // kd_split - Bentley's standard splitting routine for kd-trees // Find the dimension of the greatest spread, and split // just before the median point along this dimension. //---------------------------------------------------------------------- void kd_split( ANNpointArray pa, // point array (permuted on return) ANNidxArray pidx, // point indices const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo) // num of points on low side (returned) { // find dimension of maximum spread cut_dim = annMaxSpread(pa, pidx, n, dim); n_lo = n/2; // median rank // split about median annMedianSplit(pa, pidx, n, cut_dim, cut_val, n_lo); } //---------------------------------------------------------------------- // midpt_split - midpoint splitting rule for box-decomposition trees // // This is the simplest splitting rule that guarantees boxes // of bounded aspect ratio. It simply cuts the box with the // longest side through its midpoint. If there are ties, it // selects the dimension with the maximum point spread. 
// // WARNING: This routine (while simple) doesn't seem to work // well in practice in high dimensions, because it tends to // generate a large number of trivial and/or unbalanced splits. // Either kd_split(), sl_midpt_split(), or fair_split() are // recommended, instead. //---------------------------------------------------------------------- void midpt_split( ANNpointArray pa, // point array ANNidxArray pidx, // point indices (permuted on return) const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo) // num of points on low side (returned) { int d; ANNcoord max_length = bnds.hi[0] - bnds.lo[0]; for (d = 1; d < dim; d++) { // find length of longest box side ANNcoord length = bnds.hi[d] - bnds.lo[d]; if (length > max_length) { max_length = length; } } ANNcoord max_spread = -1; // find long side with most spread for (d = 0; d < dim; d++) { // is it among longest? if (double(bnds.hi[d] - bnds.lo[d]) >= (1-EPS)*max_length) { // compute its spread ANNcoord spr = annSpread(pa, pidx, n, d); if (spr > max_spread) { // is it max so far? max_spread = spr; cut_dim = d; } } } // split along cut_dim at midpoint cut_val = (bnds.lo[cut_dim] + bnds.hi[cut_dim]) / 2; // permute points accordingly int br1, br2; annPlaneSplit(pa, pidx, n, cut_dim, cut_val, br1, br2); //------------------------------------------------------------------ // On return: pa[0..br1-1] < cut_val // pa[br1..br2-1] == cut_val // pa[br2..n-1] > cut_val // // We can set n_lo to any value in the range [br1..br2]. // We choose split so that points are most evenly divided. 
//------------------------------------------------------------------ if (br1 > n/2) n_lo = br1; else if (br2 < n/2) n_lo = br2; else n_lo = n/2; } //---------------------------------------------------------------------- // sl_midpt_split - sliding midpoint splitting rule // // This is a modification of midpt_split, which has the nonsensical // name "sliding midpoint". The idea is that we try to use the // midpoint rule, by bisecting the longest side. If there are // ties, the dimension with the maximum spread is selected. If, // however, the midpoint split produces a trivial split (no points // on one side of the splitting plane) then we slide the splitting // (maintaining its orientation) until it produces a nontrivial // split. For example, if the splitting plane is along the x-axis, // and all the data points have x-coordinate less than the x-bisector, // then the split is taken along the maximum x-coordinate of the // data points. // // Intuitively, this rule cannot generate trivial splits, and // hence avoids midpt_split's tendency to produce trees with // a very large number of nodes. // //---------------------------------------------------------------------- void sl_midpt_split( ANNpointArray pa, // point array ANNidxArray pidx, // point indices (permuted on return) const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo) // num of points on low side (returned) { int d; ANNcoord max_length = bnds.hi[0] - bnds.lo[0]; for (d = 1; d < dim; d++) { // find length of longest box side ANNcoord length = bnds.hi[d] - bnds.lo[d]; if (length > max_length) { max_length = length; } } ANNcoord max_spread = -1; // find long side with most spread for (d = 0; d < dim; d++) { // is it among longest? 
if ((bnds.hi[d] - bnds.lo[d]) >= (1-EPS)*max_length) { // compute its spread ANNcoord spr = annSpread(pa, pidx, n, d); if (spr > max_spread) { // is it max so far? max_spread = spr; cut_dim = d; } } } // ideal split at midpoint ANNcoord ideal_cut_val = (bnds.lo[cut_dim] + bnds.hi[cut_dim])/2; ANNcoord min, max; annMinMax(pa, pidx, n, cut_dim, min, max); // find min/max coordinates if (ideal_cut_val < min) // slide to min or max as needed cut_val = min; else if (ideal_cut_val > max) cut_val = max; else cut_val = ideal_cut_val; // permute points accordingly int br1, br2; annPlaneSplit(pa, pidx, n, cut_dim, cut_val, br1, br2); //------------------------------------------------------------------ // On return: pa[0..br1-1] < cut_val // pa[br1..br2-1] == cut_val // pa[br2..n-1] > cut_val // // We can set n_lo to any value in the range [br1..br2] to satisfy // the exit conditions of the procedure. // // if ideal_cut_val < min (implying br2 >= 1), // then we select n_lo = 1 (so there is one point on left) and // if ideal_cut_val > max (implying br1 <= n-1), // then we select n_lo = n-1 (so there is one point on right). // Otherwise, we select n_lo as close to n/2 as possible within // [br1..br2]. //------------------------------------------------------------------ if (ideal_cut_val < min) n_lo = 1; else if (ideal_cut_val > max) n_lo = n-1; else if (br1 > n/2) n_lo = br1; else if (br2 < n/2) n_lo = br2; else n_lo = n/2; } //---------------------------------------------------------------------- // fair_split - fair-split splitting rule // // This is a compromise between the kd-tree splitting rule (which // always splits data points at their median) and the midpoint // splitting rule (which always splits a box through its center. // The goal of this procedure is to achieve both nicely balanced // splits, and boxes of bounded aspect ratio. // // A constant FS_ASPECT_RATIO is defined. 
Given a box, those sides // which can be split so that the ratio of the longest to shortest // side does not exceed FS_ASPECT_RATIO are identified. Among these // sides, we select the one in which the points have the largest // spread. We then split the points in a manner which most evenly // distributes the points on either side of the splitting plane, // subject to maintaining the bound on the ratio of long to short // sides. To determine that the aspect ratio will be preserved, // we determine the longest side (other than this side), and // determine how narrowly we can cut this side, without causing the // aspect ratio bound to be exceeded (small_piece). // // This procedure is more robust than either kd_split or midpt_split, // but is more complicated as well. When point distribution is // extremely skewed, this degenerates to midpt_split (actually // 1/3 point split), and when the points are most evenly distributed, // this degenerates to kd-split. //---------------------------------------------------------------------- void fair_split( ANNpointArray pa, // point array ANNidxArray pidx, // point indices (permuted on return) const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo) // num of points on low side (returned) { int d; ANNcoord max_length = bnds.hi[0] - bnds.lo[0]; cut_dim = 0; for (d = 1; d < dim; d++) { // find length of longest box side ANNcoord length = bnds.hi[d] - bnds.lo[d]; if (length > max_length) { max_length = length; cut_dim = d; } } ANNcoord max_spread = 0; // find legal cut with max spread cut_dim = 0; for (d = 0; d < dim; d++) { ANNcoord length = bnds.hi[d] - bnds.lo[d]; // is this side midpoint splitable // without violating aspect ratio?
if (((double) max_length)*2.0/((double) length) <= FS_ASPECT_RATIO) { // compute spread along this dim ANNcoord spr = annSpread(pa, pidx, n, d); if (spr > max_spread) { // best spread so far max_spread = spr; cut_dim = d; // this is dimension to cut } } } max_length = 0; // find longest side other than cut_dim for (d = 0; d < dim; d++) { ANNcoord length = bnds.hi[d] - bnds.lo[d]; if (d != cut_dim && length > max_length) max_length = length; } // consider most extreme splits ANNcoord small_piece = max_length / FS_ASPECT_RATIO; ANNcoord lo_cut = bnds.lo[cut_dim] + small_piece;// lowest legal cut ANNcoord hi_cut = bnds.hi[cut_dim] - small_piece;// highest legal cut int br1, br2; // is median below lo_cut ? if (annSplitBalance(pa, pidx, n, cut_dim, lo_cut) >= 0) { cut_val = lo_cut; // cut at lo_cut annPlaneSplit(pa, pidx, n, cut_dim, cut_val, br1, br2); n_lo = br1; } // is median above hi_cut? else if (annSplitBalance(pa, pidx, n, cut_dim, hi_cut) <= 0) { cut_val = hi_cut; // cut at hi_cut annPlaneSplit(pa, pidx, n, cut_dim, cut_val, br1, br2); n_lo = br2; } else { // median cut preserves asp ratio n_lo = n/2; // split about median annMedianSplit(pa, pidx, n, cut_dim, cut_val, n_lo); } } //---------------------------------------------------------------------- // sl_fair_split - sliding fair split splitting rule // // Sliding fair split is a splitting rule that combines the // strengths of both fair split and sliding midpoint split. // Fair split tends to produce balanced splits when the points // are roughly uniformly distributed, but it can produce many // trivial splits when points are highly clustered. Sliding // midpoint never produces trivial splits, and shrinks boxes // nicely if points are highly clustered, but it may produce // rather unbalanced splits when points are unclustered but not // quite uniform.
// // Sliding fair split is based on the theory that there are two // types of splits that are "good": balanced splits that produce // fat boxes, and unbalanced splits provided the cell with fewer // points is fat. // // This splitting rule operates by first computing the longest // side of the current bounding box. Then it asks which sides // could be split (at the midpoint) and still satisfy the aspect // ratio bound with respect to this side. Among these, it selects // the side with the largest spread (as fair split would). It // then considers the most extreme cuts that would be allowed by // the aspect ratio bound. This is done by dividing the longest // side of the box by the aspect ratio bound. If the median cut // lies between these extreme cuts, then we use the median cut. // If not, then consider the extreme cut that is closer to the // median. If all the points lie to one side of this cut, then // we slide the cut until it hits the first point. This may // violate the aspect ratio bound, but will never generate empty // cells. However the sibling of every such skinny cell is fat, // and hence packing arguments still apply. 
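The "slide" step described above can be sketched in isolation. The following is a hypothetical 1-D illustration (the name `slide_cut` is ours, not ANN's): an ideal cut is clamped into the data's [min, max] range, so neither side of the resulting split can be empty.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical 1-D sketch of the "slide" step described above (not ANN
// code): if every point lies on one side of the ideal cut, slide the cut
// to the nearest data point so that neither side of the split is empty.
double slide_cut(const std::vector<double>& xs, double ideal_cut) {
    auto [mn, mx] = std::minmax_element(xs.begin(), xs.end());
    if (ideal_cut < *mn) return *mn;   // all points above: slide up to min
    if (ideal_cut > *mx) return *mx;   // all points below: slide down to max
    return ideal_cut;                  // cut is already nontrivial
}
```

In ANN itself the analogous clamping happens on one coordinate of the bounding box, followed by `annPlaneSplit` and the `n_lo = 1` / `n_lo = n-1` bookkeeping shown in the functions above.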
// //---------------------------------------------------------------------- void sl_fair_split( ANNpointArray pa, // point array ANNidxArray pidx, // point indices (permuted on return) const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo) // num of points on low side (returned) { int d; ANNcoord min, max; // min/max coordinates int br1, br2; // split break points ANNcoord max_length = bnds.hi[0] - bnds.lo[0]; cut_dim = 0; for (d = 1; d < dim; d++) { // find length of longest box side ANNcoord length = bnds.hi[d] - bnds.lo[d]; if (length > max_length) { max_length = length; cut_dim = d; } } ANNcoord max_spread = 0; // find legal cut with max spread cut_dim = 0; for (d = 0; d < dim; d++) { ANNcoord length = bnds.hi[d] - bnds.lo[d]; // is this side midpoint splitable // without violating aspect ratio? if (((double) max_length)*2.0/((double) length) <= FS_ASPECT_RATIO) { // compute spread along this dim ANNcoord spr = annSpread(pa, pidx, n, d); if (spr > max_spread) { // best spread so far max_spread = spr; cut_dim = d; // this is dimension to cut } } } max_length = 0; // find longest side other than cut_dim for (d = 0; d < dim; d++) { ANNcoord length = bnds.hi[d] - bnds.lo[d]; if (d != cut_dim && length > max_length) max_length = length; } // consider most extreme splits ANNcoord small_piece = max_length / FS_ASPECT_RATIO; ANNcoord lo_cut = bnds.lo[cut_dim] + small_piece;// lowest legal cut ANNcoord hi_cut = bnds.hi[cut_dim] - small_piece;// highest legal cut // find min and max along cut_dim annMinMax(pa, pidx, n, cut_dim, min, max); // is median below lo_cut? if (annSplitBalance(pa, pidx, n, cut_dim, lo_cut) >= 0) { if (max > lo_cut) { // are any points above lo_cut? 
cut_val = lo_cut; // cut at lo_cut annPlaneSplit(pa, pidx, n, cut_dim, cut_val, br1, br2); n_lo = br1; // balance if there are ties } else { // all points below lo_cut cut_val = max; // cut at max value annPlaneSplit(pa, pidx, n, cut_dim, cut_val, br1, br2); n_lo = n-1; } } // is median above hi_cut? else if (annSplitBalance(pa, pidx, n, cut_dim, hi_cut) <= 0) { if (min < hi_cut) { // are any points below hi_cut? cut_val = hi_cut; // cut at hi_cut annPlaneSplit(pa, pidx, n, cut_dim, cut_val, br1, br2); n_lo = br2; // balance if there are ties } else { // all points above hi_cut cut_val = min; // cut at min value annPlaneSplit(pa, pidx, n, cut_dim, cut_val, br1, br2); n_lo = 1; } } else { // median cut is good enough n_lo = n/2; // split about median annMedianSplit(pa, pidx, n, cut_dim, cut_val, n_lo); } } ================================================ FILE: src/ANN/kd_split.h ================================================ //---------------------------------------------------------------------- // File: kd_split.h // Programmer: Sunil Arya and David Mount // Description: Methods for splitting kd-trees // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. 
//---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release //---------------------------------------------------------------------- #ifndef ANN_KD_SPLIT_H #define ANN_KD_SPLIT_H #include "kd_tree.h" // kd-tree definitions //---------------------------------------------------------------------- // External entry points // These are all splitting procedures for kd-trees. //---------------------------------------------------------------------- void kd_split( // standard (optimized) kd-splitter ANNpointArray pa, // point array (unaltered) ANNidxArray pidx, // point indices (permuted on return) const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo); // num of points on low side (returned) void midpt_split( // midpoint kd-splitter ANNpointArray pa, // point array (unaltered) ANNidxArray pidx, // point indices (permuted on return) const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo); // num of points on low side (returned) void sl_midpt_split( // sliding midpoint kd-splitter ANNpointArray pa, // point array (unaltered) ANNidxArray pidx, // point indices (permuted on return) const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo); // num of points on low side (returned) void fair_split( // fair-split kd-splitter ANNpointArray pa, // point array (unaltered) ANNidxArray pidx, // point indices (permuted on return) const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space 
int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo); // num of points on low side (returned) void sl_fair_split( // sliding fair-split kd-splitter ANNpointArray pa, // point array (unaltered) ANNidxArray pidx, // point indices (permuted on return) const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo); // num of points on low side (returned) #endif ================================================ FILE: src/ANN/kd_tree.cpp ================================================ //---------------------------------------------------------------------- // File: kd_tree.cpp // Programmer: Sunil Arya and David Mount // Description: Basic methods for kd-trees. // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.0 04/01/05 // Increased aspect ratio bound (ANN_AR_TOOBIG) from 100 to 1000. // Fixed leaf counts to count trivial leaves. // Added optional pa, pi arguments to Skeleton kd_tree constructor // for use in load constructor. // Added annClose() to eliminate KD_TRIVIAL memory leak. 
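The annClose() fix mentioned in the history above can be illustrated with a stripped-down version of the shared-sentinel pattern (a minimal sketch with our own names, not the ANN classes): one canonical empty leaf is shared by every parent, destruction must skip it, and a single close call frees it exactly once.

```cpp
#include <cstddef>

// Minimal sketch (ours, not ANN's classes) of the KD_TRIVIAL pattern:
// a single shared empty leaf, skipped during destruction and freed once.
struct Leaf { int n_pts; };
static Leaf* TRIVIAL = nullptr;        // shared empty-leaf sentinel

Leaf* make_leaf(int n) {
    if (n == 0) {
        if (TRIVIAL == nullptr) TRIVIAL = new Leaf{0};  // first use: allocate
        return TRIVIAL;                // every empty leaf is the same node
    }
    return new Leaf{n};
}

void destroy_leaf(Leaf* l) {
    if (l != nullptr && l != TRIVIAL) delete l;  // never free the sentinel here
}

void close_leaves() {                  // analogous to annClose()
    delete TRIVIAL;
    TRIVIAL = nullptr;
}
```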
//---------------------------------------------------------------------- #include "kd_tree.h" // kd-tree declarations #include "kd_split.h" // kd-tree splitting rules #include "kd_util.h" // kd-tree utilities #include "ANNperf.h" // performance evaluation //---------------------------------------------------------------------- // Global data // // For some splitting rules, especially with small bucket sizes, // it is possible to generate a large number of empty leaf nodes. // To save storage we allocate a single trivial leaf node which // contains no points. For messy coding reasons it is convenient // to have it reference a trivial point index. // // KD_TRIVIAL is allocated when the first kd-tree is created. It // must *never* be deallocated (since it may be shared by more than // one tree). //---------------------------------------------------------------------- static int IDX_TRIVIAL[] = {0}; // trivial point index ANNkd_leaf *KD_TRIVIAL = NULL; // trivial leaf node //---------------------------------------------------------------------- // Printing the kd-tree // These routines print a kd-tree in reverse inorder (high then // root then low). (This is so that if you look at the output // from the right side it appears from left to right in standard // inorder.) When outputting leaves we output only the point // indices rather than the point coordinates. There is an option // to print the point coordinates separately. // // The tree printing routine calls the printing routines on the // individual nodes of the tree, passing in the level or depth // in the tree. The level in the tree is used to print indentation // for readability.
//---------------------------------------------------------------------- void ANNkd_split::print( // print splitting node int level, // depth of node in tree ostream &out) // output stream { child[ANN_HI]->print(level+1, out); // print high child out << " "; for (int i = 0; i < level; i++) // print indentation out << ".."; out << "Split cd=" << cut_dim << " cv=" << cut_val; out << " lbnd=" << cd_bnds[ANN_LO]; out << " hbnd=" << cd_bnds[ANN_HI]; out << "\n"; child[ANN_LO]->print(level+1, out); // print low child } void ANNkd_leaf::print( // print leaf node int level, // depth of node in tree ostream &out) // output stream { out << " "; for (int i = 0; i < level; i++) // print indentation out << ".."; if (this == KD_TRIVIAL) { // canonical trivial leaf node out << "Leaf (trivial)\n"; } else{ out << "Leaf n=" << n_pts << " <"; for (int j = 0; j < n_pts; j++) { out << bkt[j]; if (j < n_pts-1) out << ","; } out << ">\n"; } } void ANNkd_tree::Print( // print entire tree ANNbool with_pts, // print points as well? ostream &out) // output stream { out << "ANN Version " << ANNversion << "\n"; if (with_pts) { // print point coordinates out << " Points:\n"; for (int i = 0; i < n_pts; i++) { out << "\t" << i << ": "; annPrintPt(pts[i], dim, out); out << "\n"; } } if (root == NULL) // empty tree? out << " Null tree.\n"; else { root->print(0, out); // invoke printing at root } } //---------------------------------------------------------------------- // kd_tree statistics (for performance evaluation) // This routine computes various statistics for // a kd-tree. It is used by the implementors for performance // evaluation of the data structure. //---------------------------------------------------------------------- #define MAX(a,b) ((a) > (b) ?
(a) : (b)) void ANNkdStats::merge(const ANNkdStats &st) // merge stats from child { n_lf += st.n_lf; n_tl += st.n_tl; n_spl += st.n_spl; n_shr += st.n_shr; depth = MAX(depth, st.depth); sum_ar += st.sum_ar; } //---------------------------------------------------------------------- // Update statistics for nodes //---------------------------------------------------------------------- const double ANN_AR_TOOBIG = 1000; // too big an aspect ratio void ANNkd_leaf::getStats( // get subtree statistics int dim, // dimension of space ANNkdStats &st, // stats (modified) ANNorthRect &bnd_box) // bounding box { st.reset(); st.n_lf = 1; // count this leaf if (this == KD_TRIVIAL) st.n_tl = 1; // count trivial leaf double ar = annAspectRatio(dim, bnd_box); // aspect ratio of leaf // incr sum (ignore outliers) st.sum_ar += float(ar < ANN_AR_TOOBIG ? ar : ANN_AR_TOOBIG); } void ANNkd_split::getStats( // get subtree statistics int dim, // dimension of space ANNkdStats &st, // stats (modified) ANNorthRect &bnd_box) // bounding box { ANNkdStats ch_stats; // stats for children // get stats for low child ANNcoord hv = bnd_box.hi[cut_dim]; // save box bounds bnd_box.hi[cut_dim] = cut_val; // upper bound for low child ch_stats.reset(); // reset child[ANN_LO]->getStats(dim, ch_stats, bnd_box); st.merge(ch_stats); // merge them bnd_box.hi[cut_dim] = hv; // restore bound // get stats for high child ANNcoord lv = bnd_box.lo[cut_dim]; // save box bounds bnd_box.lo[cut_dim] = cut_val; // lower bound for high child ch_stats.reset(); // reset child[ANN_HI]->getStats(dim, ch_stats, bnd_box); st.merge(ch_stats); // merge them bnd_box.lo[cut_dim] = lv; // restore bound st.depth++; // increment depth st.n_spl++; // increment number of splits } //---------------------------------------------------------------------- // getStats // Collects a number of statistics related to kd_tree or // bd_tree. 
//---------------------------------------------------------------------- void ANNkd_tree::getStats( // get tree statistics ANNkdStats &st) // stats (modified) { st.reset(dim, n_pts, bkt_size); // reset stats // create bounding box ANNorthRect bnd_box(dim, bnd_box_lo, bnd_box_hi); if (root != NULL) { // if nonempty tree root->getStats(dim, st, bnd_box); // get statistics st.avg_ar = st.sum_ar / st.n_lf; // average leaf asp ratio } } //---------------------------------------------------------------------- // kd_tree destructor // The destructor just frees the various elements that were // allocated in the construction process. //---------------------------------------------------------------------- ANNkd_tree::~ANNkd_tree() // tree destructor { if (root != NULL) delete root; if (pidx != NULL) delete [] pidx; if (bnd_box_lo != NULL) annDeallocPt(bnd_box_lo); if (bnd_box_hi != NULL) annDeallocPt(bnd_box_hi); } //---------------------------------------------------------------------- // This is called when all use of ANN is finished. It eliminates the // minor memory leak caused by the allocation of KD_TRIVIAL. //---------------------------------------------------------------------- void annClose() // close use of ANN { if (KD_TRIVIAL != NULL) { delete KD_TRIVIAL; KD_TRIVIAL = NULL; } } //---------------------------------------------------------------------- // kd_tree constructors // There is a skeleton kd-tree constructor which sets up a // trivial empty tree. The last optional argument allows // the routine to be passed a point index array which is // assumed to be of the proper size (n). Otherwise, one is // allocated and initialized to the identity. Warning: In // either case the destructor will deallocate this array. // // As a kludge, we need to allocate KD_TRIVIAL if one has not // already been allocated. (This is because I'm too dumb to // figure out how to cause a pointer to be allocated at load // time.)
//---------------------------------------------------------------------- void ANNkd_tree::SkeletonTree( // construct skeleton tree int n, // number of points int dd, // dimension int bs, // bucket size ANNpointArray pa, // point array ANNidxArray pi) // point indices { dim = dd; // initialize basic elements n_pts = n; bkt_size = bs; pts = pa; // initialize points array root = NULL; // no associated tree yet if (pi == NULL) { // point indices provided? pidx = new ANNidx[n]; // no, allocate space for point indices for (int i = 0; i < n; i++) { pidx[i] = i; // initially identity } } else { pidx = pi; // yes, use them } bnd_box_lo = bnd_box_hi = NULL; // bounding box is nonexistent if (KD_TRIVIAL == NULL) // no trivial leaf node yet? KD_TRIVIAL = new ANNkd_leaf(0, IDX_TRIVIAL); // allocate it } ANNkd_tree::ANNkd_tree( // basic constructor int n, // number of points int dd, // dimension int bs) // bucket size { SkeletonTree(n, dd, bs); } // construct skeleton tree //---------------------------------------------------------------------- // rkd_tree - recursive procedure to build a kd-tree // // Builds a kd-tree for points in pa as indexed through the // array pidx[0..n-1] (typically a subarray of the array used in // the top-level call). This routine permutes the array pidx, // but does not alter pa[]. // // The construction is based on a standard algorithm for constructing // the kd-tree (see Friedman, Bentley, and Finkel, ``An algorithm for // finding best matches in logarithmic expected time,'' ACM Transactions // on Mathematical Software, 3(3):209-226, 1977). The procedure // operates by a simple divide-and-conquer strategy, which determines // an appropriate orthogonal cutting plane (see below), and splits // the points. When the number of points falls below the bucket size, // we simply store the points in a leaf node's bucket. 
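The divide-and-conquer construction just described can be sketched with a toy 1-D median-split build (our own simplified code, not rkd_tree itself): recurse until a node holds at most bucket_size points, otherwise split at the median and build the two children.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy sketch (not rkd_tree itself) of the recursion described above:
// make a leaf when n <= bucket_size, otherwise median-split and recurse.
struct ToyNode {
    std::vector<double> bucket;              // leaf: points live here
    double cut = 0;                          // split node: cutting value
    ToyNode *lo = nullptr, *hi = nullptr;    // split node: children
};

ToyNode* toy_build(std::vector<double> pts, std::size_t bucket_size) {
    ToyNode* node = new ToyNode;
    if (pts.size() <= bucket_size) {         // small: store in a bucket
        node->bucket = pts;
        return node;
    }
    std::size_t mid = pts.size() / 2;        // split at the median element
    std::nth_element(pts.begin(), pts.begin() + mid, pts.end());
    node->cut = pts[mid];
    node->lo = toy_build({pts.begin(), pts.begin() + mid}, bucket_size);
    node->hi = toy_build({pts.begin() + mid, pts.end()}, bucket_size);
    return node;
}
```

The real rkd_tree additionally threads a bounding box and a pluggable splitter through the recursion, and permutes a shared index array instead of copying points.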
// // One of the arguments is a pointer to a splitting routine, // whose prototype is: // // void split( // ANNpointArray pa, // complete point array // ANNidxArray pidx, // point array (permuted on return) // ANNorthRect &bnds, // bounds of current cell // int n, // number of points // int dim, // dimension of space // int &cut_dim, // cutting dimension // ANNcoord &cut_val, // cutting value // int &n_lo) // no. of points on low side of cut // // This procedure selects a cutting dimension and cutting value, // partitions pa about these values, and returns the number of // points on the low side of the cut. //---------------------------------------------------------------------- ANNkd_ptr rkd_tree( // recursive construction of kd-tree ANNpointArray pa, // point array ANNidxArray pidx, // point indices to store in subtree int n, // number of points int dim, // dimension of space int bsp, // bucket space ANNorthRect &bnd_box, // bounding box for current node ANNkd_splitter splitter) // splitting routine { if (n <= bsp) { // n small, make a leaf node if (n == 0) // empty leaf node return KD_TRIVIAL; // return (canonical) empty leaf else // construct the node and return return new ANNkd_leaf(n, pidx); } else { // n large, make a splitting node int cd; // cutting dimension ANNcoord cv; // cutting value int n_lo; // number on low side of cut ANNkd_node *lo, *hi; // low and high children // invoke splitting procedure (*splitter)(pa, pidx, bnd_box, n, dim, cd, cv, n_lo); ANNcoord lv = bnd_box.lo[cd]; // save bounds for cutting dimension ANNcoord hv = bnd_box.hi[cd]; bnd_box.hi[cd] = cv; // modify bounds for left subtree lo = rkd_tree( // build left subtree pa, pidx, n_lo, // ...from pidx[0..n_lo-1] dim, bsp, bnd_box, splitter); bnd_box.hi[cd] = hv; // restore bounds bnd_box.lo[cd] = cv; // modify bounds for right subtree hi = rkd_tree( // build right subtree pa, pidx + n_lo, n-n_lo,// ...from pidx[n_lo..n-1] dim, bsp, bnd_box, splitter); bnd_box.lo[cd] = lv; // restore 
bounds // create the splitting node ANNkd_split *ptr = new ANNkd_split(cd, cv, lv, hv, lo, hi); return ptr; // return pointer to this node } } //---------------------------------------------------------------------- // kd-tree constructor // This is the main constructor for kd-trees given a set of points. // It first builds a skeleton tree, then computes the bounding box // of the data points, and then invokes rkd_tree() to actually // build the tree, passing it the appropriate splitting routine. //---------------------------------------------------------------------- ANNkd_tree::ANNkd_tree( // construct from point array ANNpointArray pa, // point array (with at least n pts) int n, // number of points int dd, // dimension int bs, // bucket size ANNsplitRule split) // splitting method { SkeletonTree(n, dd, bs); // set up the basic stuff pts = pa; // where the points are if (n == 0) return; // no points--no sweat ANNorthRect bnd_box(dd); // bounding box for points annEnclRect(pa, pidx, n, dd, bnd_box);// construct bounding rectangle // copy to tree structure bnd_box_lo = annCopyPt(dd, bnd_box.lo); bnd_box_hi = annCopyPt(dd, bnd_box.hi); switch (split) { // build by rule case ANN_KD_STD: // standard kd-splitting rule root = rkd_tree(pa, pidx, n, dd, bs, bnd_box, kd_split); break; case ANN_KD_MIDPT: // midpoint split root = rkd_tree(pa, pidx, n, dd, bs, bnd_box, midpt_split); break; case ANN_KD_FAIR: // fair split root = rkd_tree(pa, pidx, n, dd, bs, bnd_box, fair_split); break; case ANN_KD_SUGGEST: // best (in our opinion) case ANN_KD_SL_MIDPT: // sliding midpoint split root = rkd_tree(pa, pidx, n, dd, bs, bnd_box, sl_midpt_split); break; case ANN_KD_SL_FAIR: // sliding fair split root = rkd_tree(pa, pidx, n, dd, bs, bnd_box, sl_fair_split); break; default: annError("Illegal splitting method", ANNabort); } } ================================================ FILE: src/ANN/kd_tree.h ================================================ 
//---------------------------------------------------------------------- // File: kd_tree.h // Programmer: Sunil Arya and David Mount // Description: Declarations for standard kd-tree routines // Last modified: 05/03/05 (Version 1.1) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.1 05/03/05 // Added fixed radius kNN search //---------------------------------------------------------------------- #ifndef ANN_kd_tree_H #define ANN_kd_tree_H #include "ANNx.h" // all ANN includes using namespace std; // make std:: available //---------------------------------------------------------------------- // Generic kd-tree node // // Nodes in kd-trees are of two types, splitting nodes which contain // splitting information (a splitting hyperplane orthogonal to one // of the coordinate axes) and leaf nodes which contain point // information (an array of points stored in a bucket). This is // handled by making a generic class kd_node, which is essentially an // empty shell, and then deriving the leaf and splitting nodes from // this. 
//---------------------------------------------------------------------- class ANNkd_node{ // generic kd-tree node (empty shell) public: virtual ~ANNkd_node() {} // virtual destructor virtual void ann_search(ANNdist) = 0; // tree search virtual void ann_pri_search(ANNdist) = 0; // priority search virtual void ann_FR_search(ANNdist) = 0; // fixed-radius search virtual void getStats( // get tree statistics int dim, // dimension of space ANNkdStats &st, // statistics ANNorthRect &bnd_box) = 0; // bounding box // print node virtual void print(int level, ostream &out) = 0; virtual void dump(ostream &out) = 0; // dump node friend class ANNkd_tree; // allow kd-tree to access us }; //---------------------------------------------------------------------- // kd-splitting function: // kd_splitter is a pointer to a splitting routine for preprocessing. // Different splitting procedures result in different strategies // for building the tree. //---------------------------------------------------------------------- typedef void (*ANNkd_splitter)( // splitting routine for kd-trees ANNpointArray pa, // point array (unaltered) ANNidxArray pidx, // point indices (permuted on return) const ANNorthRect &bnds, // bounding rectangle for cell int n, // number of points int dim, // dimension of space int &cut_dim, // cutting dimension (returned) ANNcoord &cut_val, // cutting value (returned) int &n_lo); // num of points on low side (returned) //---------------------------------------------------------------------- // Leaf kd-tree node // Leaf nodes of the kd-tree store the set of points associated // with this bucket, stored as an array of point indices. These // are indices in the array points, which resides with the // root of the kd-tree. We also store the number of points // that reside in this bucket. //---------------------------------------------------------------------- class ANNkd_leaf: public ANNkd_node // leaf node for kd-tree { int n_pts; // no.
points in bucket ANNidxArray bkt; // bucket of points public: ANNkd_leaf( // constructor int n, // number of points ANNidxArray b) // bucket { n_pts = n; // number of points in bucket bkt = b; // the bucket } ~ANNkd_leaf() { } // destructor (none) virtual void getStats( // get tree statistics int dim, // dimension of space ANNkdStats &st, // statistics ANNorthRect &bnd_box); // bounding box virtual void print(int level, ostream &out);// print node virtual void dump(ostream &out); // dump node virtual void ann_search(ANNdist); // standard search virtual void ann_pri_search(ANNdist); // priority search virtual void ann_FR_search(ANNdist); // fixed-radius search }; //---------------------------------------------------------------------- // KD_TRIVIAL is a special pointer to an empty leaf node. Since // some splitting rules generate many (more than 50%) trivial // leaves, we use this one shared node to save space. // // The pointer is initialized to NULL, but whenever a kd-tree is // created, we allocate this node, if it has not already been // allocated. This node is *never* deallocated, so it produces // a small memory leak. //---------------------------------------------------------------------- extern ANNkd_leaf *KD_TRIVIAL; // trivial (empty) leaf node //---------------------------------------------------------------------- // kd-tree splitting node. // Splitting nodes contain a cutting dimension and a cutting value. // These indicate the axis-parallel plane which subdivides the // box for this node. The extent of the bounding box along the // cutting dimension is maintained (this is used to speed up point // to box distance calculations) [we do not store the entire bounding // box since this may be wasteful of space in high dimensions]. // We also store pointers to the 2 children.
//---------------------------------------------------------------------- class ANNkd_split : public ANNkd_node // splitting node of a kd-tree { int cut_dim; // dim orthogonal to cutting plane ANNcoord cut_val; // location of cutting plane ANNcoord cd_bnds[2]; // lower and upper bounds of // rectangle along cut_dim ANNkd_ptr child[2]; // left and right children public: ANNkd_split( // constructor int cd, // cutting dimension ANNcoord cv, // cutting value ANNcoord lv, ANNcoord hv, // low and high values ANNkd_ptr lc=NULL, ANNkd_ptr hc=NULL) // children { cut_dim = cd; // cutting dimension cut_val = cv; // cutting value cd_bnds[ANN_LO] = lv; // lower bound for rectangle cd_bnds[ANN_HI] = hv; // upper bound for rectangle child[ANN_LO] = lc; // left child child[ANN_HI] = hc; // right child } ~ANNkd_split() // destructor { if (child[ANN_LO]!= NULL && child[ANN_LO]!= KD_TRIVIAL) delete child[ANN_LO]; if (child[ANN_HI]!= NULL && child[ANN_HI]!= KD_TRIVIAL) delete child[ANN_HI]; } virtual void getStats( // get tree statistics int dim, // dimension of space ANNkdStats &st, // statistics ANNorthRect &bnd_box); // bounding box virtual void print(int level, ostream &out);// print node virtual void dump(ostream &out); // dump node virtual void ann_search(ANNdist); // standard search virtual void ann_pri_search(ANNdist); // priority search virtual void ann_FR_search(ANNdist); // fixed-radius search }; //---------------------------------------------------------------------- // External entry points //---------------------------------------------------------------------- ANNkd_ptr rkd_tree( // recursive construction of kd-tree ANNpointArray pa, // point array (unaltered) ANNidxArray pidx, // point indices to store in subtree int n, // number of points int dim, // dimension of space int bsp, // bucket space ANNorthRect &bnd_box, // bounding box for current node ANNkd_splitter splitter); // splitting routine #endif ================================================ FILE: 
src/ANN/kd_util.cpp ================================================ //---------------------------------------------------------------------- // File: kd_util.cpp // Programmer: Sunil Arya and David Mount // Description: Common utilities for kd-trees // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release //---------------------------------------------------------------------- #include "kd_util.h" // kd-utility declarations #include "ANNperf.h" // performance evaluation //---------------------------------------------------------------------- // The following routines are utility functions for manipulating // points sets, used in determining splitting planes for kd-tree // construction. //---------------------------------------------------------------------- //---------------------------------------------------------------------- // NOTE: Virtually all point indexing is done through an index (i.e. // permutation) array pidx. Consequently, a reference to the d-th // coordinate of the i-th point is pa[pidx[i]][d]. The macro PA(i,d) // is a shorthand for this. 
//---------------------------------------------------------------------- // standard 2-d indirect indexing #define PA(i,d) (pa[pidx[(i)]][(d)]) // accessing a single point #define PP(i) (pa[pidx[(i)]]) //---------------------------------------------------------------------- // annAspectRatio // Compute the aspect ratio (ratio of longest to shortest side) // of a rectangle. //---------------------------------------------------------------------- double annAspectRatio( int dim, // dimension const ANNorthRect &bnd_box) // bounding cube { ANNcoord length = bnd_box.hi[0] - bnd_box.lo[0]; ANNcoord min_length = length; // min side length ANNcoord max_length = length; // max side length for (int d = 0; d < dim; d++) { length = bnd_box.hi[d] - bnd_box.lo[d]; if (length < min_length) min_length = length; if (length > max_length) max_length = length; } return max_length/min_length; } //---------------------------------------------------------------------- // annEnclRect, annEnclCube // These utilities compute the smallest rectangle and cube enclosing // a set of points, respectively. 
//---------------------------------------------------------------------- void annEnclRect( ANNpointArray pa, // point array ANNidxArray pidx, // point indices int n, // number of points int dim, // dimension ANNorthRect &bnds) // bounding cube (returned) { for (int d = 0; d < dim; d++) { // find smallest enclosing rectangle ANNcoord lo_bnd = PA(0,d); // lower bound on dimension d ANNcoord hi_bnd = PA(0,d); // upper bound on dimension d for (int i = 0; i < n; i++) { if (PA(i,d) < lo_bnd) lo_bnd = PA(i,d); else if (PA(i,d) > hi_bnd) hi_bnd = PA(i,d); } bnds.lo[d] = lo_bnd; bnds.hi[d] = hi_bnd; } } void annEnclCube( // compute smallest enclosing cube ANNpointArray pa, // point array ANNidxArray pidx, // point indices int n, // number of points int dim, // dimension ANNorthRect &bnds) // bounding cube (returned) { int d; // compute smallest enclosing rect annEnclRect(pa, pidx, n, dim, bnds); ANNcoord max_len = 0; // max length of any side for (d = 0; d < dim; d++) { // determine max side length ANNcoord len = bnds.hi[d] - bnds.lo[d]; if (len > max_len) { // update max_len if longest max_len = len; } } for (d = 0; d < dim; d++) { // grow sides to match max ANNcoord len = bnds.hi[d] - bnds.lo[d]; ANNcoord half_diff = (max_len - len) / 2; bnds.lo[d] -= half_diff; bnds.hi[d] += half_diff; } } //---------------------------------------------------------------------- // annBoxDistance - utility routine which computes distance from point to // box (Note: most distances to boxes are computed using incremental // distance updates, not this function.) 
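The `annEnclCube` logic above (compute the enclosing rectangle, then grow every side symmetrically until all sides match the longest one) can be sketched with plain `std::vector` stand-ins for the ANN types; names here are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Grow per-dimension [lo, hi] bounds symmetrically so every side has the
// length of the longest one, turning a rectangle into a cube.
// Assumes lo.size() == hi.size().
void grow_to_cube(std::vector<double>& lo, std::vector<double>& hi) {
  double max_len = 0;
  for (std::size_t d = 0; d < lo.size(); ++d)
    max_len = std::max(max_len, hi[d] - lo[d]);
  for (std::size_t d = 0; d < lo.size(); ++d) {
    double half = (max_len - (hi[d] - lo[d])) / 2;  // slack split evenly
    lo[d] -= half;
    hi[d] += half;
  }
}
```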
//---------------------------------------------------------------------- ANNdist annBoxDistance( // compute distance from point to box const ANNpoint q, // the point const ANNpoint lo, // low point of box const ANNpoint hi, // high point of box int dim) // dimension of space { ANNdist dist = 0.0; // sum of squared distances ANNdist t; for (int d = 0; d < dim; d++) { if (q[d] < lo[d]) { // q is left of box t = ANNdist(lo[d]) - ANNdist(q[d]); dist = ANN_SUM(dist, ANN_POW(t)); } else if (q[d] > hi[d]) { // q is right of box t = ANNdist(q[d]) - ANNdist(hi[d]); dist = ANN_SUM(dist, ANN_POW(t)); } } ANN_FLOP(4*dim) // increment floating op count return dist; } //---------------------------------------------------------------------- // annSpread - find spread along given dimension // annMinMax - find min and max coordinates along given dimension // annMaxSpread - find dimension of max spread //---------------------------------------------------------------------- ANNcoord annSpread( // compute point spread along dimension ANNpointArray pa, // point array ANNidxArray pidx, // point indices int n, // number of points int d) // dimension to check { ANNcoord min = PA(0,d); // compute max and min coords ANNcoord max = PA(0,d); for (int i = 1; i < n; i++) { ANNcoord c = PA(i,d); if (c < min) min = c; else if (c > max) max = c; } return (max - min); // total spread is difference } void annMinMax( // compute min and max coordinates along dim ANNpointArray pa, // point array ANNidxArray pidx, // point indices int n, // number of points int d, // dimension to check ANNcoord &min, // minimum value (returned) ANNcoord &max) // maximum value (returned) { min = PA(0,d); // compute max and min coords max = PA(0,d); for (int i = 1; i < n; i++) { ANNcoord c = PA(i,d); if (c < min) min = c; else if (c > max) max = c; } } int annMaxSpread( // compute dimension of max spread ANNpointArray pa, // point array ANNidxArray pidx, // point indices int n, // number of points int dim) // dimension 
of space { int max_dim = 0; // dimension of max spread ANNcoord max_spr = 0; // amount of max spread if (n == 0) return max_dim; // no points, who cares? for (int d = 0; d < dim; d++) { // compute spread along each dim ANNcoord spr = annSpread(pa, pidx, n, d); if (spr > max_spr) { // bigger than current max max_spr = spr; max_dim = d; } } return max_dim; } //---------------------------------------------------------------------- // annMedianSplit - split point array about its median // Splits a subarray of points pa[0..n] about an element of given // rank (median: n_lo = n/2) with respect to dimension d. It places // the element of rank n_lo-1 correctly (because our splitting rule // takes the mean of these two). On exit, the array is permuted so // that: // // pa[0..n_lo-2][d] <= pa[n_lo-1][d] <= pa[n_lo][d] <= pa[n_lo+1..n-1][d]. // // The mean of pa[n_lo-1][d] and pa[n_lo][d] is returned as the // splitting value. // // All indexing is done indirectly through the index array pidx. // // This function uses the well known selection algorithm due to // C.A.R. Hoare. 
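The median split described above can be sketched with `std::nth_element` over an index array; ANN uses its own Hoare-style selection, but the effect (elements of rank `n_lo` and `n_lo-1` placed, their mean returned as the cutting value) is the same. This is an illustrative sketch, not the library's code, and it assumes `0 < n_lo < n`:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Indirect median split: points are addressed through the index array pidx,
// as in the PA(i,d) macro. Returns the cutting value.
double median_split(const std::vector<std::vector<double>>& pa,
                    std::vector<int>& pidx, int d, int n_lo) {
  auto less = [&](int a, int b) { return pa[a][d] < pa[b][d]; };
  // place the element of rank n_lo at position n_lo
  std::nth_element(pidx.begin(), pidx.begin() + n_lo, pidx.end(), less);
  // place the element of rank n_lo-1 correctly as well
  std::nth_element(pidx.begin(), pidx.begin() + (n_lo - 1),
                   pidx.begin() + n_lo, less);
  // cutting value is the mean of the two middle coordinates
  return (pa[pidx[n_lo - 1]][d] + pa[pidx[n_lo]][d]) / 2.0;
}
```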
//---------------------------------------------------------------------- // swap two points in pa array #define PASWAP(a,b) { int tmp = pidx[a]; pidx[a] = pidx[b]; pidx[b] = tmp; } void annMedianSplit( ANNpointArray pa, // points to split ANNidxArray pidx, // point indices int n, // number of points int d, // dimension along which to split ANNcoord &cv, // cutting value int n_lo) // split into n_lo and n-n_lo { int l = 0; // left end of current subarray int r = n-1; // right end of current subarray while (l < r) { int i = (r+l)/2; // select middle as pivot int k; if (PA(i,d) > PA(r,d)) // make sure last > pivot PASWAP(i,r) PASWAP(l,i); // move pivot to first position ANNcoord c = PA(l,d); // pivot value i = l; k = r; for(;;) { // pivot about c while (PA(++i,d) < c) ; while (PA(--k,d) > c) ; if (i < k) PASWAP(i,k) else break; } PASWAP(l,k); // pivot winds up in location k if (k > n_lo) r = k-1; // recurse on proper subarray else if (k < n_lo) l = k+1; else break; // got the median exactly } if (n_lo > 0) { // search for next smaller item ANNcoord c = PA(0,d); // candidate for max int k = 0; // candidate's index for (int i = 1; i < n_lo; i++) { if (PA(i,d) > c) { c = PA(i,d); k = i; } } PASWAP(n_lo-1, k); // max among pa[0..n_lo-1] to pa[n_lo-1] } // cut value is midpoint value cv = (PA(n_lo-1,d) + PA(n_lo,d))/2.0; } //---------------------------------------------------------------------- // annPlaneSplit - split point array about a cutting plane // Split the points in an array about a given plane along a // given cutting dimension. On exit, br1 and br2 are set so // that: // // pa[ 0 ..br1-1] < cv // pa[br1..br2-1] == cv // pa[br2.. n -1] > cv // // All indexing is done indirectly through the index array pidx. 
// //---------------------------------------------------------------------- void annPlaneSplit( // split points by a plane ANNpointArray pa, // points to split ANNidxArray pidx, // point indices int n, // number of points int d, // dimension along which to split ANNcoord cv, // cutting value int &br1, // first break (values < cv) int &br2) // second break (values == cv) { int l = 0; int r = n-1; for(;;) { // partition pa[0..n-1] about cv while (l < n && PA(l,d) < cv) l++; while (r >= 0 && PA(r,d) >= cv) r--; if (l > r) break; PASWAP(l,r); l++; r--; } br1 = l; // now: pa[0..br1-1] < cv <= pa[br1..n-1] r = n-1; for(;;) { // partition pa[br1..n-1] about cv while (l < n && PA(l,d) <= cv) l++; while (r >= br1 && PA(r,d) > cv) r--; if (l > r) break; PASWAP(l,r); l++; r--; } br2 = l; // now: pa[br1..br2-1] == cv < pa[br2..n-1] } //---------------------------------------------------------------------- // annBoxSplit - split point array about a orthogonal rectangle // Split the points in an array about a given orthogonal // rectangle. On exit, n_in is set to the number of points // that are inside (or on the boundary of) the rectangle. // // All indexing is done indirectly through the index array pidx. 
// //---------------------------------------------------------------------- void annBoxSplit( // split points by a box ANNpointArray pa, // points to split ANNidxArray pidx, // point indices int n, // number of points int dim, // dimension of space ANNorthRect &box, // the box int &n_in) // number of points inside (returned) { int l = 0; int r = n-1; for(;;) { // partition pa[0..n-1] about box while (l < n && box.inside(dim, PP(l))) l++; while (r >= 0 && !box.inside(dim, PP(r))) r--; if (l > r) break; PASWAP(l,r); l++; r--; } n_in = l; // now: pa[0..n_in-1] inside and rest outside } //---------------------------------------------------------------------- // annSplitBalance - compute balance factor for a given plane split // Balance factor is defined as the number of points lying // below the splitting value minus n/2 (median). Thus, a // median split has balance 0, left of this is negative and // right of this is positive. (The points are unchanged.) //---------------------------------------------------------------------- int annSplitBalance( // determine balance factor of a split ANNpointArray pa, // points to split ANNidxArray pidx, // point indices int n, // number of points int d, // dimension along which to split ANNcoord cv) // cutting value { int n_lo = 0; for(int i = 0; i < n; i++) { // count number less than cv if (PA(i,d) < cv) n_lo++; } return n_lo - n/2; } //---------------------------------------------------------------------- // annBox2Bnds - convert bounding box to list of bounds // Given two boxes, an inner box enclosed within a bounding // box, this routine determines all the sides for which the // inner box is strictly contained with the bounding box, // and adds an appropriate entry to a list of bounds. Then // we allocate storage for the final list of bounds, and return // the resulting list and its size. 
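The two-pass `annPlaneSplit` partition above has the same effect as two `std::partition` calls on a plain coordinate vector: first separate values below the cutting value, then, within the remainder, separate values equal to it. A sketch (illustrative names, not the library's code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Three-way split about cutting value cv. On exit:
//   v[0..br1-1] < cv, v[br1..br2-1] == cv, v[br2..n-1] > cv
void plane_split(std::vector<double>& v, double cv, int& br1, int& br2) {
  auto mid = std::partition(v.begin(), v.end(),
                            [cv](double x) { return x < cv; });
  auto hi  = std::partition(mid, v.end(),
                            [cv](double x) { return x == cv; });
  br1 = static_cast<int>(mid - v.begin());
  br2 = static_cast<int>(hi - v.begin());
}
```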
//---------------------------------------------------------------------- void annBox2Bnds( // convert inner box to bounds const ANNorthRect &inner_box, // inner box const ANNorthRect &bnd_box, // enclosing box int dim, // dimension of space int &n_bnds, // number of bounds (returned) ANNorthHSArray &bnds) // bounds array (returned) { int i; n_bnds = 0; // count number of bounds for (i = 0; i < dim; i++) { if (inner_box.lo[i] > bnd_box.lo[i]) // low bound is inside n_bnds++; if (inner_box.hi[i] < bnd_box.hi[i]) // high bound is inside n_bnds++; } bnds = new ANNorthHalfSpace[n_bnds]; // allocate appropriate size int j = 0; for (i = 0; i < dim; i++) { // fill the array if (inner_box.lo[i] > bnd_box.lo[i]) { bnds[j].cd = i; bnds[j].cv = inner_box.lo[i]; bnds[j].sd = +1; j++; } if (inner_box.hi[i] < bnd_box.hi[i]) { bnds[j].cd = i; bnds[j].cv = inner_box.hi[i]; bnds[j].sd = -1; j++; } } } //---------------------------------------------------------------------- // annBnds2Box - convert list of bounds to bounding box // Given an enclosing box and a list of bounds, this routine // computes the corresponding inner box. It is assumed that // the box points have been allocated already. 
//---------------------------------------------------------------------- void annBnds2Box( const ANNorthRect &bnd_box, // enclosing box int dim, // dimension of space int n_bnds, // number of bounds ANNorthHSArray bnds, // bounds array ANNorthRect &inner_box) // inner box (returned) { annAssignRect(dim, inner_box, bnd_box); // copy bounding box to inner for (int i = 0; i < n_bnds; i++) { bnds[i].project(inner_box.lo); // project each endpoint bnds[i].project(inner_box.hi); } } ================================================ FILE: src/ANN/kd_util.h ================================================ //---------------------------------------------------------------------- // File: kd_util.h // Programmer: Sunil Arya and David Mount // Description: Common utilities for kd- trees // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. 
//---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release //---------------------------------------------------------------------- #ifndef ANN_kd_util_H #define ANN_kd_util_H #include "kd_tree.h" // kd-tree declarations //---------------------------------------------------------------------- // externally accessible functions //---------------------------------------------------------------------- double annAspectRatio( // compute aspect ratio of box int dim, // dimension const ANNorthRect &bnd_box); // bounding cube void annEnclRect( // compute smallest enclosing rectangle ANNpointArray pa, // point array ANNidxArray pidx, // point indices int n, // number of points int dim, // dimension ANNorthRect &bnds); // bounding cube (returned) void annEnclCube( // compute smallest enclosing cube ANNpointArray pa, // point array ANNidxArray pidx, // point indices int n, // number of points int dim, // dimension ANNorthRect &bnds); // bounding cube (returned) ANNdist annBoxDistance( // compute distance from point to box const ANNpoint q, // the point const ANNpoint lo, // low point of box const ANNpoint hi, // high point of box int dim); // dimension of space ANNcoord annSpread( // compute point spread along dimension ANNpointArray pa, // point array ANNidxArray pidx, // point indices int n, // number of points int d); // dimension to check void annMinMax( // compute min and max coordinates along dim ANNpointArray pa, // point array ANNidxArray pidx, // point indices int n, // number of points int d, // dimension to check ANNcoord& min, // minimum value (returned) ANNcoord& max); // maximum value (returned) int annMaxSpread( // compute dimension of max spread ANNpointArray pa, // point array ANNidxArray pidx, // point indices int n, // number of points int dim); // dimension of space void annMedianSplit( // split points along median value ANNpointArray pa, // points to split ANNidxArray pidx, // point 
indices int n, // number of points int d, // dimension along which to split ANNcoord &cv, // cutting value int n_lo); // split into n_lo and n-n_lo void annPlaneSplit( // split points by a plane ANNpointArray pa, // points to split ANNidxArray pidx, // point indices int n, // number of points int d, // dimension along which to split ANNcoord cv, // cutting value int &br1, // first break (values < cv) int &br2); // second break (values == cv) void annBoxSplit( // split points by a box ANNpointArray pa, // points to split ANNidxArray pidx, // point indices int n, // number of points int dim, // dimension of space ANNorthRect &box, // the box int &n_in); // number of points inside (returned) int annSplitBalance( // determine balance factor of a split ANNpointArray pa, // points to split ANNidxArray pidx, // point indices int n, // number of points int d, // dimension along which to split ANNcoord cv); // cutting value void annBox2Bnds( // convert inner box to bounds const ANNorthRect &inner_box, // inner box const ANNorthRect &bnd_box, // enclosing box int dim, // dimension of space int &n_bnds, // number of bounds (returned) ANNorthHSArray &bnds); // bounds array (returned) void annBnds2Box( // convert bounds to inner box const ANNorthRect &bnd_box, // enclosing box int dim, // dimension of space int n_bnds, // number of bounds ANNorthHSArray bnds, // bounds array ANNorthRect &inner_box); // inner box (returned) #endif ================================================ FILE: src/ANN/perf.cpp ================================================ //---------------------------------------------------------------------- // File: perf.cpp // Programmer: Sunil Arya and David Mount // Description: Methods for performance stats // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. 
// // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release // Revision 1.0 04/01/05 // Changed names to avoid namespace conflicts. // Added flush after printing performance stats to fix bug // in Microsoft Windows version. //---------------------------------------------------------------------- #include "ANN.h" // basic ANN includes #include "ANNperf.h" // performance includes using namespace std; // make std:: available //---------------------------------------------------------------------- // Performance statistics // The following data and routines are used for computing // performance statistics for nearest neighbor searching. 
// Because these routines can slow the code down, they can be
// activated and deactivated by defining the PERF variable,
// by compiling with the option: -DPERF
//----------------------------------------------------------------------

//----------------------------------------------------------------------
//	Global counters for performance measurement
//----------------------------------------------------------------------

int ann_Ndata_pts  = 0;		// number of data points
int ann_Nvisit_lfs = 0;		// number of leaf nodes visited
int ann_Nvisit_spl = 0;		// number of splitting nodes visited
int ann_Nvisit_shr = 0;		// number of shrinking nodes visited
int ann_Nvisit_pts = 0;		// visited points for one query
int ann_Ncoord_hts = 0;		// coordinate hits for one query
int ann_Nfloat_ops = 0;		// floating ops for one query

ANNsampStat ann_visit_lfs;	// stats on leaf nodes visits
ANNsampStat ann_visit_spl;	// stats on splitting nodes visits
ANNsampStat ann_visit_shr;	// stats on shrinking nodes visits
ANNsampStat ann_visit_nds;	// stats on total nodes visits
ANNsampStat ann_visit_pts;	// stats on points visited
ANNsampStat ann_coord_hts;	// stats on coordinate hits
ANNsampStat ann_float_ops;	// stats on floating ops
//
ANNsampStat ann_average_err;	// average error
ANNsampStat ann_rank_err;	// rank error

//----------------------------------------------------------------------
//	Routines for statistics.
//---------------------------------------------------------------------- DLL_API void annResetStats(int data_size) // reset stats for a set of queries { ann_Ndata_pts = data_size; ann_visit_lfs.reset(); ann_visit_spl.reset(); ann_visit_shr.reset(); ann_visit_nds.reset(); ann_visit_pts.reset(); ann_coord_hts.reset(); ann_float_ops.reset(); ann_average_err.reset(); ann_rank_err.reset(); } DLL_API void annResetCounts() // reset counts for one query { ann_Nvisit_lfs = 0; ann_Nvisit_spl = 0; ann_Nvisit_shr = 0; ann_Nvisit_pts = 0; ann_Ncoord_hts = 0; ann_Nfloat_ops = 0; } DLL_API void annUpdateStats() // update stats with current counts { ann_visit_lfs += ann_Nvisit_lfs; ann_visit_nds += ann_Nvisit_spl + ann_Nvisit_lfs; ann_visit_spl += ann_Nvisit_spl; ann_visit_shr += ann_Nvisit_shr; ann_visit_pts += ann_Nvisit_pts; ann_coord_hts += ann_Ncoord_hts; ann_float_ops += ann_Nfloat_ops; } // print a single statistic void print_one_stat(const char *title, ANNsampStat s, double div) { //R does not allow: cout << title << "= [ "; //R does not allow: cout.width(9); cout << s.mean()/div << " : "; //R does not allow: cout.width(9); cout << s.stdDev()/div << " ]<"; //R does not allow: cout.width(9); cout << s.min()/div << " , "; //R does not allow: cout.width(9); cout << s.max()/div << " >\n"; } DLL_API void annPrintStats( // print statistics for a run ANNbool validate) // true if average errors desired { //R does not allow: cout.precision(4); // set floating precision //R does not allow: cout << " (Performance stats: " //R does not allow: << " [ mean : stddev ]< min , max >\n"; print_one_stat(" leaf_nodes ", ann_visit_lfs, 1); print_one_stat(" splitting_nodes ", ann_visit_spl, 1); print_one_stat(" shrinking_nodes ", ann_visit_shr, 1); print_one_stat(" total_nodes ", ann_visit_nds, 1); print_one_stat(" points_visited ", ann_visit_pts, 1); print_one_stat(" coord_hits/pt ", ann_coord_hts, ann_Ndata_pts); print_one_stat(" floating_ops_(K) ", ann_float_ops, 1000); if (validate) { 
print_one_stat(" average_error ", ann_average_err, 1); print_one_stat(" rank_error ", ann_rank_err, 1); } //R does not allow: cout.precision(0); // restore the default //R does not allow: cout << " )\n"; //R does not allow: cout.flush(); } ================================================ FILE: src/ANN/pr_queue.h ================================================ //---------------------------------------------------------------------- // File: pr_queue.h // Programmer: Sunil Arya and David Mount // Description: Include file for priority queue and related // structures. // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). See the // file ../ReadMe.txt for further information. // // The University of Maryland (U.M.) and the authors make no // representations about the suitability or fitness of this software for // any purpose. It is provided "as is" without express or implied // warranty. //---------------------------------------------------------------------- // History: // Revision 0.1 03/04/98 // Initial release //---------------------------------------------------------------------- #ifndef PR_QUEUE_H #define PR_QUEUE_H #include "ANNx.h" // all ANN includes #include "ANNperf.h" // performance evaluation //---------------------------------------------------------------------- // Basic types. 
//---------------------------------------------------------------------- typedef void *PQinfo; // info field is generic pointer typedef ANNdist PQkey; // key field is distance //---------------------------------------------------------------------- // Priority queue // A priority queue is a list of items, along with associated // priorities. The basic operations are insert and extract_minimum. // // The priority queue is maintained using a standard binary heap. // (Implementation note: Indexing is performed from [1..max] rather // than the C standard of [0..max-1]. This simplifies parent/child // computations.) User information consists of a void pointer, // and the user is responsible for casting this quantity into whatever // useful form is desired. // // Because the priority queue is so central to the efficiency of // query processing, all the code is inline. //---------------------------------------------------------------------- class ANNpr_queue { struct pq_node { // node in priority queue PQkey key; // key value PQinfo info; // info field }; int n; // number of items in queue int max_size; // maximum queue size pq_node *pq; // the priority queue (array of nodes) public: ANNpr_queue(int max) // constructor (given max size) { n = 0; // initially empty max_size = max; // maximum number of items pq = new pq_node[max+1]; // queue is array [1..max] of nodes } ~ANNpr_queue() // destructor { delete [] pq; } ANNbool empty() // is queue empty? { if (n==0) return ANNtrue; else return ANNfalse; } ANNbool non_empty() // is queue nonempty? 
{ if (n==0) return ANNfalse; else return ANNtrue; } void reset() // make existing queue empty { n = 0; } inline void insert( // insert item (inlined for speed) PQkey kv, // key value PQinfo inf) // item info { if (++n > max_size) annError("Priority queue overflow.", ANNabort); int r = n; while (r > 1) { // sift up new item int p = r/2; ANN_FLOP(1) // increment floating ops if (pq[p].key <= kv) // in proper order break; pq[r] = pq[p]; // else swap with parent r = p; } pq[r].key = kv; // insert new item at final location pq[r].info = inf; } inline void extr_min( // extract minimum (inlined for speed) PQkey &kv, // key (returned) PQinfo &inf) // item info (returned) { kv = pq[1].key; // key of min item inf = pq[1].info; // information of min item PQkey kn = pq[n--].key;// last item in queue int p = 1; // p points to item out of position int r = p<<1; // left child of p while (r <= n) { // while r is still within the heap ANN_FLOP(2) // increment floating ops // set r to smaller child of p if (r < n && pq[r].key > pq[r+1].key) r++; if (kn <= pq[r].key) // in proper order break; pq[p] = pq[r]; // else swap with child p = r; // advance pointers r = p<<1; } pq[p] = pq[n+1]; // insert last item in proper place } }; #endif ================================================ FILE: src/ANN/pr_queue_k.h ================================================ //---------------------------------------------------------------------- // File: pr_queue_k.h // Programmer: Sunil Arya and David Mount // Description: Include file for priority queue with k items. // Last modified: 01/04/05 (Version 1.0) //---------------------------------------------------------------------- // Copyright (c) 1997-2005 University of Maryland and Sunil Arya and // David Mount. All Rights Reserved. // // This software and related documentation is part of the Approximate // Nearest Neighbor Library (ANN). This software is provided under // the provisions of the Lesser GNU Public License (LGPL). 
See the
// file ../ReadMe.txt for further information.
//
// The University of Maryland (U.M.) and the authors make no
// representations about the suitability or fitness of this software for
// any purpose. It is provided "as is" without express or implied
// warranty.
//----------------------------------------------------------------------
// History:
//	Revision 0.1  03/04/98
//		Initial release
//----------------------------------------------------------------------

#ifndef PR_QUEUE_K_H
#define PR_QUEUE_K_H

#include "ANNx.h"			// all ANN includes
#include "ANNperf.h"			// performance evaluation

//----------------------------------------------------------------------
//	Basic types
//----------------------------------------------------------------------
typedef ANNdist	PQKkey;			// key field is distance
typedef int	PQKinfo;		// info field is int

//----------------------------------------------------------------------
//	Constants
//		The NULL key value is used to initialize the priority queue, and
//		so it should be larger than any valid distance, so that it will
//		be replaced as legal distance values are inserted. The NULL
//		info value must be a nonvalid array index, we use ANN_NULL_IDX,
//		which is guaranteed to be negative.
//----------------------------------------------------------------------

const PQKkey  PQ_NULL_KEY  = ANN_DIST_INF;	// nonexistent key value
const PQKinfo PQ_NULL_INFO = ANN_NULL_IDX;	// nonexistent info value

//----------------------------------------------------------------------
//	ANNmin_k
//		An ANNmin_k structure is one which maintains the smallest
//		k values (of type PQKkey) and associated information (of type
//		PQKinfo). The special info and key values PQ_NULL_INFO and
//		PQ_NULL_KEY mean that this entry is empty.
//
//		It is currently implemented using an array with k items.
//		Items are stored in increasing sorted order, and insertions
//		are made through standard insertion sort.
(This is quite // inefficient, but current applications call for small values // of k and relatively few insertions.) // // Note that the list contains k+1 entries, but the last entry // is used as a simple placeholder and is otherwise ignored. //---------------------------------------------------------------------- class ANNmin_k { struct mk_node { // node in min_k structure PQKkey key; // key value PQKinfo info; // info field (user defined) }; int k; // max number of keys to store int n; // number of keys currently active mk_node *mk; // the list itself public: ANNmin_k(int max) // constructor (given max size) { n = 0; // initially no items k = max; // maximum number of items mk = new mk_node[max+1]; // sorted array of keys } ~ANNmin_k() // destructor { delete [] mk; } PQKkey ANNmin_key() // return minimum key { return (n > 0 ? mk[0].key : PQ_NULL_KEY); } PQKkey max_key() // return maximum key { return (n == k ? mk[k-1].key : PQ_NULL_KEY); } PQKkey ith_smallest_key(int i) // ith smallest key (i in [0..n-1]) { return (i < n ? mk[i].key : PQ_NULL_KEY); } PQKinfo ith_smallest_info(int i) // info for ith smallest (i in [0..n-1]) { return (i < n ? mk[i].info : PQ_NULL_INFO); } inline void insert( // insert item (inlined for speed) PQKkey kv, // key value PQKinfo inf) // item info { int i; // slide larger values up for (i = n; i > 0; i--) { if (mk[i-1].key > kv) mk[i] = mk[i-1]; else break; } mk[i].key = kv; // store element here mk[i].info = inf; if (n < k) n++; // increment number of items ANN_FLOP(k-i+1) // increment floating ops } }; #endif ================================================ FILE: src/JP.cpp ================================================ //---------------------------------------------------------------------- // Jarvis-Patrick Clustering //---------------------------------------------------------------------- // Copyright (c) 2017 Michael Hahsler. All Rights Reserved. 
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector JP_int(IntegerMatrix nn, unsigned int kt) {
  R_xlen_t n = nn.nrow();

  // create label vector
  std::vector<int> label(n);
  //iota is C++11 only
  //std::iota(std::begin(label), std::end(label), 1); // Fill with 1, 2, ..., n.
  int value = 1;
  std::vector<int>::iterator first = label.begin(), last = label.end();
  while (first != last) *first++ = value++;

  // create sorted sets so we can use set operations
  std::vector< std::set<int> > nn_set(nn.nrow());
  IntegerVector r;
  std::vector<int> s;
  for (R_xlen_t i = 0; i < n; ++i) {
    r = nn(i, _);
    s = as< std::vector<int> >(r);
    nn_set[i].insert(s.begin(), s.end());
  }

  std::vector<int> z;
  std::set<int>::iterator it;
  R_xlen_t i, j;
  int newlabel, oldlabel;
  for (i = 0; i < n; ++i) {
    // check all neighbors of i
    for (it = nn_set[i].begin(); it != nn_set[i].end(); ++it) {
      j = *it - 1; // index in nn starts with 1

      // edge was already checked
      if (j < i) continue;

      // JP links require that i and j are in each other's kNN lists
      if (nn_set[j].find(i+1) == nn_set[j].end()) continue;

      // count the neighbors that i and j share
      z.clear();
      std::set_intersection(nn_set[i].begin(), nn_set[i].end(),
        nn_set[j].begin(), nn_set[j].end(),
        std::back_inserter(z));

      if (z.size() >= kt) {
        // update labels
        if (label[i] > label[j]) {
          newlabel = label[j]; oldlabel = label[i];
        } else {
          newlabel = label[i]; oldlabel = label[j];
        }

        for (int k = 0; k < n; ++k) {
          if (label[k] == oldlabel) label[k] = newlabel;
        }
      }
    }
  }

  return wrap(label);
}

// jp == true: use the definition by Jarvis-Patrick: A link is created between a pair of
// points, p and q, if and only if p and q have each other in their k-nearest neighbor lists.
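The link test used by `JP_int` can be reproduced outside of Rcpp. The sketch below (plain C++; the helper names `shared_neighbors` and `jp_linked` are illustrative, not part of the package) counts the overlap of two sorted neighbor sets with `std::set_intersection`, which is the same operation the function uses to decide whether two points are linked and their clusters merged:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Count how many neighbors two points share. Keeping each neighbor list
// as a sorted std::set<int> lets std::set_intersection run in a single
// linear pass over both sets.
static std::size_t shared_neighbors(const std::set<int>& a,
                                    const std::set<int>& b) {
  std::vector<int> z;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(z));
  return z.size();
}

// Jarvis-Patrick link test: two points are linked when they share at
// least kt nearest neighbors.
static bool jp_linked(const std::set<int>& a, const std::set<int>& b,
                      std::size_t kt) {
  return shared_neighbors(a, b) >= kt;
}
```

Sorting once and intersecting is what makes the pairwise test cheap; the expensive part of `JP_int` is the linear relabeling scan that merges clusters.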
// jp == false: just count the shared NNs = regular sNN

// [[Rcpp::export]]
IntegerMatrix SNN_sim_int(IntegerMatrix nn, LogicalVector jp) {
  R_xlen_t n = nn.nrow();
  R_xlen_t k = nn.ncol();

  IntegerMatrix snn(n, k);

  // create sorted sets so we can use set operations
  std::vector< std::set<int> > nn_set(n);
  IntegerVector r;
  std::vector<int> s;
  for (R_xlen_t i = 0; i < n; ++i) {
    r = nn(i, _);
    s = as< std::vector<int> >(r);
    nn_set[i].insert(s.begin(), s.end());
  }

  std::vector<int> z;
  int j;
  for (R_xlen_t i = 0; i < n; ++i) {
    // check all neighbors of i
    for (R_xlen_t j_ind = 0; j_ind < k; ++j_ind) {
      j = nn(i, j_ind) - 1;

      bool i_in_j = (nn_set[j].find(i+1) != nn_set[j].end());
      if (is_false(all(jp)) || i_in_j) {
        // calculate link strength as the number of shared points
        z.clear();
        std::set_intersection(nn_set[i].begin(), nn_set[i].end(),
          nn_set[j].begin(), nn_set[j].end(),
          std::back_inserter(z));

        snn(i, j_ind) = z.size();
        // +1 if i is in j
        if (i_in_j) snn(i, j_ind)++;
      } else snn(i, j_ind) = 0;
    }
  }

  return snn;
}

================================================
FILE: src/Makevars
================================================

# CXX_STD = CXX11

SOURCES = \
  ANN/perf.cpp ANN/bd_fix_rad_search.cpp ANN/bd_search.cpp \
  ANN/kd_split.cpp ANN/kd_pr_search.cpp ANN/kd_search.cpp \
  ANN/ANN.cpp ANN/brute.cpp ANN/bd_tree.cpp ANN/kd_fix_rad_search.cpp \
  ANN/bd_pr_search.cpp ANN/kd_util.cpp ANN/kd_tree.cpp ANN/kd_dump.cpp \
  utilities.cpp cleanup.cpp \
  kNN.cpp connectedComps.cpp \
  frNN.cpp regionQuery.cpp density.cpp \
  dbscan.cpp \
  optics.cpp \
  JP.cpp \
  hdbscan.cpp \
  dendrogram.cpp UnionFind.cpp \
  mrd.cpp \
  mst.cpp \
  lof.cpp \
  dbcv.cpp \
  RcppExports.cpp

OBJECTS = $(SOURCES:.cpp=.o)

================================================
FILE: src/RcppExports.cpp
================================================

// Generated by using Rcpp::compileAttributes() -> do not edit by hand
// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

#include <Rcpp.h>

using namespace Rcpp;

#ifdef RCPP_USE_GLOBAL_ROSTREAM
Rcpp::Rostream<true>& Rcpp::Rcout = Rcpp::Rcpp_cout_get();
Rcpp::Rostream<false>& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get();
#endif

// JP_int IntegerVector JP_int(IntegerMatrix nn, unsigned int kt); RcppExport SEXP _dbscan_JP_int(SEXP nnSEXP, SEXP ktSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< IntegerMatrix >::type nn(nnSEXP); Rcpp::traits::input_parameter< unsigned int >::type kt(ktSEXP); rcpp_result_gen = Rcpp::wrap(JP_int(nn, kt)); return rcpp_result_gen; END_RCPP } // SNN_sim_int IntegerMatrix SNN_sim_int(IntegerMatrix nn, LogicalVector jp); RcppExport SEXP _dbscan_SNN_sim_int(SEXP nnSEXP, SEXP jpSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< IntegerMatrix >::type nn(nnSEXP); Rcpp::traits::input_parameter< LogicalVector >::type jp(jpSEXP); rcpp_result_gen = Rcpp::wrap(SNN_sim_int(nn, jp)); return rcpp_result_gen; END_RCPP } // ANN_cleanup void ANN_cleanup(); RcppExport SEXP _dbscan_ANN_cleanup() { BEGIN_RCPP Rcpp::RNGScope rcpp_rngScope_gen; ANN_cleanup(); return R_NilValue; END_RCPP } // comps_kNN IntegerVector comps_kNN(IntegerMatrix nn, bool mutual); RcppExport SEXP _dbscan_comps_kNN(SEXP nnSEXP, SEXP mutualSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< IntegerMatrix >::type nn(nnSEXP); Rcpp::traits::input_parameter< bool >::type mutual(mutualSEXP); rcpp_result_gen = Rcpp::wrap(comps_kNN(nn, mutual)); return rcpp_result_gen; END_RCPP } // comps_frNN IntegerVector comps_frNN(List nn, bool mutual); RcppExport SEXP _dbscan_comps_frNN(SEXP nnSEXP, SEXP mutualSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type nn(nnSEXP); Rcpp::traits::input_parameter< bool >::type mutual(mutualSEXP); rcpp_result_gen = Rcpp::wrap(comps_frNN(nn, mutual)); return rcpp_result_gen; END_RCPP } // intToStr StringVector intToStr(IntegerVector iv); RcppExport
SEXP _dbscan_intToStr(SEXP ivSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< IntegerVector >::type iv(ivSEXP); rcpp_result_gen = Rcpp::wrap(intToStr(iv)); return rcpp_result_gen; END_RCPP } // dist_subset NumericVector dist_subset(const NumericVector& dist, IntegerVector idx); RcppExport SEXP _dbscan_dist_subset(SEXP distSEXP, SEXP idxSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const NumericVector& >::type dist(distSEXP); Rcpp::traits::input_parameter< IntegerVector >::type idx(idxSEXP); rcpp_result_gen = Rcpp::wrap(dist_subset(dist, idx)); return rcpp_result_gen; END_RCPP } // XOR Rcpp::LogicalVector XOR(Rcpp::LogicalVector lhs, Rcpp::LogicalVector rhs); RcppExport SEXP _dbscan_XOR(SEXP lhsSEXP, SEXP rhsSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::LogicalVector >::type lhs(lhsSEXP); Rcpp::traits::input_parameter< Rcpp::LogicalVector >::type rhs(rhsSEXP); rcpp_result_gen = Rcpp::wrap(XOR(lhs, rhs)); return rcpp_result_gen; END_RCPP } // dspc NumericMatrix dspc(const List& cl_idx, const List& internal_nodes, const IntegerVector& all_cl_ids, const NumericVector& mrd_dist); RcppExport SEXP _dbscan_dspc(SEXP cl_idxSEXP, SEXP internal_nodesSEXP, SEXP all_cl_idsSEXP, SEXP mrd_distSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const List& >::type cl_idx(cl_idxSEXP); Rcpp::traits::input_parameter< const List& >::type internal_nodes(internal_nodesSEXP); Rcpp::traits::input_parameter< const IntegerVector& >::type all_cl_ids(all_cl_idsSEXP); Rcpp::traits::input_parameter< const NumericVector& >::type mrd_dist(mrd_distSEXP); rcpp_result_gen = Rcpp::wrap(dspc(cl_idx, internal_nodes, all_cl_ids, mrd_dist)); return rcpp_result_gen; END_RCPP } // dbscan_int IntegerVector dbscan_int(NumericMatrix data, 
double eps, int minPts, NumericVector weights, int borderPoints, int type, int bucketSize, int splitRule, double approx, List frNN); RcppExport SEXP _dbscan_dbscan_int(SEXP dataSEXP, SEXP epsSEXP, SEXP minPtsSEXP, SEXP weightsSEXP, SEXP borderPointsSEXP, SEXP typeSEXP, SEXP bucketSizeSEXP, SEXP splitRuleSEXP, SEXP approxSEXP, SEXP frNNSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type data(dataSEXP); Rcpp::traits::input_parameter< double >::type eps(epsSEXP); Rcpp::traits::input_parameter< int >::type minPts(minPtsSEXP); Rcpp::traits::input_parameter< NumericVector >::type weights(weightsSEXP); Rcpp::traits::input_parameter< int >::type borderPoints(borderPointsSEXP); Rcpp::traits::input_parameter< int >::type type(typeSEXP); Rcpp::traits::input_parameter< int >::type bucketSize(bucketSizeSEXP); Rcpp::traits::input_parameter< int >::type splitRule(splitRuleSEXP); Rcpp::traits::input_parameter< double >::type approx(approxSEXP); Rcpp::traits::input_parameter< List >::type frNN(frNNSEXP); rcpp_result_gen = Rcpp::wrap(dbscan_int(data, eps, minPts, weights, borderPoints, type, bucketSize, splitRule, approx, frNN)); return rcpp_result_gen; END_RCPP } // reach_to_dendrogram List reach_to_dendrogram(const Rcpp::List reachability, const NumericVector pl_order); RcppExport SEXP _dbscan_reach_to_dendrogram(SEXP reachabilitySEXP, SEXP pl_orderSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Rcpp::List >::type reachability(reachabilitySEXP); Rcpp::traits::input_parameter< const NumericVector >::type pl_order(pl_orderSEXP); rcpp_result_gen = Rcpp::wrap(reach_to_dendrogram(reachability, pl_order)); return rcpp_result_gen; END_RCPP } // dendrogram_to_reach List dendrogram_to_reach(const Rcpp::List x); RcppExport SEXP _dbscan_dendrogram_to_reach(SEXP xSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope 
rcpp_rngScope_gen; Rcpp::traits::input_parameter< const Rcpp::List >::type x(xSEXP); rcpp_result_gen = Rcpp::wrap(dendrogram_to_reach(x)); return rcpp_result_gen; END_RCPP } // mst_to_dendrogram List mst_to_dendrogram(const NumericMatrix mst); RcppExport SEXP _dbscan_mst_to_dendrogram(SEXP mstSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const NumericMatrix >::type mst(mstSEXP); rcpp_result_gen = Rcpp::wrap(mst_to_dendrogram(mst)); return rcpp_result_gen; END_RCPP } // dbscan_density_int IntegerVector dbscan_density_int(NumericMatrix data, double eps, int type, int bucketSize, int splitRule, double approx); RcppExport SEXP _dbscan_dbscan_density_int(SEXP dataSEXP, SEXP epsSEXP, SEXP typeSEXP, SEXP bucketSizeSEXP, SEXP splitRuleSEXP, SEXP approxSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type data(dataSEXP); Rcpp::traits::input_parameter< double >::type eps(epsSEXP); Rcpp::traits::input_parameter< int >::type type(typeSEXP); Rcpp::traits::input_parameter< int >::type bucketSize(bucketSizeSEXP); Rcpp::traits::input_parameter< int >::type splitRule(splitRuleSEXP); Rcpp::traits::input_parameter< double >::type approx(approxSEXP); rcpp_result_gen = Rcpp::wrap(dbscan_density_int(data, eps, type, bucketSize, splitRule, approx)); return rcpp_result_gen; END_RCPP } // frNN_int List frNN_int(NumericMatrix data, double eps, int type, int bucketSize, int splitRule, double approx); RcppExport SEXP _dbscan_frNN_int(SEXP dataSEXP, SEXP epsSEXP, SEXP typeSEXP, SEXP bucketSizeSEXP, SEXP splitRuleSEXP, SEXP approxSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type data(dataSEXP); Rcpp::traits::input_parameter< double >::type eps(epsSEXP); Rcpp::traits::input_parameter< int >::type type(typeSEXP); Rcpp::traits::input_parameter< int >::type 
bucketSize(bucketSizeSEXP); Rcpp::traits::input_parameter< int >::type splitRule(splitRuleSEXP); Rcpp::traits::input_parameter< double >::type approx(approxSEXP); rcpp_result_gen = Rcpp::wrap(frNN_int(data, eps, type, bucketSize, splitRule, approx)); return rcpp_result_gen; END_RCPP } // frNN_query_int List frNN_query_int(NumericMatrix data, NumericMatrix query, double eps, int type, int bucketSize, int splitRule, double approx); RcppExport SEXP _dbscan_frNN_query_int(SEXP dataSEXP, SEXP querySEXP, SEXP epsSEXP, SEXP typeSEXP, SEXP bucketSizeSEXP, SEXP splitRuleSEXP, SEXP approxSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type data(dataSEXP); Rcpp::traits::input_parameter< NumericMatrix >::type query(querySEXP); Rcpp::traits::input_parameter< double >::type eps(epsSEXP); Rcpp::traits::input_parameter< int >::type type(typeSEXP); Rcpp::traits::input_parameter< int >::type bucketSize(bucketSizeSEXP); Rcpp::traits::input_parameter< int >::type splitRule(splitRuleSEXP); Rcpp::traits::input_parameter< double >::type approx(approxSEXP); rcpp_result_gen = Rcpp::wrap(frNN_query_int(data, query, eps, type, bucketSize, splitRule, approx)); return rcpp_result_gen; END_RCPP } // distToAdjacency List distToAdjacency(IntegerVector constraints, const int N); RcppExport SEXP _dbscan_distToAdjacency(SEXP constraintsSEXP, SEXP NSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< IntegerVector >::type constraints(constraintsSEXP); Rcpp::traits::input_parameter< const int >::type N(NSEXP); rcpp_result_gen = Rcpp::wrap(distToAdjacency(constraints, N)); return rcpp_result_gen; END_RCPP } // buildDendrogram List buildDendrogram(List hcl); RcppExport SEXP _dbscan_buildDendrogram(SEXP hclSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type hcl(hclSEXP); rcpp_result_gen 
= Rcpp::wrap(buildDendrogram(hcl)); return rcpp_result_gen; END_RCPP } // all_children IntegerVector all_children(List hier, int key, bool leaves_only); RcppExport SEXP _dbscan_all_children(SEXP hierSEXP, SEXP keySEXP, SEXP leaves_onlySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type hier(hierSEXP); Rcpp::traits::input_parameter< int >::type key(keySEXP); Rcpp::traits::input_parameter< bool >::type leaves_only(leaves_onlySEXP); rcpp_result_gen = Rcpp::wrap(all_children(hier, key, leaves_only)); return rcpp_result_gen; END_RCPP } // node_xy NumericMatrix node_xy(List cl_tree, List cl_hierarchy, int cid); RcppExport SEXP _dbscan_node_xy(SEXP cl_treeSEXP, SEXP cl_hierarchySEXP, SEXP cidSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type cl_tree(cl_treeSEXP); Rcpp::traits::input_parameter< List >::type cl_hierarchy(cl_hierarchySEXP); Rcpp::traits::input_parameter< int >::type cid(cidSEXP); rcpp_result_gen = Rcpp::wrap(node_xy(cl_tree, cl_hierarchy, cid)); return rcpp_result_gen; END_RCPP } // simplifiedTree List simplifiedTree(List cl_tree); RcppExport SEXP _dbscan_simplifiedTree(SEXP cl_treeSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type cl_tree(cl_treeSEXP); rcpp_result_gen = Rcpp::wrap(simplifiedTree(cl_tree)); return rcpp_result_gen; END_RCPP } // computeStability List computeStability(const List hcl, const int minPts, bool compute_glosh); RcppExport SEXP _dbscan_computeStability(SEXP hclSEXP, SEXP minPtsSEXP, SEXP compute_gloshSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const List >::type hcl(hclSEXP); Rcpp::traits::input_parameter< const int >::type minPts(minPtsSEXP); Rcpp::traits::input_parameter< bool >::type compute_glosh(compute_gloshSEXP); rcpp_result_gen = 
Rcpp::wrap(computeStability(hcl, minPts, compute_glosh)); return rcpp_result_gen; END_RCPP } // validateConstraintList List validateConstraintList(List& constraints, int n); RcppExport SEXP _dbscan_validateConstraintList(SEXP constraintsSEXP, SEXP nSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List& >::type constraints(constraintsSEXP); Rcpp::traits::input_parameter< int >::type n(nSEXP); rcpp_result_gen = Rcpp::wrap(validateConstraintList(constraints, n)); return rcpp_result_gen; END_RCPP } // computeVirtualNode double computeVirtualNode(IntegerVector noise, List constraints); RcppExport SEXP _dbscan_computeVirtualNode(SEXP noiseSEXP, SEXP constraintsSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< IntegerVector >::type noise(noiseSEXP); Rcpp::traits::input_parameter< List >::type constraints(constraintsSEXP); rcpp_result_gen = Rcpp::wrap(computeVirtualNode(noise, constraints)); return rcpp_result_gen; END_RCPP } // fosc NumericVector fosc(List cl_tree, std::string cid, std::list<int>& sc, List cl_hierarchy, bool prune_unstable_leaves, double cluster_selection_epsilon, const double alpha, bool useVirtual, const int n_constraints, List constraints); RcppExport SEXP _dbscan_fosc(SEXP cl_treeSEXP, SEXP cidSEXP, SEXP scSEXP, SEXP cl_hierarchySEXP, SEXP prune_unstable_leavesSEXP, SEXP cluster_selection_epsilonSEXP, SEXP alphaSEXP, SEXP useVirtualSEXP, SEXP n_constraintsSEXP, SEXP constraintsSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type cl_tree(cl_treeSEXP); Rcpp::traits::input_parameter< std::string >::type cid(cidSEXP); Rcpp::traits::input_parameter< std::list<int>& >::type sc(scSEXP); Rcpp::traits::input_parameter< List >::type cl_hierarchy(cl_hierarchySEXP); Rcpp::traits::input_parameter< bool >::type prune_unstable_leaves(prune_unstable_leavesSEXP);
Rcpp::traits::input_parameter< double >::type cluster_selection_epsilon(cluster_selection_epsilonSEXP); Rcpp::traits::input_parameter< const double >::type alpha(alphaSEXP); Rcpp::traits::input_parameter< bool >::type useVirtual(useVirtualSEXP); Rcpp::traits::input_parameter< const int >::type n_constraints(n_constraintsSEXP); Rcpp::traits::input_parameter< List >::type constraints(constraintsSEXP); rcpp_result_gen = Rcpp::wrap(fosc(cl_tree, cid, sc, cl_hierarchy, prune_unstable_leaves, cluster_selection_epsilon, alpha, useVirtual, n_constraints, constraints)); return rcpp_result_gen; END_RCPP } // extractUnsupervised List extractUnsupervised(List cl_tree, bool prune_unstable, double cluster_selection_epsilon); RcppExport SEXP _dbscan_extractUnsupervised(SEXP cl_treeSEXP, SEXP prune_unstableSEXP, SEXP cluster_selection_epsilonSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type cl_tree(cl_treeSEXP); Rcpp::traits::input_parameter< bool >::type prune_unstable(prune_unstableSEXP); Rcpp::traits::input_parameter< double >::type cluster_selection_epsilon(cluster_selection_epsilonSEXP); rcpp_result_gen = Rcpp::wrap(extractUnsupervised(cl_tree, prune_unstable, cluster_selection_epsilon)); return rcpp_result_gen; END_RCPP } // extractSemiSupervised List extractSemiSupervised(List cl_tree, List constraints, float alpha, bool prune_unstable_leaves, double cluster_selection_epsilon); RcppExport SEXP _dbscan_extractSemiSupervised(SEXP cl_treeSEXP, SEXP constraintsSEXP, SEXP alphaSEXP, SEXP prune_unstable_leavesSEXP, SEXP cluster_selection_epsilonSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< List >::type cl_tree(cl_treeSEXP); Rcpp::traits::input_parameter< List >::type constraints(constraintsSEXP); Rcpp::traits::input_parameter< float >::type alpha(alphaSEXP); Rcpp::traits::input_parameter< bool >::type 
prune_unstable_leaves(prune_unstable_leavesSEXP); Rcpp::traits::input_parameter< double >::type cluster_selection_epsilon(cluster_selection_epsilonSEXP); rcpp_result_gen = Rcpp::wrap(extractSemiSupervised(cl_tree, constraints, alpha, prune_unstable_leaves, cluster_selection_epsilon)); return rcpp_result_gen; END_RCPP } // kNN_query_int List kNN_query_int(NumericMatrix data, NumericMatrix query, int k, int type, int bucketSize, int splitRule, double approx); RcppExport SEXP _dbscan_kNN_query_int(SEXP dataSEXP, SEXP querySEXP, SEXP kSEXP, SEXP typeSEXP, SEXP bucketSizeSEXP, SEXP splitRuleSEXP, SEXP approxSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type data(dataSEXP); Rcpp::traits::input_parameter< NumericMatrix >::type query(querySEXP); Rcpp::traits::input_parameter< int >::type k(kSEXP); Rcpp::traits::input_parameter< int >::type type(typeSEXP); Rcpp::traits::input_parameter< int >::type bucketSize(bucketSizeSEXP); Rcpp::traits::input_parameter< int >::type splitRule(splitRuleSEXP); Rcpp::traits::input_parameter< double >::type approx(approxSEXP); rcpp_result_gen = Rcpp::wrap(kNN_query_int(data, query, k, type, bucketSize, splitRule, approx)); return rcpp_result_gen; END_RCPP } // kNN_int List kNN_int(NumericMatrix data, int k, int type, int bucketSize, int splitRule, double approx); RcppExport SEXP _dbscan_kNN_int(SEXP dataSEXP, SEXP kSEXP, SEXP typeSEXP, SEXP bucketSizeSEXP, SEXP splitRuleSEXP, SEXP approxSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type data(dataSEXP); Rcpp::traits::input_parameter< int >::type k(kSEXP); Rcpp::traits::input_parameter< int >::type type(typeSEXP); Rcpp::traits::input_parameter< int >::type bucketSize(bucketSizeSEXP); Rcpp::traits::input_parameter< int >::type splitRule(splitRuleSEXP); Rcpp::traits::input_parameter< double >::type approx(approxSEXP); 
rcpp_result_gen = Rcpp::wrap(kNN_int(data, k, type, bucketSize, splitRule, approx)); return rcpp_result_gen; END_RCPP } // lof_kNN List lof_kNN(NumericMatrix data, int minPts, int type, int bucketSize, int splitRule, double approx); RcppExport SEXP _dbscan_lof_kNN(SEXP dataSEXP, SEXP minPtsSEXP, SEXP typeSEXP, SEXP bucketSizeSEXP, SEXP splitRuleSEXP, SEXP approxSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type data(dataSEXP); Rcpp::traits::input_parameter< int >::type minPts(minPtsSEXP); Rcpp::traits::input_parameter< int >::type type(typeSEXP); Rcpp::traits::input_parameter< int >::type bucketSize(bucketSizeSEXP); Rcpp::traits::input_parameter< int >::type splitRule(splitRuleSEXP); Rcpp::traits::input_parameter< double >::type approx(approxSEXP); rcpp_result_gen = Rcpp::wrap(lof_kNN(data, minPts, type, bucketSize, splitRule, approx)); return rcpp_result_gen; END_RCPP } // mrd NumericVector mrd(NumericVector dm, NumericVector cd); RcppExport SEXP _dbscan_mrd(SEXP dmSEXP, SEXP cdSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericVector >::type dm(dmSEXP); Rcpp::traits::input_parameter< NumericVector >::type cd(cdSEXP); rcpp_result_gen = Rcpp::wrap(mrd(dm, cd)); return rcpp_result_gen; END_RCPP } // mst Rcpp::NumericMatrix mst(const NumericVector x_dist, const R_xlen_t n); RcppExport SEXP _dbscan_mst(SEXP x_distSEXP, SEXP nSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const NumericVector >::type x_dist(x_distSEXP); Rcpp::traits::input_parameter< const R_xlen_t >::type n(nSEXP); rcpp_result_gen = Rcpp::wrap(mst(x_dist, n)); return rcpp_result_gen; END_RCPP } // hclustMergeOrder List hclustMergeOrder(NumericMatrix mst, IntegerVector o); RcppExport SEXP _dbscan_hclustMergeOrder(SEXP mstSEXP, SEXP oSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; 
Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type mst(mstSEXP); Rcpp::traits::input_parameter< IntegerVector >::type o(oSEXP); rcpp_result_gen = Rcpp::wrap(hclustMergeOrder(mst, o)); return rcpp_result_gen; END_RCPP } // optics_int List optics_int(NumericMatrix data, double eps, int minPts, int type, int bucketSize, int splitRule, double approx, List frNN); RcppExport SEXP _dbscan_optics_int(SEXP dataSEXP, SEXP epsSEXP, SEXP minPtsSEXP, SEXP typeSEXP, SEXP bucketSizeSEXP, SEXP splitRuleSEXP, SEXP approxSEXP, SEXP frNNSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type data(dataSEXP); Rcpp::traits::input_parameter< double >::type eps(epsSEXP); Rcpp::traits::input_parameter< int >::type minPts(minPtsSEXP); Rcpp::traits::input_parameter< int >::type type(typeSEXP); Rcpp::traits::input_parameter< int >::type bucketSize(bucketSizeSEXP); Rcpp::traits::input_parameter< int >::type splitRule(splitRuleSEXP); Rcpp::traits::input_parameter< double >::type approx(approxSEXP); Rcpp::traits::input_parameter< List >::type frNN(frNNSEXP); rcpp_result_gen = Rcpp::wrap(optics_int(data, eps, minPts, type, bucketSize, splitRule, approx, frNN)); return rcpp_result_gen; END_RCPP } // lowerTri IntegerVector lowerTri(IntegerMatrix m); RcppExport SEXP _dbscan_lowerTri(SEXP mSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< IntegerMatrix >::type m(mSEXP); rcpp_result_gen = Rcpp::wrap(lowerTri(m)); return rcpp_result_gen; END_RCPP } static const R_CallMethodDef CallEntries[] = { {"_dbscan_JP_int", (DL_FUNC) &_dbscan_JP_int, 2}, {"_dbscan_SNN_sim_int", (DL_FUNC) &_dbscan_SNN_sim_int, 2}, {"_dbscan_ANN_cleanup", (DL_FUNC) &_dbscan_ANN_cleanup, 0}, {"_dbscan_comps_kNN", (DL_FUNC) &_dbscan_comps_kNN, 2}, {"_dbscan_comps_frNN", (DL_FUNC) &_dbscan_comps_frNN, 2}, {"_dbscan_intToStr", (DL_FUNC) &_dbscan_intToStr, 
1}, {"_dbscan_dist_subset", (DL_FUNC) &_dbscan_dist_subset, 2}, {"_dbscan_XOR", (DL_FUNC) &_dbscan_XOR, 2}, {"_dbscan_dspc", (DL_FUNC) &_dbscan_dspc, 4}, {"_dbscan_dbscan_int", (DL_FUNC) &_dbscan_dbscan_int, 10}, {"_dbscan_reach_to_dendrogram", (DL_FUNC) &_dbscan_reach_to_dendrogram, 2}, {"_dbscan_dendrogram_to_reach", (DL_FUNC) &_dbscan_dendrogram_to_reach, 1}, {"_dbscan_mst_to_dendrogram", (DL_FUNC) &_dbscan_mst_to_dendrogram, 1}, {"_dbscan_dbscan_density_int", (DL_FUNC) &_dbscan_dbscan_density_int, 6}, {"_dbscan_frNN_int", (DL_FUNC) &_dbscan_frNN_int, 6}, {"_dbscan_frNN_query_int", (DL_FUNC) &_dbscan_frNN_query_int, 7}, {"_dbscan_distToAdjacency", (DL_FUNC) &_dbscan_distToAdjacency, 2}, {"_dbscan_buildDendrogram", (DL_FUNC) &_dbscan_buildDendrogram, 1}, {"_dbscan_all_children", (DL_FUNC) &_dbscan_all_children, 3}, {"_dbscan_node_xy", (DL_FUNC) &_dbscan_node_xy, 3}, {"_dbscan_simplifiedTree", (DL_FUNC) &_dbscan_simplifiedTree, 1}, {"_dbscan_computeStability", (DL_FUNC) &_dbscan_computeStability, 3}, {"_dbscan_validateConstraintList", (DL_FUNC) &_dbscan_validateConstraintList, 2}, {"_dbscan_computeVirtualNode", (DL_FUNC) &_dbscan_computeVirtualNode, 2}, {"_dbscan_fosc", (DL_FUNC) &_dbscan_fosc, 10}, {"_dbscan_extractUnsupervised", (DL_FUNC) &_dbscan_extractUnsupervised, 3}, {"_dbscan_extractSemiSupervised", (DL_FUNC) &_dbscan_extractSemiSupervised, 5}, {"_dbscan_kNN_query_int", (DL_FUNC) &_dbscan_kNN_query_int, 7}, {"_dbscan_kNN_int", (DL_FUNC) &_dbscan_kNN_int, 6}, {"_dbscan_lof_kNN", (DL_FUNC) &_dbscan_lof_kNN, 6}, {"_dbscan_mrd", (DL_FUNC) &_dbscan_mrd, 2}, {"_dbscan_mst", (DL_FUNC) &_dbscan_mst, 2}, {"_dbscan_hclustMergeOrder", (DL_FUNC) &_dbscan_hclustMergeOrder, 2}, {"_dbscan_optics_int", (DL_FUNC) &_dbscan_optics_int, 8}, {"_dbscan_lowerTri", (DL_FUNC) &_dbscan_lowerTri, 1}, {NULL, NULL, 0} }; RcppExport void R_init_dbscan(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } 
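`RcppExports.cpp` ends by handing R a `R_CallMethodDef` table that maps each exported routine name to a function pointer and its argument count, then disables dynamic symbol lookup. The sketch below (plain C++, no R headers; `CallEntry`, `find_entry`, and the sample routines are illustrative names, not package code) shows the same NULL-terminated lookup-table pattern:

```cpp
#include <cstddef>
#include <cstring>

// Mirrors the shape of R's R_CallMethodDef: a routine name, a function
// pointer, and the number of arguments the routine expects.
typedef double (*fn_ptr)(double);

struct CallEntry {
  const char* name;
  fn_ptr fun;
  int numArgs;
};

static double twice(double x) { return 2.0 * x; }
static double square(double x) { return x * x; }

// A NULL-terminated registration table, like CallEntries[] above.
static const CallEntry entries[] = {
  {"twice", &twice, 1},
  {"square", &square, 1},
  {NULL, NULL, 0}
};

// Linear lookup by routine name; conceptually what resolving a
// registered native routine does.
static fn_ptr find_entry(const char* name) {
  for (const CallEntry* e = entries; e->name != NULL; ++e) {
    if (std::strcmp(e->name, name) == 0) return e->fun;
  }
  return NULL;
}
```

Registering routines this way (and calling `R_useDynamicSymbols(dll, FALSE)`) means R resolves `.Call` targets through the table instead of searching the shared library's symbol table.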
================================================
FILE: src/UnionFind.cpp
================================================

//----------------------------------------------------------------------
// Disjoint-set data structure
// File: union_find.cpp
//----------------------------------------------------------------------
// Copyright (c) 2016 Michael Hahsler, Matt Piekenbrock. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

// Class definition based on the data structure described here:
// https://en.wikipedia.org/wiki/Disjoint-set_data_structure

#include "UnionFind.h"

UnionFind::UnionFind(const int size) : parent(size), rank(size) {
  for (int i = 0; i < size; ++i) {
    parent[i] = i, rank[i] = 0;
  }
}

// Destructor not needed w/o dynamic allocation
UnionFind::~UnionFind() { }

void UnionFind::Union(const int x, const int y) {
  const int xRoot = Find(x);
  const int yRoot = Find(y);
  if (xRoot == yRoot) return;
  else if (rank[xRoot] > rank[yRoot]) parent[yRoot] = xRoot;
  else if (rank[xRoot] < rank[yRoot]) parent[xRoot] = yRoot;
  else if (rank[xRoot] == rank[yRoot]) {
    parent[yRoot] = parent[xRoot];
    rank[xRoot] = rank[xRoot] + 1;
  }
}

const int UnionFind::Find(const int x) {
  if (parent[x] == x) return x;
  else {
    parent[x] = Find(parent[x]); // path compression
    return parent[x];
  }
}

================================================
FILE: src/UnionFind.h
================================================

//----------------------------------------------------------------------
// Disjoint-set data structure
// File: union_find.h
//----------------------------------------------------------------------
// Copyright (c) 2016 Michael Hahsler, Matt Piekenbrock. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

// Class definition based on the data structure described here:
// https://en.wikipedia.org/wiki/Disjoint-set_data_structure

#ifndef UNIONFIND
#define UNIONFIND

#include <Rcpp.h>

using namespace Rcpp;

class UnionFind {
  Rcpp::IntegerVector parent;
  Rcpp::IntegerVector rank;

public:
  UnionFind(const int size);
  ~UnionFind();
  void Union(const int x, const int y);
  const int Find(const int x);
}; // class UnionFind

#endif

================================================
FILE: src/cleanup.cpp
================================================

//----------------------------------------------------------------------
// R interface to dbscan using the ANN library
//----------------------------------------------------------------------
// Copyright (c) 2015 Michael Hahsler. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#include <Rcpp.h>
#include "ANN/ANN.h"

using namespace Rcpp;

// [[Rcpp::export]]
void ANN_cleanup() {
  annClose();
}

================================================
FILE: src/connectedComps.cpp
================================================

//----------------------------------------------------------------------
// R interface to dbscan using the ANN library
//----------------------------------------------------------------------
// Copyright (c) 2015 Michael Hahsler. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#include <Rcpp.h>

using namespace Rcpp;

// Find connected components in kNN and frNN objects.
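The connected-components functions below merge components by rewriting every matching label in a linear scan; the `UnionFind` class above does the same bookkeeping with near-constant amortized cost per merge. A standalone sketch of that structure (plain `std::vector` instead of `Rcpp::IntegerVector`; the `DisjointSet` name is illustrative):

```cpp
#include <vector>

// Disjoint-set forest with union by rank and path compression,
// mirroring the package's UnionFind class but without Rcpp types.
class DisjointSet {
  std::vector<int> parent, rank_;
public:
  explicit DisjointSet(int size) : parent(size), rank_(size, 0) {
    for (int i = 0; i < size; ++i) parent[i] = i;
  }
  int find(int x) {
    if (parent[x] != x) parent[x] = find(parent[x]); // path compression
    return parent[x];
  }
  void unite(int x, int y) {
    int xr = find(x), yr = find(y);
    if (xr == yr) return;                       // already connected
    if (rank_[xr] > rank_[yr]) parent[yr] = xr; // attach shorter tree
    else if (rank_[xr] < rank_[yr]) parent[xr] = yr;
    else { parent[yr] = xr; rank_[xr]++; }      // equal ranks: tree grows
  }
};
```

Two points end up in the same component exactly when their roots agree, so a pass over all neighbor pairs followed by one `find` per point yields the component labels.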
// [[Rcpp::export]] IntegerVector comps_kNN(IntegerMatrix nn, bool mutual) { R_xlen_t n = nn.nrow(); // create label vector std::vector label(n); std::iota(std::begin(label), std::end(label), 1); // Fill with 1, 2, ..., n. //iota is C++11 only //int value = 1; //std::vector::iterator first = label.begin(), last = label.end(); //while(first != last) *first++ = value++; // create sorted sets so we can use set operations std::vector< std::set > nn_set(n); IntegerVector r; std::vector s; for(int i = 0; i < n; ++i) { r = na_omit(nn(i,_)); s = as >(r); nn_set[i].insert(s.begin(), s.end()); } std::set::iterator it; R_xlen_t i, j; int newlabel, oldlabel; for(i = 0; i < n; ++i) { // check all neighbors of i for (it = nn_set[i].begin(); it != nn_set[i].end(); ++it) { j = *it-1; // index in nn starts with 1 // edge was already checked //if(j label[j]) { newlabel = label[j]; oldlabel = label[i]; }else{ newlabel = label[i]; oldlabel = label[j]; } // relabel for(int k = 0; k < n; ++k) { if(label[k] == oldlabel) label[k] = newlabel; } } } } return wrap(label); } // [[Rcpp::export]] IntegerVector comps_frNN(List nn, bool mutual) { R_xlen_t n = nn.length(); // create label vector std::vector label(n); std::iota(std::begin(label), std::end(label), 1); // Fill with 1, 2, ..., n. 
  //iota is C++11 only
  //int value = 1;
  //std::vector<int>::iterator first = label.begin(), last = label.end();
  //while(first != last) *first++ = value++;

  // create sorted sets so we can use set operations
  std::vector< std::set<int> > nn_set(n);
  IntegerVector r;
  std::vector<int> s;
  for(R_xlen_t i = 0; i < n; ++i) {
    r = nn[i];
    s = as< std::vector<int> >(r);
    nn_set[i].insert(s.begin(), s.end());
  }

  std::set<int>::iterator it;
  R_xlen_t i, j;
  int newlabel, oldlabel;
  for(i = 0; i < n; ++i) {
    // check all neighbors of i
    for (it = nn_set[i].begin(); it != nn_set[i].end(); ++it) {
      j = *it-1; // index in nn starts with 1

      // edge was already checked
      //if(j < i) continue;

      // merge the components of i and j if they are connected
      // (for mutual, i also has to be a neighbor of j)
      if(label[i] != label[j] &&
          (!mutual || nn_set[j].find(i+1) != nn_set[j].end())) {
        // keep the smaller label for the merged component
        if(label[i] > label[j]) { newlabel = label[j]; oldlabel = label[i]; }
        else                    { newlabel = label[i]; oldlabel = label[j]; }

        // relabel
        for(int k = 0; k < n; ++k) {
          if(label[k] == oldlabel) label[k] = newlabel;
        }
      }
    }
  }

  return wrap(label);
}

================================================
FILE: src/dbcv.cpp
================================================
//----------------------------------------------------------------------
// DBSCAN
// File: dbcv.cpp
//----------------------------------------------------------------------
// Copyright (c) 2025 Michael Hahsler. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#include <Rcpp.h>

// Includes
#include "utilities.h"
#include "mst.h"
#include "ANN/ANN.h"
#include "kNN.h"
#include <unordered_map>
#include <string>

using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
StringVector intToStr(IntegerVector iv){
  StringVector res = StringVector(iv.length());
  int ci = 0;
  for (IntegerVector::iterator i = iv.begin(); i != iv.end(); ++i){
    res[ci++] = std::to_string(*i);
  }
  return(res);
}

std::unordered_map<std::string, double> toMap(List map){
  std::vector<std::string> keys = map.names();
  std::unordered_map<std::string, double> hash_map = std::unordered_map<std::string, double>();
  const int n = map.size();
  for (int i = 0; i < n; ++i){
    hash_map.emplace((std::string) keys.at(i), (double) map.at(i));
  }
  return(hash_map);
}

NumericVector retrieve(StringVector keys, std::unordered_map<std::string, double> map){
  int n = keys.size(), i = 0;
  NumericVector res = NumericVector(n);
  for (StringVector::iterator it = keys.begin(); it != keys.end(); ++it){
    res[i++] = map[as< std::string >(*it)];
  }
  return(res);
}

NumericVector dist_subset_arma(const NumericVector& dist, IntegerVector idx){
  // vec v1 = as<vec>(v1in);
  // uvec idx = as<uvec>(idxin) - 1;
  // vec subset = v1.elem(idx);
  // return(wrap(subset));
  return(NumericVector::create());
}

// Provides a fast way of extracting subsets of a dist object. Expects as input the full dist
// object to subset 'dist', and a (1-based!)
// integer vector 'idx' of the points to keep in the subset
// [[Rcpp::export]]
NumericVector dist_subset(const NumericVector& dist, IntegerVector idx){
  const int n = dist.attr("Size");
  const int cl_n = idx.length();
  NumericVector new_dist = Rcpp::no_init((cl_n * (cl_n - 1))/2);
  int ii = 0;
  for (IntegerVector::iterator i = idx.begin(); i != idx.end(); ++i){
    for (IntegerVector::iterator j = i; j != idx.end(); ++j){
      if (*i == *j) { continue; }
      const int ij_idx = LT_POS1(n, *i, *j);
      new_dist[ii++] = dist[ij_idx];
    }
  }
  new_dist.attr("Size") = cl_n;
  new_dist.attr("class") = "dist";
  return(new_dist);
}

// Returns true if a given distance is less than floating point precision
bool remove_zero(ANNdist cdist){
  return(cdist <= std::numeric_limits<ANNdist>::epsilon());
}

ANNdist inv_density(ANNdist cdist){
  return(1.0/cdist);
}

// // [[Rcpp::export]]
// List all_pts_core_sorted_dist(const NumericMatrix& sorted_dist, const List& cl, const int d, const bool squared){
//   // The all core dists to return
//   List all_core_res = List(cl.size());
//
//   // Do the kNN searches per cluster; note that k varies with the cluster
//   int i = 0;
//   for (List::const_iterator it = cl.begin(); it < cl.end(); ++it, ++i){
//     const IntegerVector& cl_pts = (*it);
//     const int k = cl_pts.length();
//
//     // Initial vector to record the per-point all core dists
//     NumericVector all_core_cl = Rcpp::no_init_vector(k);
//
//     // For each point in the cluster, get the all core points dist
//     int j = 0;
//     for (IntegerVector::const_iterator pt_id = cl_pts.begin(); pt_id != cl_pts.end(); ++pt_id, ++j){
//       const NumericMatrix::ConstColumn& knn_dist = sorted_dist.column((*pt_id) - 1);
//
//       // Calculate the all core points distance for this point
//       std::vector<ANNdist> ndists = std::vector<ANNdist>(knn_dist.begin(), knn_dist.begin()+k);
//       std::remove_if(ndists.begin(), ndists.end(), remove_zero);
//       std::transform(ndists.begin(), ndists.end(), ndists.begin(), [=](ANNdist cdist){ return std::pow(1.0/cdist, d); });
//       ANNdist
sum_inv_density = std::accumulate(ndists.begin(), ndists.end(), (ANNdist) 0.0); // double acdist = std::pow(sum_inv_density/(k - 1.0), -(1.0 / double(d))); // Apply all core points equation // all_core_cl[j] = acdist; // // return(List::create(_["ndists"] = acdist, _["denom"] = sum_inv_density/(k - 1.0), _["k"] = k)); // } // all_core_res[i] = all_core_cl; // } // return(all_core_res); // } // // [[Rcpp::export]] // List all_pts_core(const NumericMatrix& data, const List& cl, const bool squared){ // // copy data // int nrow = data.nrow(); // int ncol = data.ncol(); // ANNpointArray dataPts = annAllocPts(nrow, ncol); // for(int i = 0; i < nrow; i++){ // for(int j = 0; j < ncol; j++){ // (dataPts[i])[j] = data(i, j); // } // } // // // create kd-tree (1) or linear search structure (2) // ANNpointSet* kdTree = new ANNkd_tree(dataPts, nrow, ncol, 30, (ANNsplitRule) 5); // // // The all core dists to // List all_core_res = List(cl.size()); // // // Do the kNN searches per cluster; note that k varies with the cluster // int i = 0; // for (List::const_iterator it = cl.begin(); it < cl.end(); ++it, ++i){ // const IntegerVector& cl_pts = (*it); // const int k = cl_pts.length(); // // // Initial vector to record the per-point all core dists // NumericVector all_core_cl = Rcpp::no_init_vector(k); // // // For each point in the cluster, get the all core points dist // int j = 0; // ANNdistArray dists = new ANNdist[k]; // ANNidxArray nnIdx = new ANNidx[k]; // for (IntegerVector::const_iterator pt_id = cl_pts.begin(); pt_id != cl_pts.end(); ++pt_id, ++j){ // // Do the search // ANNpoint queryPt = dataPts[(*pt_id) - 1]; // use original data points // kdTree->annkSearch(queryPt, k, nnIdx, dists); // // // V2. 
// std::vector ndists = std::vector(dists, dists+k); // std::remove_if(ndists.begin(), ndists.end(), remove_zero); // std::transform(ndists.begin(), ndists.end(), ndists.begin(), [=](ANNdist cdist){ return std::pow(1.0/cdist, ncol); }); // ANNdist sum_inv_density = std::accumulate(ndists.begin(), ndists.end(), (ANNdist) 0.0); // double acdist = std::pow(sum_inv_density/(k - 1.0), -(1.0 / double(ncol))); // Apply all core points equation // all_core_cl[j] = acdist; // // return(List::create(_["ndists"] = acdist, _["denom"] = sum_inv_density/(k - 1.0), _["k"] = k)); // } // delete [] dists; // delete [] nnIdx; // all_core_res[i] = all_core_cl; // } // // // cleanup // delete kdTree; // annDeallocPts(dataPts); // annClose(); // // // Return the all point core distance // if(!squared){ for (int i = 0; i < cl.size(); ++i){ all_core_res[i] = Rcpp::sqrt(all_core_res[i]); } } // return(all_core_res); // } // NumericVector all_pts_core(const NumericVector& dist, IntegerVector cl, const int d){ // const int n = dist.attr("Size"); // const int cl_n = cl.length(); // NumericVector all_pts_cd = NumericVector(cl_n); // NumericVector tmp = NumericVector(cl_n); // int knn_i = 0, ii = 0; // for (IntegerVector::iterator i = cl.begin(); i != cl.end(); ++i){ // for (IntegerVector::iterator j = cl.begin(); j != cl.end(); ++j){ // if (*i == *j) { continue; } // const int idx = INDEX_TF(n, (*i < *j ? *i : *j) - 1, (*i < *j ? *j : *i) - 1); // double dist_ij = dist[idx]; // tmp[knn_i++] = 1.0 / (dist_ij == 0.0 ? std::numeric_limits::epsilon() : dist_ij); // } // all_pts_cd[ii++] = pow(sum(pow(tmp, d))/(cl_n - 1.0), -(1.0 / d)); // knn_i = 0; // } // return(all_pts_cd); // } // RCPP does not provide xor! 
// [[Rcpp::export]]
Rcpp::LogicalVector XOR(Rcpp::LogicalVector lhs, Rcpp::LogicalVector rhs) {
  R_xlen_t i = 0, n = lhs.size();
  Rcpp::LogicalVector result(n);
  for ( ; i < n; i++) {
    result[i] = (lhs[i] ^ rhs[i]);
  }
  return result;
}

// [[Rcpp::export]]
NumericMatrix dspc(const List& cl_idx, const List& internal_nodes,
                   const IntegerVector& all_cl_ids, const NumericVector& mrd_dist) {
  // Setup variables
  const int ncl = cl_idx.length(); // number of clusters
  NumericMatrix res = Rcpp::no_init_matrix((ncl * (ncl - 1))/2, 3); // resulting separation measures

  // Loop through cluster combinations, and for each combination
  int c = 0;
  double min_edge = std::numeric_limits<double>::infinity();
  for (int ci = 0; ci < ncl; ++ci) {
    for (int cj = (ci+1); cj < ncl; ++cj){
      Rcpp::checkUserInterrupt();

      // Do lots of indexing to get the relative indexes corresponding to internal nodes
      const IntegerVector i_idx = internal_nodes[ci], j_idx = internal_nodes[cj]; // i and j cluster point indices

      // ignore clusters with no internal nodes!
      // -> get infinity for minimum edge
      // this leads to a NaN and should not happen in this implementation since
      // we have already filtered out clusters of size < 3
      if(i_idx.length() > 1 || j_idx.length() > 1) {
        const IntegerVector rel_i_idx = match(as<IntegerVector>(cl_idx[ci]), all_cl_ids)[i_idx - 1];
        const IntegerVector rel_j_idx = match(as<IntegerVector>(cl_idx[cj]), all_cl_ids)[j_idx - 1];
        IntegerVector int_idx = combine(rel_i_idx, rel_j_idx);

        // Get the pairwise MST
        NumericMatrix pairwise_mst = mst(dist_subset(mrd_dist, int_idx), int_idx.length());

        // Do lots of indexing / casting
        const IntegerVector from_int = seq_len(rel_i_idx.length());
        const NumericVector from_idx = as<NumericVector>(from_int);
        const NumericVector from = pairwise_mst.column(0), to = pairwise_mst.column(1),
                            height = pairwise_mst.column(2);

        // Find which distances in the MST cross to both clusters
        LogicalVector cross_edges = XOR(Rcpp::in(from, from_idx), Rcpp::in(to, from_idx));

        // The minimum weighted edge of these cross edges is the density separation
        // between the two clusters
        min_edge = min(as<NumericVector>(height[cross_edges]));
      }

      // Save the minimum edge
      res(c++, _) = NumericVector::create(ci+1, cj+1, min_edge);
      min_edge = std::numeric_limits<double>::infinity();
    }
  }
  return(res);
}

// Density Separation code
// NumericMatrix dspc(List config, const NumericVector& xdist) {
//
//   // Load configuration from list
//   const int n = config["n"];
//   const int ncl = config["ncl"];
//   const int n_pairs = config["n_pairs"];
//   List node_ids = config["node_ids"];
//   List acp = config["acp"];
//
//   // Conversions and basic setup
//   std::unordered_map<std::string, double> acp_map = toMap(acp);
//   double min_mrd = std::numeric_limits<double>::infinity();
//   NumericMatrix min_mrd_dist = NumericMatrix(n_pairs, 3);
//
//   // Loop through cluster combinations, and for each combination
//   int c = 0;
//   for (int ci = 0; ci < ncl; ++ci) {
//     for (int cj = (ci+1); cj < ncl; ++cj){
//       Rcpp::checkUserInterrupt();
//       IntegerVector i_idx = node_ids[ci], j_idx = node_ids[cj]; // i and j cluster point indices
//
for (IntegerVector::iterator i = i_idx.begin(); i != i_idx.end(); ++i){ // for (IntegerVector::iterator j = j_idx.begin(); j != j_idx.end(); ++j){ // const int lhs = *i < *j ? *i : *j, rhs = *i < *j ? *j : *i; // double dist_ij = xdist[INDEX_TF(n, lhs - 1, rhs - 1)]; // dist(p_i, p_j) // double acd_i = acp_map[std::to_string(*i)]; // all core distance for p_i // double acd_j = acp_map[std::to_string(*j)]; // all core distance for p_i // double mrd_ij = std::max(std::max(acd_i, acd_j), dist_ij); // mutual reachability distance of the pair // if (mrd_ij < min_mrd){ // min_mrd = mrd_ij; // } // } // } // min_mrd_dist(c++, _) = NumericVector::create(ci+1, cj+1, min_mrd); // min_mrd = std::numeric_limits::infinity(); // } // } // return(min_mrd_dist); // } ================================================ FILE: src/dbscan.cpp ================================================ //---------------------------------------------------------------------- // DBSCAN // File: R_dbscan.cpp //---------------------------------------------------------------------- // Copyright (c) 2015 Michael Hahsler. All Rights Reserved. 
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#include <Rcpp.h>
#include "ANN/ANN.h"
#include "regionQuery.h"

using namespace Rcpp;

// call this with either
// * data and epsilon and an empty frNN list
// or
// * empty data and a frNN id list (including selfmatches and using C numbering)
// [[Rcpp::export]]
IntegerVector dbscan_int(
    NumericMatrix data, double eps, int minPts, NumericVector weights,
    int borderPoints, int type, int bucketSize, int splitRule, double approx,
    List frNN) {

  // kd-tree uses squared distances
  double eps2 = eps*eps;

  bool weighted = FALSE;
  double Nweight = 0.0;
  ANNpointSet* kdTree = NULL;
  ANNpointArray dataPts = NULL;
  int nrow = NA_INTEGER;
  int ncol = NA_INTEGER;

  if(frNN.size()) {
    // no kd-tree but use frNN list from distances
    nrow = frNN.size();
  }else{
    // copy data for kd-tree
    nrow = data.nrow();
    ncol = data.ncol();
    dataPts = annAllocPts(nrow, ncol);
    for (int i = 0; i < nrow; i++){
      for (int j = 0; j < ncol; j++){
        (dataPts[i])[j] = data(i, j);
      }
    }
    //Rprintf("Points copied.\n");

    // create kd-tree (1) or linear search structure (2)
    if (type==1) kdTree = new ANNkd_tree(dataPts, nrow, ncol, bucketSize,
      (ANNsplitRule) splitRule);
    else kdTree = new ANNbruteForce(dataPts, nrow, ncol);
    //Rprintf("kd-tree ready. starting DBSCAN.\n");
  }

  if (weights.size() != 0) {
    if (weights.size() != nrow)
      stop("length of weights vector is incompatible with data.");
    weighted = TRUE;
  }

  // DBSCAN
  std::vector<bool> visited(nrow, false);
  std::vector< std::vector<int> > clusters; // vector of vectors == list
  std::vector<int> N, N2;

  for (int i = 0; i < nrow; i++) {
    if (visited[i]) continue;

    if (frNN.size()) N = Rcpp::as< std::vector<int> >(frNN[i]);
    else N = regionQuery(i, dataPts, kdTree, eps2, approx);

    // noise points stay unassigned for now
    //if (weighted) Nweight = sum(weights[IntegerVector(N.begin(), N.end())]) +
    if (weighted) {
      // This should work, but Rcpp has a problem with the sugar expression!
      // Assigning the subselection forces it to be materialized.
      // Nweight = sum(weights[IntegerVector(N.begin(), N.end())]) +
      //   weights[i];
      NumericVector w = weights[IntegerVector(N.begin(), N.end())];
      Nweight = sum(w);
    } else Nweight = N.size();

    if (Nweight < minPts) continue;

    // start new cluster and expand
    std::vector<int> cluster;
    cluster.push_back(i);
    visited[i] = true;

    while (!N.empty()) {
      int j = N.back();
      N.pop_back();
      if (visited[j]) continue; // point already processed
      visited[j] = true;

      //N2 = regionQuery(j, dataPts, kdTree, eps2, approx);
      if(frNN.size()) N2 = Rcpp::as< std::vector<int> >(frNN[j]);
      else N2 = regionQuery(j, dataPts, kdTree, eps2, approx);

      if (weighted) {
        // Nweight = sum(weights(NumericVector(N2.begin(), N2.end())) +
        //   weights[j]
        NumericVector w = weights[IntegerVector(N2.begin(), N2.end())];
        Nweight = sum(w);
      } else Nweight = N2.size();

      if (Nweight >= minPts) {
        // expand neighborhood
        // this is faster than set_union and does not need sort! visited takes
        // care of duplicates.
        std::copy(N2.begin(), N2.end(), std::back_inserter(N));
      }

      // for DBSCAN* (borderPoints==FALSE) border points are considered noise
      if(Nweight >= minPts || borderPoints) cluster.push_back(j);
    }

    // add cluster to list
    clusters.push_back(cluster);
  }

  // prepare cluster vector
  // unassigned points are noise (cluster 0)
  IntegerVector id(nrow, 0);
  for (std::size_t i = 0; i < clusters.size(); i++) {
    for (std::size_t j = 0; j < clusters[i].size(); j++) {
      id[clusters[i][j]] = i+1;
    }
  }

  // cleanup
  if (kdTree != NULL) delete kdTree;
  if (dataPts != NULL) annDeallocPts(dataPts);

  return wrap(id);
}

================================================
FILE: src/dendrogram.cpp
================================================

#include <Rcpp.h>
#include <string>
#include <algorithm>
#include "UnionFind.h"

using namespace Rcpp;

// Ditto with atoi!
int fast_atoi( const char * str ) { int val = 0; while( *str ) { val = val*10 + (*str++ - '0'); } return val; } int which_int(IntegerVector x, int target) { int size = (int) x.size(); for (int i = 0; i < size; ++i) { if (x(i) == target) return(i); } return(-1); } // [[Rcpp::export]] List reach_to_dendrogram(const Rcpp::List reachability, const NumericVector pl_order) { // Set up sorted reachability distance NumericVector pl = Rcpp::clone(as(reachability["reachdist"])).sort(); // Get 0-based order IntegerVector order = Rcpp::clone(as(reachability["order"])) - 1; /// Initialize disjoint-set structure int n_nodes = order.size(); UnionFind uf((size_t) n_nodes); // Create leaves List dendrogram(n_nodes); for (int i = 0; i < n_nodes; ++i) { IntegerVector leaf = IntegerVector(); leaf.push_back(i+1); leaf.attr("label") = std::to_string(i + 1); leaf.attr("members") = 1; leaf.attr("height") = 0; leaf.attr("leaf") = true; dendrogram.at(i) = leaf; } // Precompute the q order IntegerVector q_order(n_nodes); for (int i = 0; i < n_nodes - 1; ++i) { q_order.at(i) = order(which_int(order, pl_order(i)) - 1); } // Get the index of the point with next smallest reach dist and its neighbor IntegerVector members(n_nodes, 1); int insert = 0, p = 0, q = 0, p_i = 0, q_i = 0; for (int i = 0; i < (n_nodes-1); ++i) { p = pl_order(i); q = q_order(i); // left neighbor in ordering if (q == -1) { stop("Left neighbor not found"); } // Get the actual index of the branch(es) containing the p and q p_i = uf.Find(p), q_i = uf.Find(q); List branch = List::create(dendrogram.at(q_i), dendrogram.at(p_i)); // generic proxy blocks attr access for mixed types, so keep track of members manually! 
branch.attr("members") = members.at(p_i) + members.at(q_i); branch.attr("height") = pl(i); branch.attr("class") = "dendrogram"; // Merge the two, retrieving the new index uf.Union(p_i, q_i); insert = uf.Find(q_i); // q because q_branch is first in the new branch // Update members reference and insert the branch members.at(insert) = branch.attr("members"); dendrogram.at(insert) = branch; } return(dendrogram.at(insert)); } int DFS(List d, List& rp, int pnode, NumericVector stack) { if (d.hasAttribute("leaf")) { // If at a leaf node, compare to previous node std::string leaf_label = as( d.attr("label") ); rp[leaf_label] = stack; // Record the ancestors reachability values std::string pnode_label = std::to_string(pnode); double new_reach = 0.0f; if(!rp.containsElementNamed(pnode_label.c_str())) { // 1st time seeing this point new_reach = INFINITY; } else { // Smallest Common Ancestor NumericVector reachdist_p = rp[pnode_label]; new_reach = min(intersect(stack, reachdist_p)); } NumericVector reachdist = rp["reachdist"]; IntegerVector order = rp["order"]; reachdist.push_back(new_reach); int res = fast_atoi(leaf_label.c_str()); order.push_back(res); rp["order"] = order; rp["reachdist"] = reachdist; return(res); } else { double cheight = d.attr("height"); stack.push_back(cheight); List left = d[0]; // Recursively go left, recording the reachability distances on the stack pnode = DFS(left, rp, pnode, stack); if (d.length() > 1) { for (int sub_branch = 1; sub_branch < d.length(); ++sub_branch) { pnode = DFS(d[sub_branch], rp, pnode, stack); // pnode; } } return(pnode); } } // [[Rcpp::export]] List dendrogram_to_reach(const Rcpp::List x) { Rcpp::List rp = List::create(_["order"] = IntegerVector::create(), _["reachdist"] = NumericVector::create()); NumericVector stack = NumericVector::create(); DFS(x, rp, 0, stack); List res = List::create(_["reachdist"] = rp["reachdist"], _["order"] = rp["order"]); res.attr("class") = "reachability"; return(res); } // [[Rcpp::export]] List 
mst_to_dendrogram(const NumericMatrix mst) { // Set up sorted vector values NumericVector p_order = mst(_, 0); NumericVector q_order = mst(_, 1); NumericVector dist = mst(_, 2); int n_nodes = p_order.length() + 1; // Make sure to clone so as to not make changes by reference p_order = Rcpp::clone(p_order); q_order = Rcpp::clone(q_order); // UnionFind data structure for fast agglomerative building UnionFind uf((size_t) n_nodes); // Create leaves List dendrogram(n_nodes); for (int i = 0; i < n_nodes; ++i) { IntegerVector leaf = IntegerVector(); leaf.push_back(i+1); leaf.attr("label") = std::to_string(i + 1); leaf.attr("members") = 1; leaf.attr("height") = 0; leaf.attr("leaf") = true; dendrogram.at(i) = leaf; } // Get the index of the point with next smallest reach dist and its neighbor IntegerVector members(n_nodes, 1); int insert = 0, p = 0, q = 0, p_i = 0, q_i = 0; for (int i = 0; i < (n_nodes-1); ++i) { p = p_order(i), q = q_order(i); // Get the actual index of the branch(es) containing the p and q p_i = uf.Find(p), q_i = uf.Find(q); // Merge the two, retrieving the new index uf.Union(p_i, q_i); List branch = List::create(dendrogram.at(q_i), dendrogram.at(p_i)); insert = uf.Find(q_i); // q because q_branch is first in the new branch // Update members in the branch int tmp_members = members.at(p_i) + members.at(q_i); // Branches with equivalent distances are merged simultaneously while((i + 1) < (n_nodes-1) && dist(i + 1) == dist(i)){ i += 1; p = p_order(i), q = q_order(i); p_i = uf.Find(p), q_i = uf.Find(q); // Merge the branches, update current insert index int insert2 = uf.Find(q_i); branch.push_back(insert == insert2 ? dendrogram.at(p_i) : dendrogram.at(q_i)); tmp_members += insert == insert2 ? members.at(p_i) : members.at(q_i); uf.Union(p_i, q_i); insert = uf.Find(q_i); } // Generic proxy blocks attr access for mixed types, so need to keep track of members manually! 
branch.attr("height") = dist(i); branch.attr("class") = "dendrogram"; branch.attr("members") = tmp_members; // Update members reference and insert the branch members.at(insert) = branch.attr("members"); dendrogram.at(insert) = branch; } return(dendrogram.at(insert)); } ================================================ FILE: src/density.cpp ================================================ //---------------------------------------------------------------------- // DBSCAN density //---------------------------------------------------------------------- // Copyright (c) 2015 Michael Hahsler. All Rights Reserved. // // This software is provided under the provisions of the // GNU General Public License (GPL) Version 3 // (see: http://www.gnu.org/licenses/gpl-3.0.en.html) #include #include "ANN/ANN.h" #include "regionQuery.h" using namespace Rcpp; // faster implementation of counting point densities from a matrix // using a kd-tree // [[Rcpp::export]] IntegerVector dbscan_density_int( NumericMatrix data, double eps, int type, int bucketSize, int splitRule, double approx) { // kd-tree uses squared distances double eps2 = eps*eps; ANNpointSet* kdTree = NULL; ANNpointArray dataPts = NULL; int nrow = NA_INTEGER; int ncol= NA_INTEGER; // copy data for kd-tree nrow = data.nrow(); ncol = data.ncol(); dataPts = annAllocPts(nrow, ncol); for (int i = 0; i < nrow; i++){ for (int j = 0; j < ncol; j++){ (dataPts[i])[j] = data(i, j); } } //Rprintf("Points copied.\n"); // create kd-tree (1) or linear search structure (2) if (type==1) kdTree = new ANNkd_tree(dataPts, nrow, ncol, bucketSize, (ANNsplitRule) splitRule); else kdTree = new ANNbruteForce(dataPts, nrow, ncol); //Rprintf("kd-tree ready. 
starting DBSCAN.\n"); std::vector N; IntegerVector count(nrow); for (int i=0; i #include "ANN/ANN.h" #include "regionQuery.h" using namespace Rcpp; // [[Rcpp::export]] List frNN_int(NumericMatrix data, double eps, int type, int bucketSize, int splitRule, double approx) { // kd-tree uses squared distances double eps2 = eps*eps; // copy data int nrow = data.nrow(); int ncol = data.ncol(); ANNpointArray dataPts = annAllocPts(nrow, ncol); for(int i = 0; i < nrow; i++){ for(int j = 0; j < ncol; j++){ (dataPts[i])[j] = data(i, j); } } //Rprintf("Points copied.\n"); // create kd-tree (1) or linear search structure (2) ANNpointSet* kdTree = NULL; if (type==1){ kdTree = new ANNkd_tree(dataPts, nrow, ncol, bucketSize, (ANNsplitRule) splitRule); } else{ kdTree = new ANNbruteForce(dataPts, nrow, ncol); } //Rprintf("kd-tree ready. starting DBSCAN.\n"); // frNN //std::vector< IntegerVector > id; id.resize(nrow); //std::vector< NumericVector > dist; dist.resize(nrow); List id(nrow); List dist(nrow); for (int p=0; p(), 1 ) ); // take sqrt of distance since the tree stores d^2 //std::transform(N.second.begin(), N.second.end(), // N.second.begin(), static_cast(std::sqrt)); IntegerVector ids = IntegerVector(N.first.begin(), N.first.end()); NumericVector dists = NumericVector(N.second.begin(), N.second.end()); // remove self matches LogicalVector take = ids != p; ids = ids[take]; dists = dists[take]; //Rprintf("Found neighborhood size %d\n", ids.size()); id[p] = ids+1; dist[p] = sqrt(dists); } // cleanup delete kdTree; annDeallocPts(dataPts); // annClose(); is now done globally in the package // prepare results List ret; ret["dist"] = dist; ret["id"] = id; ret["eps"] = eps; return ret; } // [[Rcpp::export]] List frNN_query_int(NumericMatrix data, NumericMatrix query, double eps, int type, int bucketSize, int splitRule, double approx) { // kd-tree uses squared distances double eps2 = eps*eps; // copy data int nrow = data.nrow(); int ncol = data.ncol(); ANNpointArray dataPts = 
annAllocPts(nrow, ncol); for(int i = 0; i < nrow; i++){ for(int j = 0; j < ncol; j++){ (dataPts[i])[j] = data(i, j); } } int nrow_q = query.nrow(); int ncol_q = query.ncol(); ANNpointArray queryPts = annAllocPts(nrow_q, ncol_q); for(int i = 0; i < nrow_q; i++){ for(int j = 0; j < ncol_q; j++){ (queryPts[i])[j] = query(i, j); } } //Rprintf("Points copied.\n"); // create kd-tree (1) or linear search structure (2) ANNpointSet* kdTree = NULL; if (type==1){ kdTree = new ANNkd_tree(dataPts, nrow, ncol, bucketSize, (ANNsplitRule) splitRule); } else{ kdTree = new ANNbruteForce(dataPts, nrow, ncol); } //Rprintf("kd-tree ready. starting DBSCAN.\n"); // frNN //std::vector< IntegerVector > id; id.resize(nrow); //std::vector< NumericVector > dist; dist.resize(nrow); List id(nrow_q); List dist(nrow_q); for (int p=0; p(), 1 ) ); // take sqrt of distance since the tree stores d^2 //std::transform(N.second.begin(), N.second.end(), // N.second.begin(), static_cast(std::sqrt)); IntegerVector ids = IntegerVector(N.first.begin(), N.first.end()); NumericVector dists = NumericVector(N.second.begin(), N.second.end()); // remove self matches -- not an issue with query points //LogicalVector take = ids != p; //ids = ids[take]; //dists = dists[take]; //Rprintf("Found neighborhood size %d\n", ids.size()); id[p] = ids+1; dist[p] = sqrt(dists); } // cleanup delete kdTree; annDeallocPts(dataPts); annDeallocPts(queryPts); // annClose(); is now done globally in the package // prepare results List ret; ret["dist"] = dist; ret["id"] = id; ret["eps"] = eps; ret["sort"] = false; return ret; } ================================================ FILE: src/hdbscan.cpp ================================================ //---------------------------------------------------------------------- // R interface to dbscan using the ANN library //---------------------------------------------------------------------- // Copyright (c) 2015 Michael Hahsler, Matt Piekenbrock. All Rights Reserved. 
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#include <Rcpp.h>

// C++ includes
#include <unordered_map>
#include <queue>
#include <stack>
#include <cstdlib> // std::atoi

// Helper functions
#include "utilities.h"

using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]

// Macros
#define INDEX_TF(N,to,from) (N)*(to) - (to)*(to+1)/2 + (from) - (to) - (1)

// Given a dist vector of "should-link" (1), "should-not-link" (-1), and "don't care" (0)
// constraints in the form of integers, convert constraints to a more compact adjacency list
// representation.
// [[Rcpp::export]]
List distToAdjacency(IntegerVector constraints, const int N){
  std::unordered_map< int, std::vector<int> > key_map = std::unordered_map< int, std::vector<int> >();
  for (int i = 0; i < N; ++i){
    for (int j = 0; j < N; ++j){
      if (i == j) continue;
      int index = i > j ? INDEX_TF(N, j, i) : INDEX_TF(N, i, j);
      int crule = constraints.at(index);
      if (crule != 0){
        if (key_map.count(i+1) != 1){ key_map[i+1] = std::vector<int>(); } // add 1 for base 1
        key_map[i+1].push_back(crule < 0 ? - (j + 1) : j + 1); // add 1 for base 1
      }
    }
  }
  return(wrap(key_map));
}

// Given an hclust object, convert to a dendrogram object (but much faster).
// [[Rcpp::export]] List buildDendrogram(List hcl) { // Extract hclust info IntegerMatrix merge = hcl["merge"]; NumericVector height = hcl["height"]; IntegerVector order = hcl["order"]; List labels = List(); // allows to avoid type inference if (!hcl.containsElementNamed("labels") || hcl["labels"] == R_NilValue){ labels = seq_along(order); } else { labels = as(hcl["labels"]); } int n = merge.nrow() + 1, k; List new_br, z = List(n); for (k = 0; k < n-1; k++){ int lm = merge(k, 0), rm = merge(k, 1); IntegerVector m = IntegerVector::create(lm, rm); // First Case: Both are singletons, so need to create leaves if (all(m < 0).is_true()){ // Left IntegerVector left = IntegerVector::create(-lm); left.attr("members") = (int) 1; left.attr("height") = (double) 0.f; left.attr("label") = labels.at(-(lm + 1)); left.attr("leaf") = true; // Right IntegerVector right = IntegerVector::create(-rm); right.attr("members") = (int) 1; right.attr("height") = (double) 0.f; right.attr("label") = labels.at(-(rm + 1)); right.attr("leaf") = true; // Merge new_br = List::create(left, right); new_br.attr("members") = 2; new_br.attr("midpoint") = 0.5; } // Second case: 1 is a singleton, the other is a branch else if (any(m < 0).is_true()){ bool isL = lm < 0; // Create the leaf from the negative entry IntegerVector leaf = IntegerVector::create(isL ? -lm : -rm); leaf.attr("members") = 1; leaf.attr("height") = 0; leaf.attr("label") = labels.at(isL ? -(lm + 1) : -(rm + 1)); leaf.attr("leaf") = true; // Merge the leaf with the other existing branch int branch_key = isL ? rm - 1 : lm - 1; List sub_branch = z[branch_key]; new_br = isL ? List::create(leaf, sub_branch) : List::create(sub_branch, leaf); z.at(branch_key) = R_NilValue; // Set attributes of new branch int sub_members = sub_branch.attr("members"); double mid_pt = sub_branch.attr("midpoint"); new_br.attr("members") = int(sub_members) + 1; new_br.attr("midpoint") = (int(isL ? 
1 : sub_members) + mid_pt) / 2; } else { // Create the new branch List l_branch = z.at(lm - 1), r_branch = z.at(rm - 1); new_br = List::create(l_branch, r_branch); // Store attribute valeus in locals to get around proxy int left_members = l_branch.attr("members"), right_members = r_branch.attr("members"); double l_mid = l_branch.attr("midpoint"), r_mid = r_branch.attr("midpoint"); // Set up new branch attributes new_br.attr("members") = left_members + right_members; new_br.attr("midpoint") = (left_members + l_mid + r_mid) / 2; // Deallocate unneeded memory along the way z.at(lm - 1) = R_NilValue; z.at(rm - 1) = R_NilValue; } new_br.attr("height") = height.at(k); z.at(k) = new_br; } List res = z.at(k - 1); res.attr("class") = "dendrogram"; return(res); } // Simple function to iteratively get the sub-children of a nested integer-hierarchy // [[Rcpp::export]] IntegerVector all_children(List hier, int key, bool leaves_only = false){ IntegerVector res = IntegerVector(); // If the key doesn't exist return an empty vector if (!hier.containsElementNamed(std::to_string(key).c_str())){ return(res); } // Else, do iterative 'recursive' type function to extract all the IDs of // all sub trees IntegerVector children = hier[std::to_string(key).c_str()]; std::queue to_do = std::queue(); to_do.push(key); while (to_do.size() != 0){ int parent = to_do.front(); if (!hier.containsElementNamed(std::to_string(parent).c_str())){ to_do.pop(); } else { children = hier[std::to_string(parent).c_str()]; to_do.pop(); for (int n_children = 0; n_children < children.length(); ++n_children){ int child_id = children.at(n_children); if (leaves_only){ if (!hier.containsElementNamed(std::to_string(child_id).c_str())) { res.push_back(child_id); } } else { res.push_back(child_id); } to_do.push(child_id); } } } return(res); } // Extract 'flat' assignments IntegerVector getSalientAssignments(List cl_tree, List cl_hierarchy, std::list sc, const int n){ IntegerVector cluster = IntegerVector(n, 0); for 
(std::list::iterator it = sc.begin(); it != sc.end(); it++) { IntegerVector child_cl = all_children(cl_hierarchy, *it); // If at a leaf, its not necessary to recursively get point indices, else need to traverse hierarchy if (child_cl.length() == 0){ List cl = cl_tree[std::to_string(*it)]; cluster[as(cl["contains"]) - 1] = *it; } else { List cl = cl_tree[std::to_string(*it)]; cluster[as(cl["contains"]) - 1] = *it; for (IntegerVector::iterator child_cid = child_cl.begin(); child_cid != child_cl.end(); ++child_cid){ cl = cl_tree[std::to_string(*child_cid)]; IntegerVector child_contains = as(cl["contains"]); if (child_contains.length() > 0){ cluster[child_contains - 1] = *it; } } } } return(cluster); } // Retrieve node (x, y) positions in a cluster tree // [[Rcpp::export]] NumericMatrix node_xy(List cl_tree, List cl_hierarchy, int cid = 0){ // Initialize if (cid == 0){ cl_tree["node_xy"] = NumericMatrix(all_children(cl_hierarchy, 0).size()+1, 2); cl_tree["leaf_counter"] = 0; cl_tree["row_counter"] = 0; } // Retrieve/set variables std::string cid_str = std::to_string(cid); NumericMatrix node_xy_ = cl_tree["node_xy"]; List cl = cl_tree[cid_str]; // Increment row index every time int row_index = (int) cl_tree["row_counter"]; cl_tree["row_counter"] = row_index+1; // base case if (!cl_hierarchy.containsElementNamed(cid_str.c_str())){ int leaf_index = (int) cl_tree["leaf_counter"]; node_xy_(row_index, _) = NumericVector::create((double) ++leaf_index, (double) cl["eps_death"]); cl_tree["leaf_counter"] = leaf_index; NumericMatrix res = NumericMatrix(1, 1); res[0] = row_index; return(res); } else { IntegerVector children = cl_hierarchy[cid_str]; int l_row = (int) node_xy(cl_tree, cl_hierarchy, children.at(0))[0]; // left int r_row = (int) node_xy(cl_tree, cl_hierarchy, children.at(1))[0]; // right double lvalue = (double) (node_xy_(l_row, 0) + node_xy_(r_row, 0)) / 2; node_xy_(row_index, _) = NumericVector::create(lvalue, (double) cl["eps_death"]); if (cid != 0){ NumericMatrix 
res = NumericMatrix(1, 1);
      res[0] = row_index;
      return(res);
    }
  }
  // Cleanup
  if (cid == 0){
    cl_tree["leaf_counter"] = R_NilValue;
    cl_tree["row_counter"] = R_NilValue;
  }
  return (node_xy_);
}

// Given a cluster tree, convert to a simplified dendrogram
// [[Rcpp::export]]
List simplifiedTree(List cl_tree) {
  // Hierarchical information
  List cl_hierarchy = cl_tree.attr("cl_hierarchy");
  IntegerVector all_childs = all_children(cl_hierarchy, 0);

  // To keep track of members and midpoints
  std::unordered_map<std::string, int> members = std::unordered_map<std::string, int>();
  std::unordered_map<std::string, float> mids = std::unordered_map<std::string, float>();

  // To keep track of where we are
  std::stack<int> cid_stack = std::stack<int>();
  cid_stack.push(0);

  // Iteratively build the hierarchy
  List dendrogram = List();

  // Premake children
  for (IntegerVector::iterator it = all_childs.begin(); it != all_childs.end(); ++it){
    std::string cid_label = std::to_string(*it);
    List cl = cl_tree[cid_label];
    if (!cl_hierarchy.containsElementNamed(cid_label.c_str())){
      // Create leaf
      IntegerVector leaf = IntegerVector::create(*it);
      leaf.attr("label") = cid_label;
      leaf.attr("members") = 1;
      leaf.attr("height") = cl["eps_death"];
      leaf.attr("midpoint") = 0;
      leaf.attr("leaf") = true;
      dendrogram[cid_label] = leaf;
      members[cid_label] = 1;
      mids[cid_label] = 0;
    }
  }

  // Building the dendrogram bottom-up
  while(!cid_stack.empty()) {
    int cid = cid_stack.top();
    std::string cid_label = std::to_string(cid);
    List cl = cl_tree[cid_label];

    // Recursive calls
    IntegerVector local_children = cl_hierarchy[cid_label];

    // Members and midpoint extraction
    std::string l_str = std::to_string(local_children.at(0)), r_str = std::to_string(local_children.at(1));
    // Rcout << "Comparing: " << l_str << ", " << r_str << std::endl;
    if (!dendrogram.containsElementNamed(l_str.c_str())){ cid_stack.push(local_children.at(0)); continue; }
    if (!dendrogram.containsElementNamed(r_str.c_str())){ cid_stack.push(local_children.at(1)); continue; }

    // Continue building up the hierarchy
    List left = dendrogram[l_str], right =
dendrogram[r_str];
    int l_members = members[l_str], r_members = members[r_str];
    float l_mid = mids[l_str], r_mid = mids[r_str];

    // Make the new branch
    List new_branch = List::create(dendrogram[l_str], dendrogram[r_str]);
    new_branch.attr("label") = cid_label;
    new_branch.attr("members") = l_members + r_members;
    new_branch.attr("height") = (float) cl["eps_death"];
    new_branch.attr("class") = "dendrogram";

    // Midpoint calculation
    bool isL = (bool) !cl_hierarchy.containsElementNamed(l_str.c_str()); // is left a leaf
    if (!isL && cl_hierarchy.containsElementNamed(r_str.c_str())){ // is non-singleton merge
      new_branch.attr("midpoint") = (l_members + l_mid + r_mid) / 2;
    } else { // contains a leaf
      int sub_members = isL ? r_members : l_members;
      float mid_pt = isL ? r_mid : l_mid;
      new_branch.attr("midpoint") = ((isL ? 1 : sub_members) + mid_pt) / 2;
    }

    // Save info for later
    members[cid_label] = l_members + r_members;
    mids[cid_label] = (float) new_branch.attr("midpoint");
    dendrogram[cid_label] = new_branch;

    // Done with this node
    cid_stack.pop();
  }
  return(dendrogram["0"]);
}

/* Main processing step to compute all the relevant information in the form of the
 * 'cluster tree' for FOSC. The cluster stability scores are computed via the tree
 * traversal; extracting the flat clustering relies on a separate function.
 * Requires information associated with hclust elements. See ?hclust in R for more info.
 * 1. merge := an (n-1) x 2 matrix representing the MST computed from any arbitrary similarity matrix
 * 2. height := the (linkage) distance each new set of clusters forms from the MST
 * 3.
order := the point indices of the original data that the negative entries in merge refer to
 * Notation: eps is used to arbitrarily refer to the dissimilarity distance used
*/
// [[Rcpp::export]]
List computeStability(const List hcl, const int minPts, bool compute_glosh = false){
  // Extract hclust info
  NumericMatrix merge = hcl["merge"];
  NumericVector eps_dist = hcl["height"];
  IntegerVector pt_order = hcl["order"];
  int n = merge.nrow() + 1, k;

  // Which cluster does each merge step represent (after the merge, or before the split)
  IntegerVector cl_tracker = IntegerVector(n-1, 0),
                member_sizes = IntegerVector(n-1, 0); // Size of each step
  List clusters = List(),      // Final cluster information
       cl_hierarchy = List();  // Keeps track of the hierarchy, i.e. which cluster contains which

  // The primary information needed
  std::unordered_map<std::string, IntegerVector> contains = std::unordered_map<std::string, IntegerVector>();
  std::unordered_map<std::string, NumericVector> eps = std::unordered_map<std::string, NumericVector>();

  // Supplemental information, kept for either convenience or to reduce memory
  std::unordered_map<std::string, int> n_children = std::unordered_map<std::string, int>();
  std::unordered_map<std::string, double> eps_death = std::unordered_map<std::string, double>();
  std::unordered_map<std::string, double> eps_birth = std::unordered_map<std::string, double>();
  std::unordered_map<std::string, bool> processed = std::unordered_map<std::string, bool>();

  // First pass: Agglomerate up the hierarchy, recording member sizes.
  // This enables a dynamic programming strategy to improve performance below.
  for (k = 0; k < n-1; ++k){
    int lm = merge(k, 0), rm = merge(k, 1);
    IntegerVector m = IntegerVector::create(lm, rm);
    if (all(m < 0).is_true()){
      member_sizes[k] = 2;
    } else if (any(m < 0).is_true()) {
      int pos_merge = (lm < 0 ?
rm : lm), merge_size = member_sizes[pos_merge - 1]; member_sizes[k] = merge_size + 1; } else { // Record Member Sizes int merge_size1 = member_sizes[lm-1], merge_size2 = member_sizes[rm-1]; member_sizes[k] = merge_size1 + merge_size2; } } // Initialize root (unknown size, might be 0, so don't initialize length) std::string root_str = "0"; contains[root_str] = NumericVector(); eps[root_str] = NumericVector(); eps_birth[root_str] = eps_dist.at(eps_dist.length()-1); int global_cid = 0; // Second pass: Divisively split the hierarchy, recording the epsilon and point index values as needed for (k = n-2; k >= 0; --k){ // Current Merge int lm = merge(k, 0), rm = merge(k, 1), cid = cl_tracker.at(k); IntegerVector m = IntegerVector::create(lm, rm); std::string cl_cid = std::to_string(cid); // Trivial case: split into singletons, record eps, contains, and ensure eps_death is minimal if (all(m < 0).is_true()){ contains[cl_cid].push_back(-lm), contains[cl_cid].push_back(-rm); double noise_eps = processed[cl_cid] ? eps_death[cl_cid] : eps_dist.at(k); eps[cl_cid].push_back(noise_eps), eps[cl_cid].push_back(noise_eps); eps_death[cl_cid] = processed[cl_cid] ? eps_death[cl_cid] : std::min((double) eps_dist.at(k), (double) eps_death[cl_cid]); } else if (any(m < 0).is_true()) { // Record new point info and mark the non-singleton with the cluster id contains[cl_cid].push_back(-(lm < 0 ? lm : rm)); eps[cl_cid].push_back(processed[cl_cid] ? eps_death[cl_cid] : eps_dist.at(k)); cl_tracker.at((lm < 0 ? 
rm : lm) - 1) = cid;
    } else {
      int merge_size1 = member_sizes[lm-1], merge_size2 = member_sizes[rm-1];

      // The minPts step
      if (merge_size1 >= minPts && merge_size2 >= minPts){
        // Record death of current cluster
        eps_death[cl_cid] = eps_dist.at(k);
        processed[cl_cid] = true;

        // Mark the lower merge steps as new clusters
        cl_hierarchy[cl_cid] = IntegerVector::create(global_cid+1, global_cid+2);
        std::string l_index = std::to_string(global_cid+1), r_index = std::to_string(global_cid+2);
        cl_tracker.at(lm - 1) = ++global_cid, cl_tracker.at(rm - 1) = ++global_cid;

        // Record the distance the new clusters appeared and initialize containers
        contains[l_index] = IntegerVector(), contains[r_index] = IntegerVector();
        eps[l_index] = NumericVector(), eps[r_index] = NumericVector();
        eps_birth[l_index] = eps_dist.at(k), eps_birth[r_index] = eps_dist.at(k);
        eps_death[l_index] = eps_dist.at(lm - 1), eps_death[r_index] = eps_dist.at(rm - 1);
        processed[l_index] = false, processed[r_index] = false;
        n_children[cl_cid] = merge_size1 + merge_size2;
      } else {
        // Inherit cluster identity
        cl_tracker.at(lm - 1) = cid, cl_tracker.at(rm - 1) = cid;
      }
    }
  }

  // Aggregate data into a returnable list
  // NOTE: the 'contains' element will be empty for all inner nodes w/ minPts == 1, else
  // it will contain only the objects that were considered 'noise' at that hierarchical level
  List res = List();
  NumericVector outlier_scores;
  if (compute_glosh) { outlier_scores = NumericVector(n, -1.0); }
  for (std::unordered_map<std::string, IntegerVector>::iterator key = contains.begin(); key != contains.end(); ++key){
    int nc = n_children[key->first];
    res[key->first] = List::create(
      _["contains"] = key->second,
      _["eps"] = eps[key->first],
      _["eps_birth"] = eps_birth[key->first],
      _["eps_death"] = eps_death[key->first],
      _["stability"] = sum(1/eps[key->first] - 1/eps_birth[key->first]) + (nc * 1/eps_death[key->first] - nc * 1/eps_birth[key->first]),
      //_["_stability"] = 1/eps[key->first] - 1/eps_birth[key->first],
      _["n_children"] = n_children[key->first]
);

    // Compute GLOSH outlier scores (HDBSCAN only)
    if (compute_glosh){
      if (eps[key->first].size() > 0){ // contains noise points
        double eps_max = std::numeric_limits<double>::infinity();
        IntegerVector leaf_membership = all_children(cl_hierarchy, atoi(key->first.c_str()), true);
        if (leaf_membership.length() == 0){ // is itself a leaf
          eps_max = eps_death[key->first];
        } else {
          for (IntegerVector::iterator it = leaf_membership.begin(); it != leaf_membership.end(); ++it){
            eps_max = std::min(eps_max, eps_death[std::to_string(*it)]);
          }
        }
        NumericVector eps_max_vec = NumericVector(eps[key->first].size(), eps_max) / as<NumericVector>(eps[key->first]);
        NumericVector glosh = Rcpp::rep(1.0, key->second.length()) - eps_max_vec;
        outlier_scores[key->second - 1] = glosh;
      }
      // MFH: If the point is never an outlier (0/0) then set GLOSH to 0
      outlier_scores[is_nan(outlier_scores)] = 0.0;
    }
  }

  // Store meta-data as attributes
  res.attr("n") = n;                       // number of points in the original data
  res.attr("cl_hierarchy") = cl_hierarchy; // Stores parent/child structure
  res.attr("cl_tracker") = cl_tracker;     // stores cluster id formation for each merge step, used for cluster extraction
  res.attr("minPts") = minPts;             // needed later
  // res.attr("root") = minPts == 1;       // needed later to ensure root is not captured as a cluster
  if (compute_glosh){ res.attr("glosh") = outlier_scores; } // glosh outlier scores (hdbscan only)
  return(res);
}

// Validates a given list of instance-level constraints for symmetry.
Since the number of
// constraints might change dramatically based on the problem, an initial loop is performed
// to figure out whether it would be faster to check via an adjacency list or matrix
// [[Rcpp::export]]
List validateConstraintList(List& constraints, int n){
  std::vector< std::string > keys = as< std::vector< std::string > >(constraints.names());
  bool is_valid = true, tmp_valid, use_matrix = false;
  int n_constraints = 0;
  for (List::iterator it = constraints.begin(); it != constraints.end(); ++it){
    n_constraints += as<IntegerVector>(*it).size();
  }

  // Sparsity check: if the constraints make up a sufficiently large amount of
  // the solution space, use a matrix to check validity
  // (note: floating-point division; integer division would truncate to 0 here)
  if (n_constraints / ((double) n * n) > 0.20){ use_matrix = true; }

  // Check using adjacency matrix
  if (use_matrix){
    IntegerMatrix adj_matrix = IntegerMatrix(Dimension(n, n));
    int from, to;
    for (std::vector< std::string >::iterator it = keys.begin(); it != keys.end(); ++it){
      // Get constraints
      int cid = atoi(it->c_str()); // to base-0
      IntegerVector cs_ = constraints[*it];

      // Positive "should-link" constraints
      IntegerVector pcons = as<IntegerVector>(cs_[cs_ > 0]);
      for (IntegerVector::iterator pc = pcons.begin(); pc != pcons.end(); ++pc){
        from = (*pc < cid ? *pc : cid) - 1;
        to = (*pc > cid ? *pc : cid) - 1;
        adj_matrix(from, to) = 1;
      }

      // Negative "should-not-link" constraints
      IntegerVector ncons = -(as<IntegerVector>(cs_[cs_ < 0]));
      for (IntegerVector::iterator nc = ncons.begin(); nc != ncons.end(); ++nc){
        from = (*nc < cid ? *nc : cid) - 1;
        to = (*nc > cid ? *nc : cid) - 1;
        adj_matrix(from, to) = -1;
      }
    }

    // Check symmetry
    IntegerVector lower = lowerTri(adj_matrix);
    IntegerMatrix adj_t = Rcpp::transpose(adj_matrix);
    IntegerVector lower_t = lowerTri(adj_t);
    LogicalVector valid_check = lower == lower_t;
    is_valid = all(valid_check == TRUE).is_true();

    // Try to merge the two
    if (!is_valid){
      int sum = 0;
      for (int i = 0; i < lower.size(); ++i){
        sum = lower.at(i) + lower_t.at(i);
        lower[i] = sum > 0 ? 1 : sum < 0 ?
-1 : 0; } } constraints = distToAdjacency(lower, n); } // Else check using given adjacency list else { for (std::vector< std::string >::iterator it = keys.begin(); it != keys.end(); ++it){ // Get constraints int cid = atoi(it->c_str()); IntegerVector cs_ = constraints[*it]; // Positive "should-link" constraints IntegerVector pcons = as(cs_[cs_ > 0]); for (IntegerVector::iterator pc = pcons.begin(); pc != pcons.end(); ++pc){ int ic = *pc < 0 ? -(*pc) : *pc; std::string ic_str = std::to_string(ic); bool exists = constraints.containsElementNamed(ic_str.c_str()); tmp_valid = exists ? contains(as(constraints[ic_str]), cid) : false; if (!tmp_valid){ if (!exists){ constraints[ic_str] = IntegerVector::create(cid); } else { IntegerVector con_vec = constraints[ic_str]; con_vec.push_back(cid); constraints[ic_str] = con_vec; } is_valid = false; } } // Negative "should-not-link" constraints IntegerVector ncons = -(as(cs_[cs_ < 0])); for (IntegerVector::iterator nc = ncons.begin(); nc != ncons.end(); ++nc){ int ic = *nc < 0 ? -(*nc) : *nc; std::string ic_str = std::to_string(ic); bool exists = constraints.containsElementNamed(ic_str.c_str()); tmp_valid = exists ? contains(as(constraints[ic_str]), cid) : false; if (!tmp_valid){ if (!exists){ constraints[ic_str] = IntegerVector::create(-cid); } else { IntegerVector con_vec = constraints[ic_str]; con_vec.push_back(-cid); constraints[ic_str] = con_vec; } is_valid = false; } } } } // Produce warning if asymmetric constraints detected; return attempt at fixing constraints. if (!is_valid){ warning("Incomplete (asymmetric) constraints detected. 
Populating constraint list.");
  }
  return(constraints);
}

// [[Rcpp::export]]
double computeVirtualNode(IntegerVector noise, List constraints){
  if (noise.length() == 0) return(0);
  if (Rf_isNull(constraints)) return(0);

  // Semi-supervised extraction
  int satisfied_constraints = 0;
  // Rcout << "Starting constraint based optimization" << std::endl;
  for (IntegerVector::iterator it = noise.begin(); it != noise.end(); ++it){
    std::string cs_str = std::to_string(*it);
    if (constraints.containsElementNamed(cs_str.c_str())){
      // Get constraints
      IntegerVector cs_ = constraints[cs_str];

      // Positive "should-link" constraints
      IntegerVector pcons = as<IntegerVector>(cs_[cs_ > 0]);
      for (IntegerVector::iterator pc = pcons.begin(); pc != pcons.end(); ++pc){
        satisfied_constraints += contains(noise, *pc);
      }

      // Negative "should-not-link" constraints
      IntegerVector ncons = -(as<IntegerVector>(cs_[cs_ < 0]));
      for (IntegerVector::iterator nc = ncons.begin(); nc != ncons.end(); ++nc){
        satisfied_constraints += (1 - contains(noise, *nc));
      }
    }
  }
  return(satisfied_constraints);
}

// Framework for Optimal Selection of Clusters (FOSC)
// Traverses a cluster tree hierarchy to compute a flat solution, maximizing the:
// - Unsupervised soln: the 'most stable' clusters following the given linkage criterion
// - SS soln w/ instance level Constraints: constraint-based w/ unsupervised tiebreaker
// - SS soln w/ mixed objective function: maximizes J = α JU + (1 − α) JSS
// [[Rcpp::export]]
NumericVector fosc(List cl_tree, std::string cid, std::list<int>& sc, List cl_hierarchy,
                   bool prune_unstable_leaves=false,        // whether to prune -very- unstable subbranches
                   double cluster_selection_epsilon = 0.0,  // whether to prune subbranches below a given epsilon
                   const double alpha = 0,                  // mixed objective case
                   bool useVirtual = false,                 // return virtual node as well
                   const int n_constraints = 0,             // number of constraints
                   List constraints = R_NilValue)           // instance-level constraints
{
  // Base case: at a leaf
  if (!cl_hierarchy.containsElementNamed(cid.c_str())){
List cl = cl_tree[cid]; sc.push_back(std::atoi(cid.c_str())); // assume the leaf will be a salient cluster until proven otherwise return(NumericVector::create((double) cl["stability"], (double) useVirtual ? cl["vscore"] : 0)); } else { // Non-base case: at a merge of clusters, determine which to keep List cl = cl_tree[cid]; // Get child stability/constraint scores NumericVector scores, stability_scores = NumericVector(), constraint_scores = NumericVector(); IntegerVector child_ids = cl_hierarchy[cid]; for (int i = 0, clen = child_ids.length(); i < clen; ++i){ int child_id = child_ids.at(i); scores = fosc(cl_tree, std::to_string(child_id), sc, cl_hierarchy, prune_unstable_leaves, cluster_selection_epsilon, alpha, useVirtual, n_constraints, constraints); stability_scores.push_back(scores.at(0)); constraint_scores.push_back(scores.at(1)); } // If semisupervised scenario, normalizing should be stored in 'total_stability' double total_stability = (contains(cl_tree.attributeNames(),"total_stability") ? 
(double) cl_tree.attr("total_stability") : 1.0); // Compare and update stability scores double old_stability_score = (double) cl["stability"] / total_stability; double new_stability_score = (double) sum(stability_scores) / total_stability; // Compute instance-level constraints if necessary double old_constraint_score = 0, new_constraint_score = 0; if (useVirtual){ // Rcout << "old constraint score for " << cid << ": " << (double) cl["vscore"] << std::endl; old_constraint_score = (double) cl["vscore"]; new_constraint_score = (double) sum(constraint_scores) + (double) computeVirtualNode(cl["contains"], constraints)/n_constraints; } bool keep_children = true; // If the score is unchanged, remove the children and add parent if (useVirtual){ if (old_constraint_score < new_constraint_score && cid != "0"){ // Children satisfies more constraints cl["vscore"] = new_constraint_score; cl["score"] = alpha * new_stability_score + (1 - alpha) * new_constraint_score; // Rcout << "1: score for " << cid << ":" << (double) cl["score"] << std::endl; // Rcout << "(old constraint): " << old_constraint_score << ", (new constraint): " << new_constraint_score << std::endl; } else if (old_constraint_score > new_constraint_score && cid != "0"){ // Parent satisfies more constraints cl["vscore"] = old_constraint_score; cl["score"] = alpha * old_stability_score + (1 - alpha) * old_constraint_score; // Rcout << "2: score for " << cid << ":" << (double) cl["score"] << std::endl; keep_children = false; } else { // Resolve tie using unsupervised, stability-based approach if (old_stability_score < new_stability_score){ // Children are more stable cl["score"] = new_stability_score / total_stability; // Rcout << "3: score for " << cid << ":" << (double) cl["score"] << std::endl; } else { // Parent is more stable cl["score"] = old_stability_score / total_stability; // Rcout << "4: score for " << cid << ":" << (double) cl["score"] << std::endl; // Rcout << "(old stability): " << old_stability_score << 
", (total stability): " << total_stability << std::endl; keep_children = false; } cl["vscore"] = old_constraint_score; } } else { // Use unsupervised, stability-based approach only if (old_stability_score < new_stability_score){ cl["score"] = new_stability_score; // keep children } else { cl["score"] = old_stability_score; keep_children = false; } } double epsdeath = (double) cl["eps_death"]; if (epsdeath < cluster_selection_epsilon){ keep_children = false; // prune children that emerge at distance below epsilon } // Prune children and add parent (cid) if need be if (!keep_children && cid != "0") { IntegerVector children = all_children(cl_hierarchy, std::atoi(cid.c_str())); // use all_children to prune subtrees for (int i = 0, clen = children.length(); i < clen; ++i){ sc.remove(children.at(i)); // use list for slightly better random deletion performance } sc.push_back(std::atoi(cid.c_str())); } else if (keep_children && prune_unstable_leaves){ // If flag passed, prunes leaves with insignificant stability scores // this can happen in cases where one leaf has a stability score significantly greater // than both its siblings and its parent (or other ancestors), causing sibling branches // to be considered as clusters even though they may nto be significantly more stable than their parent if (all(stability_scores < old_stability_score).is_false()){ for (int i = 0, clen = child_ids.length(); i < clen; ++i){ if (stability_scores.at(i) < old_stability_score){ IntegerVector to_prune = all_children(cl_hierarchy, child_ids.at(i)); // all sub members for (IntegerVector::iterator it = to_prune.begin(); it != to_prune.end(); ++it){ //Rcout << "Pruning: " << *it << std::endl; sc.remove(*it); } } } } } // Save scores for traversal up and for later cl_tree[cid] = cl; // Return this sub trees score return(NumericVector::create((double) cl["score"], useVirtual ? 
(double) cl["vscore"] : 0)); } } // Given a cluster tree object with computed stability precomputed scores from computeStability, // extract the 'most stable' or salient flat cluster assignments. The large number of derivable // arguments due to fosc being a recursive function // [[Rcpp::export]] List extractUnsupervised(List cl_tree, bool prune_unstable = false, double cluster_selection_epsilon = 0.0){ // Compute Salient Clusters std::list sc = std::list(); List cl_hierarchy = cl_tree.attr("cl_hierarchy"); int n = as(cl_tree.attr("n")); fosc(cl_tree, "0", sc, cl_hierarchy, prune_unstable, cluster_selection_epsilon); // Assume root node is always id == 0 // Store results as attributes cl_tree.attr("cluster") = getSalientAssignments(cl_tree, cl_hierarchy, sc, n); // Flat assignments cl_tree.attr("salient_clusters") = wrap(sc); // salient clusters return(cl_tree); } // [[Rcpp::export]] List extractSemiSupervised(List cl_tree, List constraints, float alpha = 0, bool prune_unstable_leaves = false, double cluster_selection_epsilon = 0.0){ // Rcout << "Starting semisupervised extraction..." 
<< std::endl; List root = cl_tree["0"]; List cl_hierarchy = cl_tree.attr("cl_hierarchy"); int n = as(cl_tree.attr("n")); // Compute total number of constraints int n_constraints = 0; for (int i = 0, n = constraints.length(); i < n; ++i){ IntegerVector cl_constraints = constraints.at(i); n_constraints += cl_constraints.length(); } // Initialize root List cl = cl_tree["0"]; cl["vscore"] = 0; cl_tree["0"] = cl; // replace to keep changes // Compute initial gamma values or "virtual nodes" for both leaf and internal nodes IntegerVector cl_ids = all_children(cl_hierarchy, 0); for (IntegerVector::iterator it = cl_ids.begin(); it != cl_ids.end(); ++it){ if (*it != 0){ std::string cid_str = std::to_string(*it); List cl = cl_tree[cid_str]; // Store the initial fraction of constraints satisfied for each node as 'vscore' // NOTE: leaf scores represent \hat{gamma}, internal represent virtual node scores if (cl_hierarchy.containsElementNamed(cid_str.c_str())){ // Extract the point indices the cluster contains IntegerVector child_cl = all_children(cl_hierarchy, *it), child_ids; List cl_container = List(); for (IntegerVector::iterator ch_id = child_cl.begin(); ch_id != child_cl.end(); ++ch_id){ List ch_cl = cl_tree[std::to_string(*ch_id)]; //child_ids = combine(child_ids, ch_cl["contains"]); cl_container.push_back(as(ch_cl["contains"])); } cl_container.push_back(as(cl["contains"])); child_ids = concat_int(cl_container); cl["vscore"] = computeVirtualNode(child_ids, constraints)/n_constraints; } else { // is leaf node cl["vscore"] = computeVirtualNode(cl["contains"], constraints)/n_constraints; } cl_tree[cid_str] = cl; // replace to keep changes } } // First pass: compute unsupervised soln as a means of extracting normalizing constant J_U^* cl_tree = extractUnsupervised(cl_tree, false, cluster_selection_epsilon); IntegerVector stable_sc = cl_tree.attr("salient_clusters"); double total_stability = 0.0f; for (IntegerVector::iterator it = stable_sc.begin(); it != stable_sc.end(); 
++it){ List cl = cl_tree[std::to_string(*it)]; total_stability += (double) cl["stability"]; } cl_tree.attr("total_stability") = total_stability; // Rcout << "Total stability: " << total_stability << std::endl; // Compute stable clusters w/ instance-level constraints std::list sc = std::list(); fosc(cl_tree, "0", sc, cl_hierarchy, prune_unstable_leaves, cluster_selection_epsilon, alpha, true, n_constraints, constraints); // semi-supervised parameters // Store results as attributes and return cl_tree.attr("salient_clusters") = wrap(sc); cl_tree.attr("cluster") = getSalientAssignments(cl_tree, cl_hierarchy, sc, n); return(cl_tree); } ================================================ FILE: src/kNN.cpp ================================================ //---------------------------------------------------------------------- // Find the k Nearest Neighbors // File: R_kNNdist.cpp //---------------------------------------------------------------------- // Copyright (c) 2015 Michael Hahsler. All Rights Reserved. // // This software is provided under the provisions of the // GNU General Public License (GPL) Version 3 // (see: http://www.gnu.org/licenses/gpl-3.0.en.html) // Note: does not return self-matches! #include "kNN.h" // returns knn + dist List kNN_int(NumericMatrix data, int k, int type, int bucketSize, int splitRule, double approx) { // copy data int nrow = data.nrow(); int ncol = data.ncol(); ANNpointArray dataPts = annAllocPts(nrow, ncol); for(int i = 0; i < nrow; i++){ for(int j = 0; j < ncol; j++){ (dataPts[i])[j] = data(i, j); } } //Rprintf("Points copied.\n"); // create kd-tree (1) or linear search structure (2) ANNpointSet* kdTree = NULL; if (type==1){ kdTree = new ANNkd_tree(dataPts, nrow, ncol, bucketSize, (ANNsplitRule) splitRule); } else{ kdTree = new ANNbruteForce(dataPts, nrow, ncol); } //Rprintf("kd-tree ready. starting DBSCAN.\n"); NumericMatrix d(nrow, k); IntegerMatrix id(nrow, k); // Note: the search also returns the point itself (as the first hit)! 
// So we have to look for k+1 points.
  ANNdistArray dists = new ANNdist[k+1];
  ANNidxArray nnIdx = new ANNidx[k+1];

  for (int i = 0; i < nrow; i++){
    ANNpoint queryPt = dataPts[i];
    kdTree->annkSearch(queryPt, k+1, nnIdx, dists, approx);

    // remove self match
    IntegerVector ids = IntegerVector(nnIdx, nnIdx+k+1);
    LogicalVector take = ids != i;
    ids = ids[take];
    id(i, _) = ids + 1;

    NumericVector ndists = NumericVector(dists, dists+k+1)[take];
    d(i, _) = sqrt(ndists);
  }

  // cleanup
  delete kdTree;
  delete [] dists;
  delete [] nnIdx;
  annDeallocPts(dataPts);
  // annClose(); is now done globally in the package

  // prepare results
  List ret;
  ret["dist"] = d;
  ret["id"] = id;
  ret["k"] = k;
  ret["sort"] = true;
  return ret;
}

// returns knn + dist using data and query
// [[Rcpp::export]]
List kNN_query_int(NumericMatrix data, NumericMatrix query, int k, int type, int bucketSize, int splitRule, double approx) {
  // FIXME: check ncol for data and query

  // copy data
  int nrow = data.nrow();
  int ncol = data.ncol();
  ANNpointArray dataPts = annAllocPts(nrow, ncol);
  for(int i = 0; i < nrow; i++){
    for(int j = 0; j < ncol; j++){
      (dataPts[i])[j] = data(i, j);
    }
  }

  // copy query
  int nrow_q = query.nrow();
  int ncol_q = query.ncol();
  ANNpointArray queryPts = annAllocPts(nrow_q, ncol_q);
  for(int i = 0; i < nrow_q; i++){
    for(int j = 0; j < ncol_q; j++){
      (queryPts[i])[j] = query(i, j);
    }
  }
  //Rprintf("Points copied.\n");

  // create kd-tree (1) or linear search structure (2)
  ANNpointSet* kdTree = NULL;
  if (type==1){
    kdTree = new ANNkd_tree(dataPts, nrow, ncol, bucketSize, (ANNsplitRule) splitRule);
  } else{
    kdTree = new ANNbruteForce(dataPts, nrow, ncol);
  }
  //Rprintf("kd-tree ready.
starting DBSCAN.\n");

  NumericMatrix d(nrow_q, k);
  IntegerMatrix id(nrow_q, k);

  // Note: does not return itself with query
  ANNdistArray dists = new ANNdist[k];
  ANNidxArray nnIdx = new ANNidx[k];

  for (int i = 0; i < nrow_q; i++){
    ANNpoint queryPt = queryPts[i];
    kdTree->annkSearch(queryPt, k, nnIdx, dists, approx);

    IntegerVector ids = IntegerVector(nnIdx, nnIdx+k);
    id(i, _) = ids + 1;

    NumericVector ndists = NumericVector(dists, dists+k);
    d(i, _) = sqrt(ndists);
  }

  // cleanup
  delete kdTree;
  delete [] dists;
  delete [] nnIdx;
  annDeallocPts(dataPts);
  annDeallocPts(queryPts);
  // annClose(); is now done globally in the package

  // prepare results (ANN returns points sorted by distance)
  List ret;
  ret["dist"] = d;
  ret["id"] = id;
  ret["k"] = k;
  ret["sort"] = true;
  return ret;
}

================================================
FILE: src/kNN.h
================================================
#ifndef KNN_H
#define KNN_H

#include <Rcpp.h>
#include "ANN/ANN.h"

using namespace Rcpp;

// returns knn + dist
// [[Rcpp::export]]
List kNN_int(NumericMatrix data, int k, int type, int bucketSize, int splitRule, double approx);
#endif

================================================
FILE: src/lof.cpp
================================================
//----------------------------------------------------------------------
// Find the Neighbourhood for LOF
// File: R_lof.cpp
//----------------------------------------------------------------------
// Copyright (c) 2021 Michael Hahsler. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

// LOF needs to find the k-NN distance and then how many points are within this
// neighborhood.

#include <Rcpp.h>
#include "regionQuery.h"

using namespace Rcpp;

// returns knn-dist and the neighborhood size as a matrix
// [[Rcpp::export]]
List lof_kNN(NumericMatrix data, int minPts, int type, int bucketSize, int splitRule, double approx) {
  // minPts includes the point itself; k does not!
int k = minPts - 1;

  // copy data
  int nrow = data.nrow();
  int ncol = data.ncol();
  ANNpointArray dataPts = annAllocPts(nrow, ncol);
  for(int i = 0; i < nrow; i++){
    for(int j = 0; j < ncol; j++){
      (dataPts[i])[j] = data(i, j);
    }
  }
  //Rprintf("Points copied.\n");

  // create kd-tree (1) or linear search structure (2)
  ANNpointSet* kdTree = NULL;
  if (type==1){
    kdTree = new ANNkd_tree(dataPts, nrow, ncol, bucketSize, (ANNsplitRule) splitRule);
  } else{
    kdTree = new ANNbruteForce(dataPts, nrow, ncol);
  }
  //Rprintf("kd-tree ready. starting DBSCAN.\n");

  // Note: the search also returns the point itself (as the first hit)!
  // So we have to look for k+1 points.
  ANNdistArray dists = new ANNdist[k+1];
  ANNidxArray nnIdx = new ANNidx[k+1];
  nn N;

  // results
  List id(nrow);
  List dist(nrow);
  NumericVector k_dist(nrow);

  for (int i = 0; i < nrow; i++){
    ANNpoint queryPt = dataPts[i];
    kdTree->annkSearch(queryPt, k+1, nnIdx, dists, approx);
    k_dist[i] = ANN_ROOT(dists[k]); // this is a squared distance!

    // find k-NN neighborhood which can be larger than k with tied distances
    // This works under Linux and Windows, but not under Solaris: The points at the
    // k_distance may not be included.
    //nn N = regionQueryDist_point(queryPt, dataPts, kdTree, dists[k], approx);

    // Make the comparison robust.
// Compare doubles: http://c-faq.com/fp/fpequal.html double minPts_dist = dists[k] + DBL_EPSILON * dists[k]; nn N = regionQueryDist_point(queryPt, dataPts, kdTree, minPts_dist, approx); IntegerVector ids = IntegerVector(N.first.begin(), N.first.end()); NumericVector dists = NumericVector(N.second.begin(), N.second.end()); // remove self matches -- not an issue with query points LogicalVector take = ids != i; ids = ids[take]; dists = dists[take]; id[i] = ids+1; dist[i] = sqrt(dists); } // cleanup delete kdTree; delete [] dists; delete [] nnIdx; annDeallocPts(dataPts); // annClose(); is now done globally in the package // all k_dists are squared //k_dist = sqrt(k_dist); // prepare results List ret; ret["k_dist"] = k_dist; ret["ids"] = id; ret["dist"] = dist; return ret; } ================================================ FILE: src/lt.h ================================================ #ifndef LT #define LT /* LT_POS to access a lower triangle matrix by C. Buchta * modified by M. Hahsler * n ... number of rows/columns * i,j ... column and row index (starts with 1) * * LT_POS1 ... 1-based indexing * LT_POS0 ... 0-based indexing */ /* for long vectors, n, i, j need to be R_xlen_t */ #define LT_POS1(n, i, j) \ (i)==(j) ? 0 : (i)<(j) ? (n) * ((i) - 1) - (i)*((i)-1)/2 + (j)-(i) -1 \ : (n)*((j)-1) - (j)*((j)-1)/2 + (i)-(j) -1 #define LT_POS0(n, i, j) \ (i)==(j) ? 0 : (i)<(j) ? (n) * (i) - ((i) + 1)*(i)/2 + (j)-(i) -1 \ : (n)*(j) - ((j) + 1)*(j)/2 + (i)-(j) -1 /* M_POS to access matrix column-major order by i and j index (starts with 1) * n is the number of rows */ #define M_POS(n, i, j) ((i)+(n)*(j)) /* * MIN/MAX */ #define MIN(X,Y) ((X) < (Y) ? (X) : (Y)) #define MAX(X,Y) ((X) > (Y) ? 
(X) : (Y)) #endif ================================================ FILE: src/mrd.cpp ================================================ //---------------------------------------------------------------------- // R interface to dbscan using the ANN library //---------------------------------------------------------------------- // Copyright (c) 2015 Michael Hahsler, Matt Piekenbrock. All Rights Reserved. // // This software is provided under the provisions of the // GNU General Public License (GPL) Version 3 // (see: http://www.gnu.org/licenses/gpl-3.0.en.html) #include using namespace Rcpp; // Computes the mutual reachability distance defined for HDBSCAN // // The mutual reachability distance is a summary at what level two points together // will connect. The mutual reachability distance is defined as: // mrd(a, b) = max[core_distance(a), core_distance(b), distance(a, b)] // // Input: // * dm: distances as a dist object (vector) of size (n*(n-1))/2 where n // is the number of points. // Note: we divide by 2 early to stay within the number range of int. 
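The mrd() formula and the condensed lower-triangle layout addressed by LT_POS0 above can be sketched in plain C++ (a minimal standalone version without Rcpp; `lt_pos0` and `mutual_reachability` are hypothetical names used here for illustration only):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// 0-based index into a condensed lower-triangle distance vector
// (same layout as R's dist objects and the LT_POS0 macro in lt.h).
inline std::size_t lt_pos0(std::size_t n, std::size_t i, std::size_t j) {
    if (i > j) std::swap(i, j);           // the layout is symmetric
    return n * i - (i + 1) * i / 2 + j - i - 1;
}

// mrd(a, b) = max(core_distance(a), core_distance(b), distance(a, b)),
// computed over the whole condensed distance vector dm, with cd holding
// the core distance of each of the n points.
std::vector<double> mutual_reachability(const std::vector<double>& dm,
                                        const std::vector<double>& cd) {
    std::size_t n = cd.size();
    std::vector<double> res(dm.size());
    std::size_t idx = 0;                  // walks dm in pair order (i, j), i < j
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j, ++idx)
            res[idx] = std::max(dm[idx], std::max(cd[i], cd[j]));
    return res;
}
```

The double loop visits pairs in the same order the dist vector stores them, so no explicit index macro is needed inside mutual_reachability; lt_pos0 is shown for random access, as used by the MST code below.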
//  * cd: the core distances as a vector of length n
//
// Returns:
//  a vector (dist object) in the same order as dm
// [[Rcpp::export]]
NumericVector mrd(NumericVector dm, NumericVector cd) {
  R_xlen_t n = cd.length();
  if (dm.length() != (n * (n-1) / 2))
    stop("number of mutual reachability distance values and size of the distance matrix do not agree.");

  NumericVector res = NumericVector(dm.length());
  for (R_xlen_t i = 0, idx = 0; i < n; ++i) {
    // Rprintf("i = %ill of %ill, idx = %ill\n", i, n, idx);
    for (R_xlen_t j = i+1; j < n; ++j, ++idx) {
      res[idx] = std::max(dm[idx], std::max(cd[i], cd[j]));
    }
  }

  return res;
}

================================================ FILE: src/mst.cpp ================================================

//----------------------------------------------------------------------
// R interface to dbscan using the ANN library
//----------------------------------------------------------------------
// Copyright (c) 2015 Michael Hahsler, Matt Piekenbrock. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#include "mst.h"

// coreFromDist indexes through a dist vector to retrieve the core distance;
// this might be useful in some situations. For example, you can get the core distance
// from only a dist object, without needing the original data. In experimentation, the
// kNNdist approach ended up being faster than this.
//
// // [[Rcpp::export]]
// NumericVector coreFromDist(const NumericVector dist, const int n, const int minPts){
//   NumericVector core_dist = NumericVector(n);
//   NumericVector row_dist = NumericVector(n - 1);
//   for (R_xlen_t i = 0; i < n; ++i){
//     for (R_xlen_t j = 0; j < n; ++j){
//       if (i == j) continue;
//       R_xlen_t index = LT_POS0(n, j, i);
//       row_dist.at(j > i ? j - 1 : j) = dist.at(index);
//     }
//     std::sort(row_dist.begin(), row_dist.end());
//     core_dist[i] = row_dist.at(minPts-2); // one for 0-based indexes, one for inclusive minPts condition
//   }
//   return(core_dist);
// }

// Prim's Algorithm
// this implementation for dense dist objects avoids the use of a min-heap.
// [[Rcpp::export]]
Rcpp::NumericMatrix mst(const NumericVector x_dist, const R_xlen_t n) {
  Rcpp::NumericMatrix mst = NumericMatrix(n - 1, 3);
  colnames(mst) = CharacterVector::create("from", "to", "weight");

  // vectors to store the parent, the best edge weight, and the visited flag of each vertex
  std::vector<int> parent(n);
  std::vector<double> weight(n, INFINITY);
  std::vector<bool> visited(n, false);

  // first node is always the root of the MST.
  parent[0] = -1;
  weight[0] = 0;

  int next_node = 0;
  double next_weight;
  int node;
  while (next_node >= 0) {
    node = next_node;
    next_node = -1;
    next_weight = INFINITY;

    visited[node] = true;
    if (node > 0) { // the root has no incoming edge to record
      mst(node-1, 1) = parent[node] + 1;
      mst(node-1, 0) = node + 1;
      mst(node-1, 2) = weight[node];
    }

    for (int i = 1; i < n; i++) { // 0 is always the first node
      if (visited[i] || node == i) continue;

      double the_weight = x_dist[LT_POS0(n, node, i)];
      if (the_weight < weight[i]) {
        weight[i] = the_weight;
        parent[i] = node;
      }

      // find minimum weight node
      if (weight[i] < next_weight) {
        next_weight = weight[i];
        next_node = i;
      }
    }
  }
  return(mst);
}

// // [[Rcpp::export]]
// IntegerVector order_(NumericVector x) {
//   if (is_true(any(duplicated(x)))) {
//     Rf_warning("There are duplicates in 'x'; order not guaranteed to match that of R's base::order");
//   }
//   NumericVector sorted = clone(x).sort();
//   return match(sorted, x);
// }

// Single link hierarchical clustering
// used by GLOSH.R and hdbscan.R
void visit(const IntegerMatrix& merge, IntegerVector& order, int i, int j, int& ind) {
  // base case
  if (merge(i, j) < 0) {
    order.at(ind++) = -merge(i, j);
  } else {
    visit(merge, order, merge(i, j) - 1, 0, ind);
    visit(merge, order, merge(i, j) - 1, 1, ind);
  }
}

IntegerVector extractOrder(IntegerMatrix merge){
  IntegerVector order = IntegerVector(merge.nrow()+1);
  int ind = 0;
  visit(merge, order, merge.nrow() - 1, 0, ind);
  visit(merge, order, merge.nrow() - 1, 1, ind);
  return(order);
}

// [[Rcpp::export]]
List hclustMergeOrder(NumericMatrix mst, IntegerVector o){
  int npoints = mst.nrow() + 1;
  NumericVector dist = mst(_, 2);

  // Extract order, reorder indices
  NumericVector left = mst(_, 0), right = mst(_, 1);
  IntegerVector left_int = as<IntegerVector>(left[o-1]),
    right_int = as<IntegerVector>(right[o-1]);

  // Labels and resulting merge matrix
  IntegerVector labs = -seq_len(npoints);
  IntegerMatrix merge = IntegerMatrix(npoints - 1, 2);

  // Replace singletons as negative and record merge of non-singletons as positive
  for (int i = 0; i < npoints - 1; ++i) {
    int lab_left = labs.at(left_int.at(i)-1),
      lab_right = labs.at(right_int.at(i)-1);
    merge(i, _) = IntegerVector::create(lab_left, lab_right);
    for (int c = 0; c < npoints; ++c){
      if (labs.at(c) == lab_left || labs.at(c) == lab_right){
        labs.at(c) = i+1;
      }
    }
  }

  //IntegerVector int_labels = seq_len(npoints);
  List res = List::create(
    _["merge"] = merge,
    _["height"] = dist[o-1],
    _["order"] = extractOrder(merge),
    _["labels"] = R_NilValue, //as<CharacterVector>(int_labels)
    _["method"] = "robust single",
    _["dist.method"] = "mutual reachability"
  );
  res.attr("class") = "hclust";
  return res;
}

================================================ FILE: src/mst.h ================================================

#ifndef MST_H
#define MST_H

#include <Rcpp.h>
#include "lt.h"
using namespace Rcpp;

// Functions to compute the MST and build an hclust object out of the resulting tree
NumericMatrix mst(const NumericVector x_dist, const R_xlen_t n);
List hclustMergeOrder(NumericMatrix mst, IntegerVector o);

#endif

================================================ FILE: src/optics.cpp ================================================

//----------------------------------------------------------------------
// OPTICS
// File: R_optics.cpp
//----------------------------------------------------------------------
// Copyright (c) 2015 Michael
Hahsler, Matt Piekenbrock. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#include <Rcpp.h>
#include "ANN/ANN.h"
#include "regionQuery.h"

using namespace Rcpp;

void update(
    std::pair< std::vector<int>, std::vector<double> > &N,
    int p,
    std::vector<int> &seeds,
    int minPts,
    std::vector<bool> &visited,
    std::vector<int> &orderedPoints,
    std::vector<double> &reachdist,
    std::vector<double> &coredist,
    std::vector<int> &pre) {

  std::vector<int>::iterator pos_seeds;
  double newreachdist;
  int o;
  double o_d;

  while(!N.first.empty()) {
    o = N.first.back();
    o_d = N.second.back();
    N.first.pop_back();
    N.second.pop_back();

    if(visited[o]) continue;

    newreachdist = std::max(coredist[p], o_d);

    if(reachdist[o] == INFINITY) {
      reachdist[o] = newreachdist;
      seeds.push_back(o);
    } else {
      // o was not visited and has a reachability distance, so it must
      // already be in seeds!
      if(newreachdist < reachdist[o]) {
        reachdist[o] = newreachdist;
        pre[o] = p;
      }
    }
  }
}

// [[Rcpp::export]]
List optics_int(NumericMatrix data, double eps, int minPts,
    int type, int bucketSize, int splitRule, double approx, List frNN) {

  // kd-tree uses squared distances
  double eps2 = eps*eps;

  ANNpointSet* kdTree = NULL;
  ANNpointArray dataPts = NULL;
  int nrow = NA_INTEGER;
  int ncol = NA_INTEGER;

  if(frNN.size()) {
    // no kd-tree
    nrow = (as<List>(frNN["id"])).size();
  }else{
    // copy data for kd-tree
    nrow = data.nrow();
    ncol = data.ncol();
    dataPts = annAllocPts(nrow, ncol);
    for (int i = 0; i < nrow; i++){
      for (int j = 0; j < ncol; j++){
        (dataPts[i])[j] = data(i, j);
      }
    }
    //Rprintf("Points copied.\n");

    // create kd-tree (1) or linear search structure (2)
    if (type==1) kdTree = new ANNkd_tree(dataPts, nrow, ncol, bucketSize,
      (ANNsplitRule) splitRule);
    else kdTree = new ANNbruteForce(dataPts, nrow, ncol);
    //Rprintf("kd-tree ready. starting OPTICS.\n");
  }

  // OPTICS
  std::vector<bool> visited(nrow, false);
  std::vector<int> orderedPoints; orderedPoints.reserve(nrow);
  std::vector<int> pre(nrow, NA_INTEGER);
  std::vector<double> reachdist(nrow, INFINITY); // we use Inf as undefined
  std::vector<double> coredist(nrow, INFINITY);
  nn N;
  std::vector<int> seeds;
  std::vector<double> ds;

  for (int p=0; p<nrow; p++) {
    if (visited[p]) continue;

    // retrieve the neighborhood of p
    if(frNN.size())
      N = std::make_pair(
        as<std::vector<int> >(as<List>(frNN["id"])[p]),
        as<std::vector<double> >(as<List>(frNN["dist"])[p]));
    else
      N = regionQueryDist(p, dataPts, kdTree, eps2, approx);

    visited[p] = true;

    // find core distance
    if(N.second.size() >= (size_t) minPts) {
      ds = N.second;
      std::sort(ds.begin(), ds.end()); // sort increasing
      coredist[p] = ds[minPts-1];
    }

    int tmp_p = NA_INTEGER;
    if (pre[p] == NA_INTEGER) { tmp_p = p; }

    orderedPoints.push_back(p);

    if (coredist[p] == INFINITY) continue; // core-dist is undefined

    // an updateable priority queue does not exist in the C++ STL so we use a vector!
    //seeds.clear();

    // update
    update(N, p, seeds, minPts, visited, orderedPoints, reachdist, coredist, pre);

    int q;
    while (!seeds.empty()) {
      // get the smallest dist (to emulate a priority queue). All seeds should
      // already have a reachability distance.
      std::vector<int>::iterator q_it = seeds.begin();
      for (std::vector<int>::iterator it = seeds.begin(); it!=seeds.end(); ++it) {
        // Note: The second part of the if statement ensures that ties are
        // always broken consistently (higher ID wins to produce the same
        // results as the ELKI implementation)!
        if (reachdist[*it] < reachdist[*q_it] ||
            (reachdist[*it] == reachdist[*q_it] && *q_it < *it)) q_it = it;
      }
      q = *q_it;
      seeds.erase(q_it);

      //N2 = regionQueryDist(q, dataPts, kdTree, eps2, approx);
      if(frNN.size())
        N = std::make_pair(
          as<std::vector<int> >(as<List>(frNN["id"])[q]),
          as<std::vector<double> >(as<List>(frNN["dist"])[q]));
      else
        N = regionQueryDist(q, dataPts, kdTree, eps2, approx);

      visited[q] = true;

      // update core distance
      if(N.second.size() >= (size_t) minPts) {
        ds = N.second;
        std::sort(ds.begin(), ds.end());
        coredist[q] = ds[minPts - 1];
      }

      if (pre[q] == NA_INTEGER) { pre[q] = tmp_p; }

      orderedPoints.push_back(q);

      if(N.first.size() < (size_t) minPts) continue; // == q has no core dist.

      // update seeds
      update(N, q, seeds, minPts, visited, orderedPoints, reachdist, coredist, pre);
    }
  }

  // cleanup
  if (kdTree != NULL) delete kdTree;
  if (dataPts != NULL) annDeallocPts(dataPts);
  // annClose(); is now done globally in the package

  // prepare results (R index starts with 1)
  List ret;
  ret["order"] = IntegerVector(orderedPoints.begin(), orderedPoints.end()) + 1;
  ret["reachdist"] = sqrt(NumericVector(reachdist.begin(), reachdist.end()));
  ret["coredist"] = sqrt(NumericVector(coredist.begin(), coredist.end()));
  ret["predecessor"] = IntegerVector(pre.begin(), pre.end()) + 1;
  return ret;
}

================================================ FILE: src/regionQuery.cpp ================================================

//----------------------------------------------------------------------
// Region Query
// File: R_regionQuery.cpp
//----------------------------------------------------------------------
// Copyright (c) 2015 Michael Hahsler. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#include "regionQuery.h"

using namespace Rcpp;

// Note: Region query returns self-matches!
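The seed-selection scan in optics.cpp above emulates an updateable priority queue with a plain vector: it searches for the minimum reachability distance and breaks ties toward the higher point ID (to match ELKI). A standalone sketch of just that step (`next_seed` is a hypothetical helper for illustration, not part of the package):

```cpp
#include <vector>

// Pick the next seed: smallest reachability distance; on ties the higher
// point ID wins (mirrors the tie-breaking comment in optics.cpp).
// Returns the position within `seeds`, or -1 if seeds is empty.
int next_seed(const std::vector<int>& seeds,
              const std::vector<double>& reachdist) {
    int best = -1;
    for (int i = 0; i < (int) seeds.size(); ++i) {
        if (best < 0 ||
            reachdist[seeds[i]] < reachdist[seeds[best]] ||
            (reachdist[seeds[i]] == reachdist[seeds[best]] &&
             seeds[i] > seeds[best]))
            best = i;
    }
    return best;
}
```

The linear scan makes each extraction O(|seeds|); a binary heap would be faster asymptotically but cannot cheaply decrease a key, which is why the original code sticks with a vector.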
// these functions take an id for a point in the k-d tree
nn regionQueryDist(int id, ANNpointArray dataPts, ANNpointSet* kdTree,
    double eps2, double approx) {

  // find fixed radius nearest neighbors
  ANNpoint queryPt = dataPts[id];
  std::pair< std::vector<int>, std::vector<double> > ret =
    kdTree->annkFRSearch2(queryPt, eps2, approx);

  // Note: the points are not sorted by distance!
  return(ret);
}

std::vector<int> regionQuery(int id, ANNpointArray dataPts, ANNpointSet* kdTree,
    double eps2, double approx) {

  // find fixed radius nearest neighbors
  ANNpoint queryPt = dataPts[id];
  std::pair< std::vector<int>, std::vector<double> > ret =
    kdTree->annkFRSearch2(queryPt, eps2, approx);

  // Note: the points are not sorted by distance!
  return(ret.first);
}

// these functions take a query point that is not in the tree
nn regionQueryDist_point(ANNpoint queryPt, ANNpointArray dataPts,
    ANNpointSet* kdTree, double eps2, double approx) {

  // find fixed radius nearest neighbors
  std::pair< std::vector<int>, std::vector<double> > ret =
    kdTree->annkFRSearch2(queryPt, eps2, approx);

  // Note: the points are not sorted by distance!
  return(ret);
}

std::vector<int> regionQuery_point(ANNpoint queryPt, ANNpointArray dataPts,
    ANNpointSet* kdTree, double eps2, double approx) {

  // find fixed radius nearest neighbors
  std::pair< std::vector<int>, std::vector<double> > ret =
    kdTree->annkFRSearch2(queryPt, eps2, approx);

  // Note: the points are not sorted by distance!
  return(ret.first);
}

================================================ FILE: src/regionQuery.h ================================================

//----------------------------------------------------------------------
// Region Query
// File: R_regionQuery.h
//----------------------------------------------------------------------
// Copyright (c) 2015 Michael Hahsler. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#ifndef REGIONQUERY_H
#define REGIONQUERY_H

#include <Rcpp.h>
#include "ANN/ANN.h"
using namespace Rcpp;

// pair of ids and dists
typedef std::pair< std::vector<int>, std::vector<double> > nn;

// Note: Region query returns self-matches!

// these functions take an id for a point in the k-d tree
nn regionQueryDist(int id, ANNpointArray dataPts, ANNpointSet* kdTree,
    double eps2, double approx = 0.0);
std::vector<int> regionQuery(int id, ANNpointArray dataPts, ANNpointSet* kdTree,
    double eps2, double approx = 0.0);

// these functions take a query point that is not in the tree
nn regionQueryDist_point(ANNpoint queryPt, ANNpointArray dataPts,
    ANNpointSet* kdTree, double eps2, double approx = 0.0);
std::vector<int> regionQuery_point(ANNpoint queryPt, ANNpointArray dataPts,
    ANNpointSet* kdTree, double eps2, double approx = 0.0);

#endif

================================================ FILE: src/utilities.cpp ================================================

//----------------------------------------------------------------------
// R interface to dbscan using the ANN library
//----------------------------------------------------------------------
// Copyright (c) 2015 Michael Hahsler. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#include "utilities.h"

// extract the lower triangle from a matrix
IntegerVector lowerTri(IntegerMatrix m) {
  int n = m.nrow();
  IntegerVector lower_tri = IntegerVector(n * (n - 1) / 2);
  for (int i = 0, c = 0; i < n; ++i) {
    for (int j = i + 1; j < n; ++j) {
      lower_tri[c++] = m(i, j);
    }
  }
  return lower_tri;
}

NumericVector combine(const NumericVector& t1, const NumericVector& t2) {
  std::size_t n = t1.size() + t2.size();
  NumericVector output = Rcpp::no_init(n);
  std::copy(t1.begin(), t1.end(), output.begin());
  std::copy(t2.begin(), t2.end(), output.begin() + t1.size());
  return output;
}

IntegerVector combine(const IntegerVector& t1, const IntegerVector& t2) {
  std::size_t n = t1.size() + t2.size();
  IntegerVector output = Rcpp::no_init(n);
  std::copy(t1.begin(), t1.end(), output.begin());
  std::copy(t2.begin(), t2.end(), output.begin() + t1.size());
  return output;
}

// Faster version of the above combine function, assuming you can precompute and store
// the containers that need to be concatenated
IntegerVector concat_int(List const& container) {
  int total_length = 0;
  for (List::const_iterator it = container.begin(); it != container.end(); ++it) {
    total_length += as<IntegerVector>(*it).size();
  }
  int pos = 0;
  IntegerVector output = Rcpp::no_init(total_length);
  for (List::const_iterator it = container.begin(); it != container.end(); ++it) {
    IntegerVector vec = as<IntegerVector>(*it);
    std::copy(vec.begin(), vec.end(), output.begin() + pos);
    pos += vec.size();
  }
  return output;
}

================================================ FILE: src/utilities.h ================================================

//----------------------------------------------------------------------
// R interface to dbscan using the ANN library
//----------------------------------------------------------------------
// Copyright (c) 2015 Michael Hahsler. All Rights Reserved.
//
// This software is provided under the provisions of the
// GNU General Public License (GPL) Version 3
// (see: http://www.gnu.org/licenses/gpl-3.0.en.html)

#ifndef UTILITIES_H
#define UTILITIES_H

#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

// contains; used in hdbscan.cpp
template <class T, class C>
bool contains(const T& container, const C& key) {
  if (std::find(container.begin(), container.end(), key) != container.end()) {
    return true;
  }
  return false;
}

// extract the lower triangle from a matrix
// [[Rcpp::export]]
IntegerVector lowerTri(IntegerMatrix m);

// internal c (combine) for Rcpp vectors
NumericVector combine(const NumericVector& t1, const NumericVector& t2);
IntegerVector combine(const IntegerVector& t1, const IntegerVector& t2);

// Faster version of the above combine function, assuming you can precompute and store
// the containers that need to be concatenated
IntegerVector concat_int(List const& container);

#endif

================================================ FILE: tests/testthat/test-dbcv.R ================================================

test_that("dbcv", {
  # From: https://github.com/FelSiq/DBCV
  #
  # Dataset        MATLAB
  # dataset_1.txt  0.8576
  # dataset_2.txt  0.8103
  # dataset_3.txt  0.6319
  # dataset_4.txt  0.8688
  #
  # Original MATLAB implementation is at:
  # https://github.com/pajaskowiak/dbcv/tree/main/data

  data(Dataset_1)
  x <- Dataset_1[, c("x", "y")]
  class <- Dataset_1$class
  #clplot(x, class)

  (db <- dbcv(x, class, metric = "sqeuclidean"))
  expect_equal(round(db$score, 2), 0.86)

  # detailed results from the Python implementation
  #dsc [0.00457826 0.00457826 0.0183068  0.0183068 ]
  #dspc [0.85627898 0.85627898 0.85627898 0.85627898]
  #vcs [0.99465331 0.99465331 0.97862052 0.97862052]
  #0.8575741400490697

  data(Dataset_2)
  x <- Dataset_2[, c("x", "y")]
  class <- Dataset_2$class
  #clplot(x, class)

  (db <- dbcv(x, class, metric = "sqeuclidean"))
  expect_equal(round(db$score, 2), 0.81)

  #dsc [19.06151967 15.6082 83.71522964 68.969]
  #dspc [860.2538 501.4376 501.4376 860.2538]
  #vcs [0.97784198 0.9688731
0.83304956 0.91982715] #0.8103343589093096 # more data sets # data(Dataset_3) # x <- Dataset_3[, c("x", "y")] # class <- Dataset_3$class # #clplot(x, class) # (db <- dbcv(x, class, metric = "sqeuclidean")) # # data(Dataset_4) # x <- Dataset_4[, c("x", "y")] # class <- Dataset_4$class # #clplot(x, class) # (db <- dbcv(x, class, metric = "sqeuclidean")) }) ================================================ FILE: tests/testthat/test-dbscan.R ================================================ test_that("dbscan works", { data("iris") ## Species is a factor expect_error(dbscan(iris)) iris <- as.matrix(iris[, 1:4]) res <- dbscan(iris, eps = .4, minPts = 4) expect_length(res$cluster, nrow(iris)) ## expected result of table(res$cluster) is: expect_identical(table(res$cluster, dnn = NULL), as.table(c("0" = 25L, "1" = 47L, "2" = 38L, "3" = 36L, "4" = 4L))) ## compare with dbscan from package fpc (only if installed) if (requireNamespace("fpc", quietly = TRUE)) { res2 <- fpc::dbscan(iris, eps = .4, MinPts = 4) expect_equal(res$cluster, res2$cluster) ## test is.corepoint all(res2$isseed == is.corepoint(iris, eps = .4, minPts = 4)) } ## compare with precomputed frNN fr <- frNN(iris, eps = .4) res9 <- dbscan(fr, minPts = 4) expect_equal(res, res9) ## compare on example data from fpc set.seed(665544) n <- 600 x <- cbind( x = runif(10, 0, 10) + rnorm(n, sd = 0.2), y = runif(10, 0, 10) + rnorm(n, sd = 0.2) ) res <- dbscan(x, eps = .2, minPts = 4) expect_length(res$cluster, nrow(x)) ## compare with dist-based versions res_d <- dbscan(dist(x), eps = .2, minPts = 4) expect_identical(res, res_d) res_d2 <- dbscan(x, eps = .2, minPts = 4, search = "dist") expect_identical(res, res_d2) ## compare with dbscan from package fpc (only if installed) if (requireNamespace("fpc", quietly = TRUE)) { res2 <- fpc::dbscan(x, eps = .2, MinPts = 4) expect_equal(res$cluster, res2$cluster) } ## missing values, but distances are fine x_na <- x x_na[c(1, 3, 5), 1] <- NA expect_error(dbscan(x_na, eps = .2, minPts 
= 4), regexp = "NA") res_d1 <- dbscan(x_na, eps = .2, minPts = 4, search = "dist") res_d2 <- dbscan(dist(x_na), eps = .2, minPts = 4) expect_identical(res_d1, res_d2) ## introduce NAs into dist x_na[c(1,3,5), 2] <- NA expect_error(dbscan(x_na, eps = .2, minPts = 4), regexp = "NA") expect_error(dbscan(x_na, eps = .2, minPts = 4, search = "dist"), regexp = "NA") expect_error(dbscan(dist(x_na), eps = .2, minPts = 4), regexp = "NA") ## call with no rows or no columns expect_error(dbscan(matrix(0, nrow = 0, ncol = 2), eps = .2, minPts = 4)) expect_error(dbscan(matrix(0, nrow = 2, ncol = 0), eps = .2, minPts = 4)) dbscan(matrix(0, nrow = 1, ncol = 1), eps = .2, minPts = 4) }) ================================================ FILE: tests/testthat/test-fosc.R ================================================ test_that("FOSC", { data("iris") ## FOSC expects an hclust object expect_error(extractFOSC(iris)) x <- iris[, 1:4] x_sl <- hclust(dist(x), "single") ## Should return augmented hclust object and cluster assignments expect_length(extractFOSC(x_sl), 2) res <- extractFOSC(x_sl) expect_identical(res$hc$method, "single (w/ stability-based extraction)") ## Constraint-checking expect_error(extractFOSC(x_sl, constraints = c("1" = 2))) ## Matrix inputs must be nxn expect_error(extractFOSC(x_sl, constraints = matrix(c(1, 2), nrow=1))) ## Matrix or vector constraints must be in c(-1, 0, 1) expect_error(extractFOSC(x_sl, constraints = matrix(-2, nrow=nrow(x), ncol=nrow(x)))) ## Valid constraints expect_warning(extractFOSC(x_sl, constraints = matrix(1, nrow=nrow(x), ncol=nrow(x)))) expect_silent(extractFOSC(x_sl, constraints = list("1" = 2, "2" = 1))) expect_silent(extractFOSC(x_sl, constraints = ifelse(dist(x) > 2, -1, 1))) ## Constraints should be symmetric, but symmetry test is only done if specified. 
Asymmetric ## constraints throw a warning, but the function proceeds with a manual warning expect_warning(extractFOSC(x_sl, constraints = list("1" = 2), validate_constraints = TRUE)) ## Make sure that's what's returned res <- extractFOSC(x_sl) expect_type(res$cluster, "integer") expect_s3_class(res$hc, "hclust") ## Test 'Optimal' Clustering using only positive constraints set <- which(iris$Species == "setosa") ver <- which(iris$Species == "versicolor") vir <- which(iris$Species == "virginica") il_constraints <- structure(list(set[-1], ver[-1], vir[-1]), names = as.character(c(set[1], ver[1], vir[1]))) res <- extractFOSC(x_sl, il_constraints) ## Positive-only constraints should link to best unsupervised solution expect_identical(table(res$cluster, dnn = NULL), as.table(c(`1` = 50L, `2` = 100L))) expect_identical(res$hc$method, "single (w/ constraint-based extraction)") ## Test negative constraints set2 <- c(il_constraints[[as.character(set[1])]], -unlist(il_constraints[as.character(c(ver[1], vir[1]))], use.names = FALSE)) ver2 <- c(il_constraints[[as.character(ver[1])]], -unlist(il_constraints[as.character(c(set[1], vir[1]))], use.names = FALSE)) vir2 <- c(il_constraints[[as.character(vir[1])]], -unlist(il_constraints[as.character(c(set[1], ver[1]))], use.names = FALSE)) il_constraints2 <- structure(list(set2, ver2, vir2), names = as.character(c(set[1], ver[1], vir[1]))) res2 <- extractFOSC(x_sl, constraints = il_constraints2) ## Positive and Negative should produce a different solution expect_false(all(res$cluster == res2$cluster)) expect_identical(res2$hc$method, "single (w/ constraint-based extraction)") ## Test minPts parameters expect_error(extractFOSC(x_sl, constraints = il_constraints2, minPts = 1)) expect_silent(extractFOSC(x_sl, constraints = il_constraints2, minPts = 5)) ## Test alpha parameter expect_silent(extractFOSC(x_sl, constraints = il_constraints2, alpha = 0.5)) expect_error(extractFOSC(x_sl, constraints = il_constraints2, alpha = 1.5)) res3 <- extractFOSC(x_sl,
constraints = il_constraints2, alpha = 0.5) expect_identical(res3$hc$method, "single (w/ mixed-objective extraction)") ## Test unstable pruning expect_silent(extractFOSC(x_sl, constraints = il_constraints2, prune_unstable = TRUE)) }) ================================================ FILE: tests/testthat/test-frNN.R ================================================ test_that("frNN", { set.seed(665544) n <- 1000 x <- cbind( x = runif(10, 0, 10) + rnorm(n, sd = 0.2), y = runif(10, 0, 10) + rnorm(n, sd = 0.2), z = runif(10, 0, 10) + rnorm(n, sd = 0.2) ) ## no duplicates first! #x <- x[!duplicated(x),] rownames(x) <- paste0("Object_", seq_len(nrow(x))) eps <- .5 nn <- frNN(x, eps = eps, sort = TRUE) ## check dimensions expect_identical(nn$eps, eps) expect_length(nn$dist, nrow(x)) expect_length(nn$id, nrow(x)) expect_identical(lengths(nn$dist), lengths(nn$id)) ## check visually #plot(x) #points(x[nn$id[[1]],], col="red", lwd=5) #points(x[nn$id[[2]],], col="green", lwd=5) #points(x[1:2,, drop = FALSE], col="blue", pch="+", cex=2) ## compare with manually found NNs nn_d <- frNN(dist(x), eps = eps, sort = TRUE) expect_equal(nn, nn_d) nn_d2 <- frNN(x, eps = eps, sort = TRUE, search = "dist") expect_equal(nn, nn_d2) ## without sorting nn2 <- frNN(x, eps = eps, sort = FALSE) expect_identical(lapply(nn$id, sort), lapply(nn2$id, sort)) ## search options nn_linear <- frNN(x, eps=eps, search = "linear") expect_equal(nn, nn_linear) ## split options for (so in c("STD", "MIDPT", "FAIR", "SL_FAIR")) { nn3 <- frNN(x, eps=eps, splitRule = so) expect_equal(nn, nn3) } ## bucket size for (bs in c(5, 10, 15, 100)) { nn3 <- frNN(x, eps=eps, bucketSize = bs) expect_equal(nn, nn3) } ## add 100 copied points to check if self match filtering works x <- rbind(x, x[sample(seq_len(nrow(x)), 100),]) rownames(x) <- paste0("Object_", seq_len(nrow(x))) eps <- .5 nn <- frNN(x, eps = eps, sort = TRUE) ## compare with manually found NNs nn_d <- frNN(x, eps = eps, sort = TRUE, search = "dist") 
expect_equal(nn, nn_d) ## sort and frNN to reduce eps nn5 <- frNN(x, eps = .5, sort = FALSE) expect_false(nn5$sort) nn5s <- sort(nn5) expect_true(nn5s$sort) expect_true(all(vapply(nn5s$dist, function(x) !is.unsorted(x), logical(1L)))) expect_error(frNN(nn5, eps = 1)) nn2 <- frNN(nn5, eps = .2) expect_true(all(vapply(nn2$dist, function(x) all(x <= 0.2), logical(1L)))) ## test with simple data x <- data.frame(x = 1:10, row.names = LETTERS[1:10], check.names = FALSE) nn <- frNN(x, eps = 2) expect_identical(nn$id[[1]], 2:3) expect_identical(nn$id[[5]], c(4L, 6L, 3L, 7L)) expect_identical(nn$id[[10]], 9:8) ## test kNN with query x <- data.frame(x = 1:10, row.names = LETTERS[1:10], check.names = FALSE) nn <- frNN(x[1:8, , drop=FALSE], x[9:10, , drop = FALSE], eps = 2) expect_length(nn$id, 2L) expect_identical(nn$id[[1]], 8:7) expect_identical(nn$id[[2]], 8L) expect_error(frNN(dist(x[1:8, , drop=FALSE]), x[9:10, , drop = FALSE], eps = 2)) }) ================================================ FILE: tests/testthat/test-hdbscan.R ================================================ test_that("HDBSCAN", { data("iris") ## minPts not given expect_error(hdbscan(iris)) ## Expects numerical data; species is factor expect_error(dbscan(iris, minPts = 4)) iris <- as.matrix(iris[,1:4]) res <- hdbscan(iris, minPts = 4) expect_length(res$cluster, nrow(iris)) ## expected result of table(res$cluster) is: expect_identical(table(res$cluster, dnn = NULL), as.table(c("1" = 100L, "2" = 50L))) ## compare on moons data data("moons") res <- hdbscan(moons, minPts = 5) expect_length(res$cluster, nrow(moons)) ## Check hierarchy matches dbscan* at every value check <- rep(FALSE, nrow(moons)-1) core_dist <- kNNdist(moons, k=5-1) ## cutree doesn't distinguish noise as 0, so we make a new method to do it manually cut_tree <- function(hcl, eps, core_dist){ cuts <- unname(cutree(hcl, h=eps)) cuts[which(core_dist > eps)] <- 0 # Use core distance to distinguish noise cuts } eps_values <- sort(res$hc$height, 
decreasing = TRUE)+.Machine$double.eps ## Machine eps for consistency between cuts for (i in seq_along(eps_values)) { cut_cl <- cut_tree(res$hc, eps_values[i], core_dist) dbscan_cl <- dbscan(moons, eps = eps_values[i], minPts = 5, borderPoints = FALSE) # DBSCAN* doesn't include border points ## Use run length encoding as an ID-independent way to check ordering check[i] <- (all.equal(rle(cut_cl)$lengths, rle(dbscan_cl$cluster)$lengths) == "TRUE") } expect_true(all(check)) ## Expect generating extra trees doesn't fail res <- hdbscan(moons, minPts = 5, gen_hdbscan_tree = TRUE, gen_simplified_tree = TRUE) expect_s3_class(res, "hdbscan") ## Expect hdbscan tree matches stats:::as.dendrogram version of hclust object hc_dend <- as.dendrogram(res$hc) expect_s3_class(hc_dend, "dendrogram") expect_identical(hc_dend, res$hdbscan_tree) ## Expect hdbscan works with non-euclidean distances dist_moons <- dist(moons, method = "canberra") res <- hdbscan(dist_moons, minPts = 5) expect_s3_class(res, "hdbscan") }) test_that("mrdist", { expect_identical(mrdist(cbind(1:10), 2), mrdist(dist(cbind(1:10)), 2)) expect_identical(mrdist(cbind(1:11), 3), mrdist(dist(cbind(1:11)), 3)) }) test_that("HDBSCAN(e)", { X <- data.frame( x = c( 0.08, 0.46, 0.46, 2.95, 3.50, 1.49, 6.89, 6.87, 0.21, 0.15, 0.15, 0.39, 0.80, 0.80, 0.37, 3.63, 0.35, 0.30, 0.64, 0.59, 1.20, 1.22, 1.42, 0.95, 2.70, 6.36, 6.36, 6.36, 6.60, 0.04, 0.71, 0.57, 0.24, 0.24, 0.04, 0.04, 1.35, 0.82, 1.04, 0.62, 0.26, 5.98, 1.67, 1.67, 0.48, 0.15, 6.67, 6.67, 1.20, 0.21, 3.99, 0.12, 0.19, 0.15, 6.96, 0.26, 0.08, 0.30, 1.04, 1.04, 1.04, 0.62, 0.04, 0.04, 0.04, 0.82, 0.82, 1.29, 1.35, 0.46, 0.46, 0.04, 0.04, 5.98, 5.98, 6.87, 0.37, 6.47, 6.47, 6.47, 6.67, 0.30, 1.49, 3.21, 3.21, 0.75, 0.75, 0.46, 0.46, 0.46, 0.46, 3.63, 0.39, 3.65, 4.09, 4.01, 3.36, 1.43, 3.28, 5.94, 6.35, 6.87, 5.60, 5.99, 0.12, 0.00, 0.32, 0.39, 0.00, 1.63, 1.36, 5.67, 5.60, 5.79, 1.10, 2.99, 0.39, 0.18 ), y = c( 7.41, 8.01, 8.01, 5.44, 7.11, 7.13, 1.83, 1.83, 8.22, 
8.08, 8.08, 7.20, 7.83, 7.83, 8.29, 5.99, 8.32, 8.22, 7.38, 7.69, 8.22, 7.31, 8.25, 8.39, 6.34, 0.16, 0.16, 0.16, 1.66, 7.55, 7.90, 8.18, 8.32, 8.32, 7.97, 7.97, 8.15, 8.43, 7.83, 8.32, 8.29, 1.03, 7.27, 7.27, 8.08, 7.27, 0.79, 0.79, 8.22, 7.73, 6.62, 7.62, 8.39, 8.36, 1.73, 8.29, 8.04, 8.22, 7.83, 7.83, 7.83, 8.32, 8.11, 7.69, 7.55, 7.20, 7.20, 8.01, 8.15, 7.55, 7.55, 7.97, 7.97, 1.03, 1.03, 1.24, 7.20, 0.47, 0.47, 0.47, 0.79, 8.22, 7.13, 6.48, 6.48, 7.10, 7.10, 8.01, 8.01, 8.01, 8.01, 5.99, 8.04, 5.22, 5.82, 5.14, 4.81, 7.62, 5.73, 0.55, 1.31, 0.05, 0.95, 1.59, 7.99, 7.48, 8.38, 7.12, 2.01, 1.40, 0.00, 9.69, 9.47, 9.25, 2.63, 6.89, 0.56, 3.11 ) ) hdbe <- hdbscan(X, minPts = 3, cluster_selection_epsilon = 1) #plot(X, col = hdbe$cluster + 1L, main = "HDBSCAN(e)") expect_equal(ncluster(hdbe), 5L) expect_equal(nnoise(hdbe), 0L) }) ================================================ FILE: tests/testthat/test-kNN.R ================================================ test_that("kNN", { set.seed(665544) n <- 1000 x <- cbind( x = runif(10, 0, 10) + rnorm(n, sd = 0.2), y = runif(10, 0, 10) + rnorm(n, sd = 0.2), z = runif(10, 0, 10) + rnorm(n, sd = 0.2) ) ## no duplicates first! 
All distances should be unique x <- x[!duplicated(x),] rownames(x) <- paste0("Object_", seq_len(nrow(x))) k <- 5L nn <- kNN(x, k=k, sort = TRUE) ## check dimensions expect_identical(nn$k, k) expect_identical(dim(nn$dist), c(nrow(x), k)) expect_identical(dim(nn$id), c(nrow(x), k)) ## check visually #plot(x) #points(x[nn$id[1,],], col="red", lwd=5) #points(x[nn$id[2,],], col="green", lwd=5) ## compare with kNN found using distances nn_d <- kNN(dist(x), k, sort = TRUE) ## check visually #plot(x) #points(x[nn_d$id[1,],], col="red", lwd=5) #points(x[nn_d$id[2,],], col="green", lwd=5) ### will agree since we use sorting expect_equal(nn, nn_d) ## calculate dist internally nn_d2 <- kNN(x, k, search = "dist", sort = TRUE) expect_equal(nn, nn_d2) ## without sorting nn2 <- kNN(x, k=k, sort = FALSE) expect_equal(t(apply(nn$id, MARGIN = 1, sort)), t(apply(nn2$id, MARGIN = 1, sort))) ## search options nn_linear <- kNN(x, k=k, search = "linear", sort = TRUE) expect_equal(nn, nn_linear) ## split options for(so in c("STD", "MIDPT", "FAIR", "SL_FAIR")) { nn3 <- kNN(x, k=k, splitRule = so, sort = TRUE) expect_equal(nn, nn3) } ## bucket size for (bs in c(5, 10, 15, 100)) { nn3 <- kNN(x, k=k, bucketSize = bs, sort = TRUE) expect_equal(nn, nn3) } ## the order is not stable with matching distances which means that the ## k-NN are not stable. We add 100 copied points to check if self match ## filtering and sort works x <- rbind(x, x[sample(seq_len(nrow(x)), 100),]) rownames(x) <- paste0("Object_", seq_len(nrow(x))) k <- 5L nn <- kNN(x, k=k, sort = TRUE) ## compare with manually found NNs nn_d <- kNN(x, k=k, search = "dist", sort = TRUE) expect_equal(nn$dist, nn_d$dist) ## This is expected to fail: because the ids are not stable for matching distances ## expect_equal(nn$id, nn_d$id) ## FIXME: write some code to check this! 
## missing values, but distances are fine x_na <- x x_na[c(1, 3, 5), 1] <- NA expect_error(kNN(x_na, k = 3), regexp = "NA") res_d1 <- kNN(x_na, k = 3, search = "dist") res_d2 <- kNN(dist(x_na), k = 3) expect_equal(res_d1, res_d2) ## introduce NAs into dist x_na[c(1, 3, 5),] <- NA expect_error(kNN(x_na, k = 3), regexp = "NA") expect_error(kNN(x_na, k = 3, search = "dist"), regexp = "NA") expect_error(kNN(dist(x_na), k = 3), regexp = "NA") ## inf x_inf <- x x_inf[c(1, 3, 5), 2] <- Inf kNN(x_inf, k = 3) kNN(x_inf, k = 3, search = "dist") kNN(dist(x_inf), k = 3) ## sort and kNN to reduce k nn10 <- kNN(x, k = 10) #nn10 <- kNN(x, k = 10, sort = FALSE) ## knn now returns sorted lists #expect_equal(nn10$sort, FALSE) expect_error(kNN(nn10, k = 11)) nn5 <- kNN(nn10, k = 5) expect_true(nn5$sort) expect_identical(ncol(nn5$id), 5L) expect_identical(ncol(nn5$dist), 5L) ## test with simple data x <- data.frame(x = 1:10, row.names = LETTERS[1:10], check.names = FALSE) nn <- kNN(x, k = 5) expect_identical(unname(nn$id[1, ]), 2:6) expect_identical(unname(nn$id[5, ]), c(4L, 6L, 3L, 7L, 2L)) expect_identical(unname(nn$id[10, ]), 9:5) ## test kNN with query x <- data.frame(x = 1:10, row.names = LETTERS[1:10], check.names = FALSE) nn <- kNN(x[1:8, , drop=FALSE], x[9:10, , drop = FALSE], k = 5) expect_identical(nrow(nn$id), 2L) expect_identical(unname(nn$id[1, ]), 8:4) expect_identical(unname(nn$id[2, ]), 8:4) expect_error(kNN(dist(x[1:8, , drop=FALSE]), x[9:10, , drop = FALSE], k = 5)) }) ================================================ FILE: tests/testthat/test-kNNdist.R ================================================ test_that("kNNdist", { set.seed(665544) n <- 1000 x <- cbind( x = runif(10, 0, 10) + rnorm(n, sd = 0.2), y = runif(10, 0, 10) + rnorm(n, sd = 0.2), z = runif(10, 0, 10) + rnorm(n, sd = 0.2) ) d <- kNNdist(x, k = 5) expect_length(d, n) d <- kNNdist(x, k = 5, all = TRUE) expect_equal(dim(d), c(n, 5)) # does the plot work? 
#kNNdistplot(x, 5) }) ================================================ FILE: tests/testthat/test-lof.R ================================================ test_that("LOF", { set.seed(665544) n <- 600 x <- cbind( x=runif(10, 0, 5) + rnorm(n, sd=0.4), y=runif(10, 0, 5) + rnorm(n, sd=0.4) ) ### calculate LOF score system.time(lof_kd <- lof(x, minPts = 5)) expect_length(lof_kd, nrow(x)) system.time(lof_d <- lof(dist(x), minPts = 5)) #expect_equal(lof_kd, lof_d) ## compare with lofactor from DMwR (k = minPts - 1) #if(requireNamespace("DMwR", quietly = TRUE)) { # system.time(lof_DMwr <- DMwR::lofactor(x, k = 4)) # DMwR is now retired so we have the correct values here # dput(round(lof_DMwr, 7)) lof_DMwr <- c(1.0386817, 1.0725475, 1.1440822, 0.9448794, 1.1387918, 2.285202, 1.0976862, 1.071325, 0.975922, 0.9549399, 1.0918247, 0.9868736, 1.123618, 2.2802129, 0.992019, 1.046492, 1.0729966, 1.6925297, 1.0032157, 0.9691323, 1.0561082, 0.9493052, 1.0209116, 0.8897277, 1.008681, 1.0711202, 1.053845, 0.9734241, 1.1147289, 0.9351913, 1.8674401, 1.097982, 0.9782695, 1.0613472, 0.9988367, 1.4571062, 0.9927837, 0.9443716, 1.0014804, 1.0322888, 0.9264795, 0.9509729, 0.9757305, 1.0647956, 1.0184634, 1.428911, 1.0166712, 0.9692196, 1.0821285, 1.1282936, 0.9874694, 1.1079347, 0.9906487, 0.9972962, 1.0594364, 0.9160978, 1.2393862, 1.3578505, 0.930095, 1.0489962, 1.1401282, 1.1808566, 1.0380796, 2.0657157, 0.9837392, 0.9712287, 1.4754447, 1.3154291, 1.0589814, 1.0486608, 1.0986178, 1.1375705, 1.0147473, 1.7615974, 0.9724805, 0.9719851, 0.982247, 1.0591561, 1.0862436, 1.0710844, 1.11301, 0.9719126, 1.0455651, 0.9426225, 1.0934785, 1.1223749, 1.1734774, 1.0037237, 0.8844162, 0.9131705, 1.0728687, 1.0446755, 1.108353, 0.9492501, 1.1704727, 1.1914106, 0.9453222, 1.1724001, 1.1827576, 0.9617445, 1.1519398, 1.1480532, 1.0268692, 1.0580088, 1.392551, 1.2571354, 0.9703385, 1.5030845, 1.0201881, 1.0061842, 0.9919245, 1.2771078, 1.0473407, 1.263149, 0.9587146, 1.0235194, 0.988292, 0.9302287, 1.0593181, 
0.978052, 1.1026427, 1.0615622, 1.0299466, 1.2200394, 1.0720229, 1.1343499, 1.0180289, 1.4500258, 0.9886391, 0.969401, 1.4881191, 1.0775279, 1.0380796, 1.2315327, 1.0307432, 0.9615078, 1.2379828, 1.1181202, 1.1049541, 1.0786524, 0.9197587, 1.0642223, 0.8073981, 0.9251505, 0.9971381, 1.5188771, 1.0679818, 0.9943418, 3.5343815, 0.9559526, 1.2129819, 1.0067672, 1.0175442, 1.0875222, 1.0403766, 2.0998678, 0.9870077, 1.327542, 1.0081014, 0.9608997, 0.9144311, 1.0016777, 1.0465469, 1.5140562, 1.5560253, 1.1125134, 1.0310594, 1.0245521, 1.7247798, 1.0586581, 1.0720232, 1.0594747, 0.956174, 1.0540952, 1.0889792, 1.050014, 1.0216425, 0.9509729, 0.9740812, 1.3065791, 1.0004211, 1.0127932, 0.9796374, 1.0552426, 1.0302613, 0.9524017, 0.9554341, 0.9870971, 0.9857225, 0.9699046, 1.1122461, 1.031985, 1.0852427, 1.0585017, 0.9733342, 0.9610561, 0.9086219, 1.1570747, 1.069232, 0.9747538, 1.0084392, 1.1063077, 0.9573789, 1.3672764, 1.3631144, 0.966934, 1.0992401, 0.9943351, 0.9850424, 1.0019623, 1.5344698, 0.9592966, 0.9645661, 1.0076189, 1.0056102, 1.0066028, 1.0148453, 1.0096178, 1.0963682, 1.0345623, 1.0121158, 1.0816582, 1.0068326, 0.9697611, 0.9322887, 1.1414811, 1.0266256, 0.9143263, 0.9602328, 1.1100272, 1.0885216, 1.0795966, 1.1165265, 1.1712866, 1.1478981, 0.9653769, 1.0419996, 1.0245088, 1.0619264, 1.1729143, 0.9756447, 0.9935498, 2.8554242, 1.0067806, 1.1311249, 1.36881, 1.8759446, 1.2136268, 1.2112035, 0.9891436, 1.1089825, 0.9937973, 0.9730926, 1.0287588, 1.1275406, 1.5135599, 1.0322888, 1.0746697, 1.0181387, 1.2715467, 0.9196022, 1.1063077, 1.0666201, 1.121323, 1.0850662, 0.9150997, 1.428667, 0.9488952, 1.1007532, 1.2246563, 0.9933742, 1.1263888, 0.985569, 1.0275125, 1.01964, 1.0449989, 0.9767297, 0.9704362, 0.9897834, 1.0246062, 1.0947694, 1.2170169, 1.1323645, 1.2366689, 0.9516316, 1.2727108, 1.0480459, 1.0338822, 1.1418884, 1.0733666, 1.0230934, 0.9149864, 0.9480381, 1.0388333, 1.1266161, 0.9615078, 1.1221968, 0.9750836, 0.978132, 1.1412698, 0.9716957, 1.0675609, 
1.2594503, 1.0633289, 1.1427586, 1.0709402, 1.0393154, 1.3284915, 0.9598698, 1.1755224, 1.2392279, 1.0625965, 1.133851, 1.1631179, 1.4499444, 1.20366, 0.9606104, 0.9921343, 0.8938437, 1.1738624, 1.0131062, 1.0027174, 0.9461069, 0.9717685, 1.0645426, 1.046492, 1.1502628, 0.999057, 0.9758641, 1.1654844, 0.9964193, 1.1066967, 1.1900241, 1.0727625, 1.1304909, 1.0892065, 0.963785, 1.2942228, 1.0619264, 1.2733898, 0.9840458, 1.109005, 1.0437884, 1.0298398, 0.9513221, 1.0823791, 1.0056102, 0.8875967, 1.1385844, 0.8947159, 1.229025, 2.0563263, 0.9387754, 0.9683886, 1.2059569, 0.9923111, 1.4218394, 1.043666, 0.9963639, 1.0610107, 1.0049425, 0.9844978, 1.0292947, 0.9768325, 1.0528094, 1.0155664, 1.1586381, 1.0432875, 1.0382743, 0.9793557, 1.1206471, 0.985182, 1.1138052, 1.3397872, 1.0062782, 0.9474922, 1.2033802, 1.0889565, 0.9172793, 0.9749791, 0.9912765, 1.2617741, 0.9875289, 0.9231973, 1.1543416, 1.084554, 0.9805775, 0.9976991, 1.0076805, 1.0267488, 0.9919245, 1.0627179, 0.9760528, 1.14714, 0.947823, 1.0574966, 1.0560581, 0.9939038, 1.1754719, 0.9804448, 1.1892616, 1.2926922, 1.0381062, 0.9991459, 1.0110192, 1.7982637, 0.9932575, 1.0365072, 1.0476382, 0.9572147, 1.0362918, 0.929587, 1.1575934, 1.0942486, 1.1386353, 1.0484103, 1.0846261, 0.9627105, 1.0514676, 1.0148971, 0.9468566, 1.1103724, 1.0637948, 1.9343892, 1.0520743, 1.0526934, 1.0679818, 1.0045373, 1.3400328, 0.9598806, 1.0309374, 0.9556979, 1.3586868, 0.9806832, 1.0108765, 0.9652751, 1.9171728, 1.1786559, 1.0223136, 0.9491173, 1.0020994, 0.977787, 1.0659739, 1.4374944, 1.0311553, 1.0109194, 1.4310709, 0.9937973, 1.1235442, 1.0475279, 1.0221015, 1.0810464, 1.6977976, 1.0944615, 1.0511645, 1.0957941, 1.4443457, 1.0375637, 1.1045543, 1.0264414, 1.0205876, 1.3753965, 1.0976175, 1.0539255, 1.037731, 1.0592793, 1.0109924, 1.0427939, 1.1111455, 1.04521, 0.9745986, 1.3716186, 1.0089931, 1.0603559, 1.5494147, 0.9854366, 1.2662523, 0.9623836, 1.3929899, 0.999679, 1.0011268, 1.0179427, 1.0416134, 1.7609114, 1.069779, 
1.0366241, 1.1245068, 0.9792311, 0.967655, 0.9542575, 1.1684304, 1.2482993, 1.2640331, 1.0298585, 0.9111223, 1.0672941, 0.9855631, 0.9206366, 1.1058931, 1.0740426, 0.9649612, 1.3460875, 0.9493052, 1.0763382, 1.0750445, 1.1003632, 1.0639591, 1.0930897, 0.9366367, 1.4825478, 0.9872073, 1.0595017, 0.9098508, 0.9132522, 0.9715029, 1.3445599, 0.9442429, 0.9947035, 1.5735628, 1.0179848, 1.1207158, 1.4513845, 0.9971349, 1.0549698, 1.0829184, 0.9570918, 1.1063325, 1.049832, 1.6941119, 0.976464, 1.0548108, 1.0429154, 1.1387078, 1.252386, 1.4497295, 1.2952889, 1.0345598, 1.3188744, 1.059327, 0.9671478, 0.9628657, 0.9935354, 1.2020615, 0.977946, 1.0286028, 0.9360817, 0.9507702, 1.0119649, 1.49294, 0.9929636, 1.0500374, 1.3857874, 1.271137, 1.2183431, 1.0284245, 1.2371945, 1.1308861, 1.386502, 1.0364896, 1.222194, 1.0893758, 1.3687506, 0.9889728, 0.9717685, 0.9804448, 1.0066674, 0.9703385, 1.5495994, 1.0779985, 0.9233493, 1.1049508, 1.0770304, 0.9206519, 1.645557, 1.0494959, 1.1984923, 1.4967244, 0.9976991, 1.0476285, 0.9612643, 0.9270878, 0.9683637, 1.1585881, 1.0376168, 0.9816509, 0.9598896, 1.035713, 1.0170878, 0.9578521, 0.9849839, 0.9363952, 0.9856201, 1.0240401, 1.1739687, 1.1257174, 0.9772498, 0.9539389, 0.9537187, 1.3452872, 0.9888146 ) expect_equal(round(lof_kd, 7), lof_DMwr) expect_equal(round(lof_d, 7), lof_DMwr) ## missing values, but distances are fine x_na <- x x_na[c(1,3,5), 1] <- NA expect_error(lof(x_na), regexp = "NA") res_d1 <- lof(x_na, search = "dist") res_d2 <- lof(dist(x_na)) expect_equal(res_d1, res_d2) x_na[c(1,3,5), 2] <- NA expect_error(lof(x_na), regexp = "NA") expect_error(lof(x_na, search = "dist"), regexp = "NA") expect_error(lof(dist(x_na)), regexp = "NA") ## test with tied distances x <- rbind(1,2,3,4,5,6,7) expect_equal(round(lof(x, minPts = 4), 7), c(1.0679012, 1.0679012, 1.0133929, 0.8730159, 1.0133929, 1.0679012, 1.0679012)) expect_equal(round(lof(dist(x), minPts = 4),7), c(1.0679012, 1.0679012, 1.0133929, 0.8730159, 1.0133929, 1.0679012, 
1.0679012)) }) ================================================ FILE: tests/testthat/test-mst.R ================================================ test_that("mst", { draw_mst <- function(x, m) { plot(x) text(x, labels = 1:nrow(x), pos = 1) for (i in seq(nrow(m))) { from_to <- rbind(x[m[i, 1], ], x[m[i, 2], ]) lines(from_to[, 1], from_to[, 2]) } } x <- rbind(c(0, 0), c(0, 1), c(1, 1)) d <- dist(x) (m <- mst(d, n = nrow(x))) #draw_mst(x, m) expect_equal(m, structure( c(2, 3, 1, 2, 1, 1), dim = 2:3, dimnames = list(NULL, c("from", "to", "weight")) )) x <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1), c(2, 1), c(1, 2), c(.7, 1), c(.7, .7), c(.7, 1.3)) d <- dist(x) (m <- mst(d, n = nrow(x))) #draw_mst(x, m) expect_equal(m, structure( c( 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 7, 4, 9, 8, 1, 7, 0.761577310586391, 0.7, 0.3, 1, 0.761577310586391, 0.3, 0.989949493661166, 0.3 ), dim = c(8L, 3L), dimnames = list(NULL, c("from", "to", "weight")) )) # data("Dataset_2") # x <- Dataset_2[,1:2] # cl <- Dataset_2[,3] # x_3 <- x[cl==3, ] # # (m <- mst(dist(x_3), n = nrow(x_3))) # max(m[,3]) # draw_mst(x_3, m) }) test_that("dist_subset", { x <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1), c(2, 1), c(1, 2), c(.7, 1), c(.7, .7), c(.7, 1.3)) d <- dist(x) m <- as.matrix(d) s <- c(1:3, 6) (d_sub <- dist_subset(d, s)) (m_sub <- m[s,s]) expect_equal(unname(as.matrix(d_sub)), unname(m_sub)) }) ================================================ FILE: tests/testthat/test-optics.R ================================================ test_that("OPTICS", { load(test_path("fixtures", "test_data.rda")) load(test_path("fixtures", "elki_optics.rda")) x <- test_data ### run OPTICS eps <- .1 #eps <- .06 eps_cl <- .1 minPts <- 10 res <- optics(x, eps = eps, minPts = minPts) expect_length(res$order, nrow(x)) expect_length(res$reachdist, nrow(x)) expect_length(res$coredist, nrow(x)) expect_identical(res$eps, eps) expect_identical(res$minPts, minPts) ### compare with distance based version! 
res_d <- optics(dist(x), eps = eps, minPts = minPts) expect_equal(res, res_d) #plot(res) #plot(res_d) ### compare with elki's result expect_equal(res$order, elki$ID) expect_equal(round(res$reachdist[res$order], 3), round(elki$reachability, 3)) ### compare result with DBSCAN ### "clustering created from a cluster-ordering is nearly indistinguishable ### from a clustering created by DBSCAN. Only some border objects may ### be missed" # extract DBSCAN clustering res <- extractDBSCAN(res, eps_cl = eps_cl) #plot(res) # are there any clusters with only border points? frnn <- frNN(x, eps_cl) good <- vapply(frnn$id, function(x) (length(x) + 1L) >= minPts, logical(1L)) #plot(x, col = (res$cluster+1L)) c_good <- res$cluster[good] c_notgood <- res$cluster[!good] expect_false(setdiff(c_notgood, c_good) != 0L) # compare with DBSCAN db <- dbscan(x, minPts = minPts, eps = eps) #plot(x, col = res$cluster+1L) #plot(x, col = db$cluster+1L) # match clusters (get rid of border points which might differ) pure <- vapply( split(db$cluster, res$cluster), function(x) length(unique(x)), integer(1L) ) expect_true(all(pure[names(pure) != "0"] == 1L)) ## missing values, but distances are fine x_na <- x x_na[c(1,3,5), 1] <- NA expect_error(optics(x_na, eps = .2, minPts = 4), regexp = "NA") res_d1 <- optics(x_na, eps = .2, minPts = 4, search = "dist") res_d2 <- optics(dist(x_na), eps = .2, minPts = 4) expect_equal(res_d1, res_d2) ## introduce NAs into dist x_na[c(1,3,5), 2] <- NA expect_error(optics(x_na, eps = .2, minPts = 4), regexp = "NA") expect_error(optics(x_na, eps = .2, minPts = 4, search = "dist"), regexp = "NA") expect_error(optics(dist(x_na), eps = .2, minPts = 4), regexp = "NA") ## Create OPTICS-converted and single-linkage dendrograms res <- optics(test_data, eps = Inf, minPts = 2) res_dend <- as.dendrogram(res) reference <- as.dendrogram(hclust(dist(test_data), method = "single")) ## Test dendrogram ordering expect_equal(as.integer(unlist(res_dend)), res$order) ## Test Single
Linkage with minPts=2, eps=INF for strict equivalence ## Note: Reordering needed to correct for isomorphisms ref_order <- order.dendrogram(reference) reference <- reorder(reference, ref_order, agglo.FUN = mean) expect_equal(reference, reorder(res_dend, ref_order, agglo.FUN = mean)) # Make sure any epsilon that queries the entire neighborhood works, # error otherwise max_rd <- max(res$reachdist[!is.infinite(res$reachdist)], na.rm = TRUE) expect_error(as.dendrogram(optics(test_data, eps = max_rd-1e-7, minPts = 2)), regexp = "Eps") expect_error(as.dendrogram(optics(test_data, eps = max_rd, minPts = nrow(test_data) + 1)), regexp = "'minPts'") ## Test symmetric relation between reachability <-> dendrogram structures expect_equal(as.reachability(as.dendrogram(res))$reachdist, res$reachdist) expect_equal(as.reachability(as.dendrogram(res))$order, res$order) }) ================================================ FILE: tests/testthat/test-opticsXi.R ================================================ test_that("OPTICS-XI", { load(test_path("fixtures", "test_data.rda")) load(test_path("fixtures", "elki_optics.rda")) load(test_path("fixtures", "elki_optics_xi.rda")) ### run OPTICS XI with parameters: xi=0.01, eps=1.0, minPts=5 x <- test_data res <- optics(x, eps = 1.0, minPts = 5) res <- extractXi(res, xi = 0.10, minimum = FALSE) ### Check to make sure ELKI results match R expected <- res$clusters_xi[, c("start", "end")] class(expected) <- "data.frame" expect_identical(elki_optics_xi, expected) }) ================================================ FILE: tests/testthat/test-predict.R ================================================ test_that("predict", { set.seed(3) n <- 100 x_data <- cbind( x = runif(5, 0, 10) + rnorm(n, sd = 0.2), y = runif(5, 0, 10) + rnorm(n, sd = 0.2) ) x_noise <- cbind( x = runif(n/2, 0, 10), y = runif(n/2, 0, 10) ) x <- rbind(x_data, x_noise) # check if l points with a little noise are assigned to the same cluster l <- 20 newdata <- rbind( x_data[1:l,] + 
rnorm(2*l, 0, .05), x_noise[1:l,] + rnorm(2*l, 0, .05) ) idx <- c(1:l, n + (1:l)) #plot(x, col = rep(c("black", "gray"), each = n)) #points(newdata, col = rep(c("red", "gray"), each = l), pch = 16) # DBSCAN res <- dbscan(x, eps = .3, minPts = 3) pr <- predict(res, newdata, data = x) rbind(true = res$cluster[idx], pred = pr) expect_equal(res$cluster[idx], pr) #plot(x, col = ifelse(res$cluster == 0, "gray", res$cluster)) #points(newdata, col = ifelse(pr == 0, "gray", pr), pch = 16) # OPTICS res <- optics(x, minPts = 3) res <- extractDBSCAN(res, eps = .3) pr <- predict(res, newdata, data = x) rbind(true = res$cluster[idx], pred = pr) expect_equal(res$cluster[idx], pr) # currently no implementation for extractXi # HDBSCAN (note predict is not perfect for the data.) res <- hdbscan(x, minPts = 3) pr <- predict(res, newdata, data = x) rbind(true = res$cluster[idx], pred = pr) accuracy <- sum(res$cluster[idx] == pr)/length(pr) expect_true(accuracy > .9) # show misclassifications #plot(x, col = ifelse(res$cluster == 0, "gray", res$cluster)) #points(newdata, col = ifelse(pr == 0, "gray", pr), pch = 16) #points(newdata[res$cluster[idx] != pr,, drop = FALSE], col = "red", pch = 4, lwd = 2) }) ================================================ FILE: tests/testthat/test-sNN.R ================================================ test_that("sNN", { set.seed(665544) n <- 1000 x <- cbind( x = runif(10, 0, 10) + rnorm(n, sd = 0.2), y = runif(10, 0, 10) + rnorm(n, sd = 0.2), z = runif(10, 0, 10) + rnorm(n, sd = 0.2) ) ## no duplicates first! 
x <- x[!duplicated(x),] rownames(x) <- paste0("Object_", seq_len(nrow(x))) k <- 5L nn <- sNN(x, k=k, sort = TRUE) ## check dimensions expect_equal(nn$k, k) expect_equal(dim(nn$dist), c(nrow(x), k)) expect_equal(dim(nn$id), c(nrow(x), k)) ## check visually #plot(x) #points(x[nn$id[1,],], col="red", lwd=5) #points(x[nn$id[2,],], col="green", lwd=5) ## compare with kNN found using distances nn_d <- sNN(dist(x), k, sort = TRUE) ## check visually #plot(x) #points(x[nn_d$id[1,],], col="red", lwd=5) #points(x[nn_d$id[2,],], col="green", lwd=5) ### will agree except for some ties expect_equal(nn, nn_d) ## calculate dist internally nn_d2 <- sNN(x, k, search = "dist", sort = TRUE) expect_equal(nn, nn_d2) ## missing values, but distances are fine x_na <- x x_na[c(1,3,5), 1] <- NA expect_error(sNN(x_na, k = 3), regexp = "NA") res_d1 <- sNN(x_na, k = 3, search = "dist") res_d2 <- sNN(dist(x_na), k = 3) expect_equal(res_d1, res_d2) ## introduce NAs into dist x_na[c(1,3,5),] <- NA expect_error(sNN(x_na, k = 3), regexp = "NA") expect_error(sNN(x_na, k = 3, search = "dist"), regexp = "NA") expect_error(sNN(dist(x_na), k = 3), regexp = "NA") ## sort and kNN to reduce k nn10 <- sNN(x, k = 10, sort = FALSE) expect_false(nn10$sort_shared) expect_error(sNN(nn10, k = 11)) nn5 <- sNN(nn10, k = 5, sort = TRUE) nn5_x <- sNN(x, k = 5, sort = TRUE) expect_equal(nn5, nn5_x) ## test with simple data x <- data.frame(x = 1:10, check.names = FALSE) nn <- sNN(x, k = 5) i <- 1 j_ind <- 1 j <- nn$id[i,j_ind] intersect(c(i, nn$id[i,]), nn$id[j,]) nn$shared[i,j_ind] # compute the sNN similarity in R ss <- matrix(nrow = nrow(x), ncol = nn$k) for(i in seq_len(nrow(x))) for(j_ind in 1:nn$k) ss[i, j_ind] <- length(intersect(c(i, nn$id[i,]), nn$id[nn$id[i,j_ind],])) expect_equal(nn$shared, ss) }) ================================================ FILE: tests/testthat.R ================================================ library(testthat) library(dbscan) test_check("dbscan")
================================================ FILE: vignettes/dbscan.Rnw ================================================ % !Rnw weave = Sweave \documentclass[nojss]{jss} % Package includes \usepackage[utf8]{inputenc} \usepackage[english]{babel} %\usepackage{esvect} % vv %\usepackage{algorithm} % algorithm tools %\usepackage[noend]{algpseudocode} % algorithmic (pseudocode) tools \usepackage{mathtools} % coloneqq \usepackage{amsthm} %\usepackage[dvipsnames]{xcolor} % for adding color to code %\usepackage{listings} % For pprinting r code blocks (w/o execution) \usepackage{amssymb} \usepackage{pifont} % http://ctan.org/pkg/pifont %\usepackage{float} \usepackage{tabularx} %\usepackage[toc,page]{appendix} % Remove sweave margins if possible %\usepackage[belowskip=-15pt,aboveskip=0pt]{caption} %\setlength{\intextsep}{8pt plus 1pt minus 1pt} %\setlength{\floatsep}{1ex} %\setlength{\textfloatsep}{1ex plus 1pt minus 1pt} %\setlength{\abovecaptionskip}{0ex} %\setlength{\belowcaptionskip}{0ex} % Aliases and commands \newtheorem{mydef}{Definition} \newcommand{\minus}{\scalebox{0.75}[1.0]{$-$}} \newcommand{\exdb}{\texttt{extractDBSCAN} } \mathchardef\mhyphen="2D % Define a "math hyphen" \newcommand{\cmark}{\ding{51}} % checkmark %% \VignetteIndexEntry{Fast Density-based Clustering (DBSCAN and OPTICS)} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% declarations for jss.cls %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \author{ Michael Hahsler \\Southern Methodist University \And Matthew Piekenbrock\\Wright State University \AND Derek Doran \\ Wright State University } \title{\pkg{dbscan}: Fast Density-based Clustering with \proglang{R}} \Plainauthor{Michael Hahsler, Matthew Piekenbrock, Derek Doran} \Plaintitle{dbscan: Fast Density-based Clustering with R} \Shorttitle{\pkg{dbscan}: Density-based Clustering with \proglang{R}} \Address{ Michael Hahsler\\ Department of Engineering Management, Information, and Systems\\ Bobby B. Lyle School of Engineering, SMU\\ P. O. 
Box 750123, Dallas, TX 75275\\ E-mail: \email{mhahsler@lyle.smu.edu}\\ URL: \url{https://michael.hahsler.net/} \vspace{5mm} Matt Piekenbrock\\ Department of Computer Science and Engineering\\ Wright State University\\ 3640 Colonel Glenn Hwy, Dayton, OH, 45435\\ E-mail: \email{piekenbrock.5@wright.edu} \vspace{5mm} Derek Doran\\ Department of Computer Science and Engineering\\ Wright State University\\ 3640 Colonel Glenn Hwy, Dayton, OH, 45435\\ E-mail: \email{derek.doran@wright.edu} } \Abstract { This article describes the implementation and use of the \proglang{R} package \pkg{dbscan}, which provides complete and fast implementations of the popular density-based clustering algorithm DBSCAN and the augmented ordering algorithm OPTICS. Compared to other implementations, \pkg{dbscan} offers an open-source implementation using \proglang{C++} and advanced data structures like k-d trees to speed up computation. An important advantage of this implementation is that it is up-to-date with several primary advancements that have been added since their original publications, including artifact corrections and dendrogram extraction methods for OPTICS. Experiments comparing \pkg{dbscan}'s implementation of DBSCAN and OPTICS with other libraries such as FPC, ELKI, WEKA, PyClustering, SciKit-Learn and SPMF suggest that \pkg{dbscan} provides a very efficient implementation. } \Keywords{DBSCAN, OPTICS, Density-based Clustering, Hierarchical Clustering} \begin{document} % Do not move SweaveOpts into preamble \SweaveOpts{concordance=TRUE} % prefix.string=generated/dbscan \section{Introduction} Clustering is typically described as the process of finding structure in data by grouping similar objects together, where the resulting groups are called clusters.
Many clustering algorithms directly apply the idea that clusters can be formed such that objects in the same cluster should be more similar to each other than to objects in other clusters. The notion of similarity (or distance) stems from the fact that objects are assumed to be data points embedded in a data space in which a similarity measure can be defined. Examples are methods based on solving the $k$-means problem or mixture models which find the parameters of a parametric generative probabilistic model from which the observed data are assumed to arise. Another approach is hierarchical clustering, which uses local heuristics to form a hierarchy of nested grouping of objects. Most of these approaches (with the notable exception of single-link hierarchical clustering) are biased towards clusters with convex, hyper-spherical shape. A detailed review of these clustering algorithms is provided in \cite{Kaufman:1990}, \cite{jain1999review}, and the more recent review by \cite{Aggarwal:2013}. Density-based clustering approaches clustering differently. It simply posits that clusters are contiguous `dense' regions in the data space (i.e., regions of high point density), separated by areas of low point density~\citep{kriegel:2011,sander2011density}. Density-based methods find such high-density regions representing clusters of arbitrary shape and typically have a structured means of identifying noise points in low-density regions. These properties provide advantages for many applications compared to other clustering approaches. For example, geospatial data may be fraught with noisy data points due to estimation errors in GPS-enabled sensors~\citep{Chen2014} and may have unique cluster shapes caused by the physical space the data was captured in. 
Density-based clustering is also a promising approach to clustering high-dimensional data~\citep{kailing2004density}, where partitions are difficult to discover, and where the physical shape constraints assumed by model-based methods are more likely to be violated. %While dimensionality reduction techniques enable the use of many clustering algorithms to cluster high dimensional data, density-based clustering enables us to group high-dimensional data without the loss of information and recognizing noisy data. % What DBSCAN has been used for Several density-based clustering algorithms have been proposed, including the DBSCAN algorithm~\citep{ester1996density}, DENCLUE~\citep{hinneburg1998efficient} and many DBSCAN derivatives like HDBSCAN~\citep{campello2015hierarchical}. These clustering algorithms are widely used in practice with applications ranging from finding outliers in datasets for fraud prevention~\citep{breunig2000lof}, to finding patterns in streaming data~\citep{chen2007density, cao2006density}, noisy signals~\citep{kriegel2005density,ester1996density,tran2006knn,hinneburg1998efficient,duan2007local}, gene expression data~\citep{jiang2003dhc}, multimedia databases~\citep{kisilevich2010p}, and road traffic~\citep{li2007traffic}. %%% MFH: I am not sure this is true. Some of these are not pure density-based % What is the aim of the DBSCAN package? %There are many meaningful ways to define 'natural' clusters based on density. As a result, numerous density-based clustering algorithms have been proposed within the past two decades, e.g., %BIRCH~\citep{zhang96}, %DBSCAN algorithm~\citep{ester1996density}, %DENCLUE~\citep{hinneburg1998efficient}, %CURE~\citep{guha1998cure}, %CHAMELEON~\citep{karypis1999chameleon}, %CLARANS~\citep{ng2002clarans}, %and HDBSCAN~\citep{campello2015hierarchical}.
This paper focuses on an efficient implementation of the DBSCAN algorithm~\citep{ester1996density}, one of the most popular density-based clustering algorithms, whose consistent use earned it the 2014 SIGKDD Test of Time Award~\citep{SIGKDDNe30:online}, and OPTICS~\citep{ankerst1999optics}, often referred to as an extension of DBSCAN. %Matt - what do you mean when you say related algorithms? %along with their related algorithms, such as the Local Outlier Factor \citep{breunig2000lof} and the conversion methods between reachability and dendrogram representations\citep{sander2003automatic}. %Matt - you can cite the KAIS 17 paper in this first sentence While surveying software tools that implement various density-based clustering algorithms, it was discovered that in a large number of statistical tools, not only do implementations vary significantly in performance~\citep{kriegel2016black}, but they may also lack important components and corrections. Specifically, for the statistical computing environment \proglang{R}~\citep{team2013r}, only naive DBSCAN implementations without speed-up from spatial data structures are available (e.g., in the well-known Flexible Procedures for Clustering package~\citep{fpc}), and OPTICS is not available. %% Matt, what packages? : fixed (fpc). It's probably not worth mentioning largeVis, doesn't even compile/load properly on my machine. This motivated the development of an \proglang{R} package for density-based clustering with DBSCAN and related algorithms called \pkg{dbscan}. The \pkg{dbscan} package contains complete, correct and fast implementations of DBSCAN and OPTICS. % precisely as intended by the original authors of the algorithms. The package currently enjoys thousands of new installations from the CRAN repository every month.
This article presents an overview of the \proglang{R} package~\pkg{dbscan} focusing on DBSCAN and OPTICS, outlining its operation and experimentally comparing its performance with other open-source implementations. We first review the concept of density-based clustering and present the DBSCAN and OPTICS algorithms in Section~\ref{sec:dbc}. This section concludes with a short review of existing software packages that implement these algorithms. Details about \pkg{dbscan}, with examples of its use, are presented in Section~\ref{sec:dbscan}. A performance evaluation is presented in Section~\ref{sec:eval}. Concluding remarks are offered in Section~\ref{sec:conc}. A version of this article describing the package \pkg{dbscan} was published as \cite{hahsler2019dbscan} and should be cited. <>= options(useFancyQuotes = FALSE) citation("dbscan") @ \section{Density-based clustering}\label{sec:dbc} Density-based clustering is now a well-studied field. Conceptually, the idea behind density-based clustering is simple: given a set of data points, define a structure that accurately reflects the underlying density~\citep{sander2011density}. An important distinction between density-based clustering and alternative approaches to cluster analysis, such as the use of \emph{(Gaussian) mixture models}~\citep[see][]{jain1999review}, is that the latter represents a \emph{parametric} approach in which the observed data are assumed to have been produced by a mixture of Gaussian or other parametric families of distributions. While certainly useful in many applications, parametric approaches naturally assume clusters will exhibit some type of convex (generally hyper-spherical or hyper-elliptical) shape.
Other approaches, such as $k$-means clustering (where the $k$ parameter signifies the user-specified number of clusters to find), share this common theme of `minimum variance', where the underlying assumption is made that ideal clusters are found by minimizing some measure of intra-cluster variance (often referred to as cluster cohesion) and maximizing the inter-cluster variance (cluster separation)~\citep{arbelaitz2013extensive}.
Conversely, the label density-based clustering is used for methods which do not assume parametric distributions, are capable of finding arbitrarily-shaped clusters, handle varying amounts of noise, and require no prior knowledge regarding how to set the number of clusters $k$.
This methodology is best expressed in the DBSCAN algorithm, which we discuss next.

\subsection{DBSCAN: Density Based Spatial Clustering of Applications with Noise}

As one of the most cited density-based clustering algorithms~\citep{acade96:online}, DBSCAN~\citep{ester1996density} is likely the best known density-based clustering algorithm in the scientific community today.
The central idea behind DBSCAN and its extensions and revisions is the notion that points are assigned to the same cluster if they are \emph{density-reachable} from each other.
To understand this concept, we will go through the most important definitions used in DBSCAN and related algorithms.
The definitions follow the original ones by \cite{ester1996density}, but are adapted to provide a more consistent presentation with the other algorithms discussed in this paper.
Clustering starts with a dataset $D$ containing a set of points $p \in D$.
Density-based algorithms need to obtain a density estimate over the data space.
DBSCAN estimates the density around a point using the concept of the $\epsilon$-neighborhood.

\begin{mydef} {\bf $\epsilon$-Neighborhood}.
The $\epsilon$-neighborhood, $N_\epsilon(p)$, of a data point $p$ is the set of points within a specified radius $\epsilon$ around $p$.
$$N_\epsilon(p) = \{q \;|\; d(p,q) \le \epsilon\}$$
where $d$ is some distance measure and $\epsilon \in \mathbb{R}^+$.
Note that the point $p$ is always in its own $\epsilon$-neighborhood, i.e., $p \in N_\epsilon(p)$ always holds.
\end{mydef}

Following this definition, the size of the neighborhood $|N_\epsilon(p)|$ can be seen as a simple unnormalized kernel density estimate around $p$ using a uniform kernel and a bandwidth of $\epsilon$.
DBSCAN uses $N_\epsilon(p)$ and a threshold called $\mathit{minPts}$ to detect dense regions and to classify the points in a data set into {\bf core}, {\bf border}, or {\bf noise} points.

\begin{mydef} {\bf Point classes}. A point $p \in D$ is classified as
\begin{itemize}
\item a {\bf core point} if $N_\epsilon(p)$ has high density, i.e., $|N_\epsilon(p)| \geq \mathit{minPts}$ where $\mathit{minPts} \in \mathbb{Z}^+$ is a user-specified density threshold,
\item a {\bf border point} if $p$ is not a core point, but it is in the neighborhood of a core point $q \in D$, i.e., $p \in N_\epsilon(q)$, or
\item a {\bf noise point}, otherwise.
\end{itemize}
\end{mydef}

\begin{figure}
\minipage{0.49\textwidth}
\includegraphics[height=\linewidth, angle=-90, origin=c]{figures/dbscan_a}\\
\centerline{(a)}
\endminipage\hfill
\minipage{0.49\textwidth}
\includegraphics[height=\linewidth, angle=-90, origin=c]{figures/dbscan_b}\\
\centerline{(b)}
\endminipage\\
\caption{Concepts used in the DBSCAN family of algorithms. (a) shows examples for the three point classes, core, border, and noise points; (b) illustrates the concepts of density-reachability and density-connectivity.}\label{fig:point_classes}
\end{figure}

A visual example is shown in Figure~\ref{fig:point_classes}(a).
The size of the neighborhood for some points is shown as a circle and their class is shown as an annotation.
To form contiguous dense regions from individual points, DBSCAN defines the notions of reachability and connectedness.

\begin{mydef} {\bf Directly density-reachable}.
A point $q \in D$ is directly density-reachable from a point $p \in D$ with respect to $\epsilon$ and $\mathit{minPts}$ if, and only if,
\begin{enumerate}
\item $|N_\epsilon(p)| \geq \mathit{minPts}$, and
\item $q \in N_\epsilon(p)$.
\end{enumerate}
That is, $p$ is a core point and $q$ is in its $\epsilon$-neighborhood.
\end{mydef}

\begin{mydef} {\bf Density-reachable}.
A point $p$ is density-reachable from $q$ if there exists an ordered sequence of points $(p_1, p_2, \ldots, p_n)$ in $D$ with $q = p_1$ and $p = p_n$ such that $p_{i+1}$ is directly density-reachable from $p_i$ for all $i \in \{1, 2, \ldots, n-1\}$.
\end{mydef}

\begin{mydef} {\bf Density-connected}.
A point $p \in D$ is density-connected to a point $q \in D$ if there is a point $o \in D$ such that both $p$ and $q$ are density-reachable from $o$.
\end{mydef}

The notion of density-connection can be used to form clusters as contiguous dense regions.

\begin{mydef} {\bf Cluster}.
A cluster $C$ is a non-empty subset of $D$ satisfying the following conditions:
\begin{enumerate}
\item {\bf Maximality}: If $p \in C$ and $q$ is density-reachable from $p$, then $q \in C$; and
\item {\bf Connectivity}: $\forall$ $p, q \in C$, $p$ is density-connected to $q$.
\end{enumerate}
\end{mydef}

The DBSCAN algorithm identifies all such clusters by finding all core points and expanding each to all density-reachable points.
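The point classes defined above can be computed directly with the fixed-radius nearest neighbor search provided by \pkg{dbscan}. The following sketch is not part of the original text; the values for \code{eps} and \code{minPts} are arbitrary, and \code{x} is assumed to be a numeric data matrix. Note that \code{frNN()} does not include the query point itself, so one is added to the neighborhood size.

<<eval=FALSE>>=
library("dbscan")
eps <- 0.06; minPts <- 3
nn <- frNN(x, eps = eps)             # fixed-radius neighborhoods
size <- sapply(nn$id, length) + 1L   # |N_eps(p)|, including p itself
core <- size >= minPts               # core points
border <- !core & sapply(nn$id, function(ids) any(core[ids]))
noise <- !core & !border             # all remaining points are noise
@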
The algorithm begins with an arbitrary point $p$ and retrieves its $\epsilon$-neighborhood.
If $p$ is a core point, then it starts a new cluster that is expanded by assigning all points in its neighborhood to the cluster.
If an additional core point is found in the neighborhood, then the search is expanded to also include all points in its neighborhood.
If no more core points are found in the expanded neighborhood, then the cluster is complete, and the remaining points are searched to see if another core point can be found to start a new cluster.
After processing all points, points which were not assigned to a cluster are considered noise.
In the DBSCAN algorithm, core points are always part of the same cluster, independent of the order in which the points in the dataset are processed.
This is different for border points.
Border points might be density-reachable from core points in several clusters, and the algorithm assigns them to the first of these clusters that is processed, which depends on the order of the data points and on the particular implementation of the algorithm.
This needs to be taken into account when comparing two different implementations, since they might visit the points in a different order and thus produce different cluster assignments for border points.
To avoid this ambiguity, \cite{campello2015hierarchical} suggest a modification called DBSCAN* which considers all border points as noise instead and leaves them unassigned.

\subsection{OPTICS: Ordering Points To Identify the Clustering Structure}\label{sec:optics}

There are many instances where it would be useful to detect clusters of varying density.
From identifying causes among similar seawater characteristics~\citep{birant2007st} and network intrusion detection~\citep{ertoz2003finding}, to point-of-interest detection using geo-tagged photos~\citep{kisilevich2010p} and classifying cancerous skin lesions~\citep{celebi2005mining}, the motivations for detecting clusters of varying density are numerous.
The inability to find clusters of varying density is a notable drawback of DBSCAN, resulting from the fact that the combination of a specific neighborhood size with a single density threshold $\mathit{minPts}$ is used to determine whether a point resides in a dense neighborhood.
In 1999, some of the original DBSCAN authors developed OPTICS~\citep{ankerst1999optics} to address this concern.
OPTICS borrows the core density-reachability concept from DBSCAN.
But while DBSCAN may be thought of as a clustering algorithm, searching for natural groups in data, OPTICS is an \emph{augmented ordering algorithm} from which either flat or hierarchical clustering results can be derived.
OPTICS requires the same $\epsilon$ and $\mathit{minPts}$ parameters as DBSCAN; however, the $\epsilon$ parameter is theoretically unnecessary and is only used for the practical purpose of reducing the runtime complexity of the algorithm.
To describe OPTICS, we introduce two additional concepts called core-distance and reachability-distance.
All distances are calculated using the same metric (often Euclidean distance) used for the neighborhood calculation.

\begin{mydef} {\bf Core-distance}.
The core-distance of a point $p \in D$ with respect to $\mathit{minPts}$ and $\epsilon$ is defined as
\[ \mathrm{core\mhyphen dist}(p; \epsilon, \mathit{minPts}) =
\begin{cases}
\text{UNDEFINED} & \text{if} \; |N_{\epsilon}(p)| < \mathit{minPts}, \text{and} \\
\mathrm{minPts\mhyphen dist}(p) & \text{otherwise,}
\end{cases} \]
where $\mathrm{minPts\mhyphen dist}(p)$ is the distance from $p$ to its $(\mathit{minPts}-1)$th nearest neighbor, i.e., the minimal radius a neighborhood of size $\mathit{minPts}$ centered at and including $p$ would have.
\end{mydef}

\begin{mydef} {\bf Reachability-distance}.
The reachability-distance of a point $p \in D$ to a point $q \in D$, parameterized by $\epsilon$ and $\mathit{minPts}$, is defined as
\[ \mathrm{reachability\mhyphen dist}(p,q; \epsilon, \mathit{minPts}) =
\begin{cases}
\text{UNDEFINED} & \text{if} \; |N_{\epsilon}(p)| < \mathit{minPts}, \text{and} \\
\max(\mathrm{core\mhyphen dist}(p), d(p, q)) & \text{otherwise.}
\end{cases} \]
\end{mydef}

The reachability-distance of a point $q$ with respect to a core point $p$ is the smallest neighborhood radius such that $q$ would be directly density-reachable from $p$.
Note that $\epsilon$ is typically set much larger than for DBSCAN.
Therefore, $\mathit{minPts}$ behaves differently for OPTICS: more points will be considered core points, and $\mathit{minPts}$ also affects how many nearest neighbors are considered in the core-distance calculation, where larger values lead to larger and smoother reachability distributions.
This needs to be kept in mind when choosing appropriate parameters.
OPTICS provides an augmented ordering.
The algorithm starts with a point and expands its neighborhood like DBSCAN, but it explores new points in the order of lowest to highest reachability-distance.
The order in which the points are explored, along with each point's core- and reachability-distance, is the final result of the algorithm.
An example of the order and the resulting reachability-distances is shown in the form of a reachability plot in Figure~\ref{fig:opticsReachPlot1}.
Low reachability-distances, shown as valleys, represent clusters separated by peaks representing points with larger distances.
This density representation essentially conveys the same information as the often-used dendrogram or `tree-like' structure.
This is why OPTICS is often also described as a visualization tool.
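As an illustrative sketch (not from the original text; the parameter values are arbitrary and \code{x} is an assumed data matrix), the core-distance can be obtained from the kNN distances computed by \pkg{dbscan}. \code{kNNdist()} excludes the query point itself, so the core-distance of a point is its distance to the $(\mathit{minPts}-1)$th nearest neighbor, and it is undefined when this distance exceeds $\epsilon$:

<<eval=FALSE>>=
minPts <- 10; eps <- 1
core_dist <- kNNdist(x, k = minPts - 1)  # distance to the (minPts-1)th NN
core_dist[core_dist > eps] <- NA         # UNDEFINED: |N_eps(p)| < minPts
@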
\cite{sander2003automatic} showed how the output of OPTICS can be converted into an equivalent dendrogram and that, under certain conditions, the dendrogram produced by the well-known hierarchical clustering with single linkage is identical to the result of running OPTICS with the parameter $\mathit{minPts} = 2$.
Due to the widespread usage of dendrograms in the \proglang{R} computing environment, this conversion algorithm between reachability and dendrogram representations is made available in \pkg{dbscan}.

\begin{figure}
\centering
\includegraphics{dbscan-opticsReachPlot}
\caption{OPTICS reachability plot example for a data set with four clusters of 100 data points each.}
\label{fig:opticsReachPlot1}
\end{figure}
From the order discovered by OPTICS, two ways to group points into clusters were discussed in~\cite{ankerst1999optics}: one which we will refer to as the {\bf ExtractDBSCAN} method and one which we will refer to as the {\bf Extract-$\xi$} method, summarized below.
\begin{enumerate}
\item {\bf ExtractDBSCAN} uses a single global reachability-distance threshold $\epsilon'$ to extract a clustering. This can be seen as a horizontal line in the reachability plot in Figure~\ref{fig:opticsReachPlot1}. Peaks above the cut-off represent noise points and separate the clusters.
\item {\bf Extract-$\xi$} identifies clusters \emph{hierarchically} by scanning through the ordering that OPTICS produces to identify significant, relative changes in reachability-distance.
The authors of OPTICS noted that these clusters can be thought of as `dents' in the reachability plot.
\end{enumerate}

The ExtractDBSCAN method extracts a clustering equivalent to DBSCAN* (i.e., DBSCAN where border points stay unassigned).
Because this method extracts clusters like DBSCAN, it cannot identify partitions that exhibit very significant differences in density.
Clusters of significantly different density can only be identified if the data are well separated and very little noise is present.
The second method, which we call Extract-$\xi$\footnote{In the original OPTICS publication \cite{ankerst1999optics}, the algorithm was outlined in Figure 19 and called the `ExtractClusters' algorithm, and the extracted clusters were referred to as $\xi$-clusters. To distinguish the method uniquely, we refer to it as the Extract-$\xi$ method.}, identifies a cluster hierarchy and replaces the data-dependent global threshold parameter with $\xi$, a data-independent density threshold ranging between $0$ and $1$.
One interpretation of $\xi$ is that it describes the relative magnitude of the change in cluster density (i.e., reachability).
Significant changes in relative reachability allow clusters to manifest themselves hierarchically as `dents' in the ordering structure.
The hierarchical representation produced by Extract-$\xi$ can, as opposed to the ExtractDBSCAN method, contain clusters of varying densities.
With its two ways of extracting clusters from the ordering, through either the global $\epsilon'$ or the relative $\xi$ threshold, OPTICS can be seen as a generalization of DBSCAN.
In contexts where one wants to find clusters of similar density, OPTICS's ExtractDBSCAN yields a DBSCAN-like solution, while in other contexts Extract-$\xi$ can generate a hierarchy representing clusters of varying density.
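Both extraction methods are available in the \pkg{dbscan} package and can be applied to the same OPTICS ordering. The following sketch is purely illustrative (the parameter values are arbitrary and \code{x} is an assumed data matrix):

<<eval=FALSE>>=
opt <- optics(x, eps = 10, minPts = 10)       # compute the augmented ordering
res_eps <- extractDBSCAN(opt, eps_cl = 0.065) # global threshold epsilon'
res_xi  <- extractXi(opt, xi = 0.05)          # hierarchical xi extraction
@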
It is thus interesting to note that while DBSCAN has reached critical acclaim, even motivating numerous extensions~\citep{rehman2014dbscan}, OPTICS has received decidedly less attention.
Perhaps one reason for this is that the Extract-$\xi$ method for grouping points into clusters has gone largely unnoticed, as it is not implemented in most open-source software packages that advertise an implementation of OPTICS.
This includes implementations in WEKA~\citep{hall2009weka}, SPMF~\citep{fournier2014spmf}, and the PyClustering~\citep{PyCluste54:online} and scikit-learn~\citep{pedregosa2011scikit} libraries for \proglang{Python}.
To the best of our knowledge, the only other open-source library offering a complete implementation of OPTICS is ELKI~\citep{DBLP:journals/pvldb/SchubertKEZSZ15}, written in \proglang{Java}.
In fact, perhaps due to the incomplete implementations of OPTICS cluster extraction across various software libraries, there has been some confusion regarding the usage of OPTICS and the benefits it offers compared to DBSCAN.
Several papers motivate DBSCAN extensions or devise new algorithms by citing OPTICS as incapable of finding density-heterogeneous clusters~\citep{ghanbarpour2014exdbscan,chowdhury2010efficient,Gupta2010,duan2007local}.
Along the same line of thought, others cite OPTICS as capable of finding clusters of varying density, but either use the DBSCAN-like global density threshold extraction method or refer to OPTICS as a clustering algorithm without mentioning which cluster extraction method was used in their experimentation~\citep{verma2012comparative,roy2005approach,liu2007vdbscan,pei2009decode}.
However, OPTICS fundamentally returns an ordering of the data which can be post-processed to extract either (1) a flat clustering with clusters of relatively similar density or (2) a cluster hierarchy, which adapts to the local densities within the data.
To clear up this confusion, it is important to add complete implementations to existing software packages and to introduce new complete implementations of OPTICS like the \proglang{R} package~\pkg{dbscan} described in this paper.

\subsection{Current implementations of DBSCAN and OPTICS}\label{sec:review}

Implementations of DBSCAN and/or OPTICS are available in many statistical software packages.
We focus here on open-source solutions.
These include the Waikato Environment for Knowledge Analysis (WEKA)~\citep{hall2009weka}, the Sequential Pattern Mining Framework (SPMF)~\citep{fournier2014spmf}, the Environment for Developing KDD-Applications Supported by Index-Structures (ELKI)~\citep{DBLP:journals/pvldb/SchubertKEZSZ15}, the \proglang{Python} library scikit-learn~\citep{pedregosa2011scikit}, the PyClustering data mining library~\citep{PyCluste54:online}, the Flexible Procedures for Clustering (fpc) \proglang{R} package~\citep{fpc}, and the \pkg{dbscan} package~\citep{dbscan-R} introduced in this paper.
\begin{table}
\begin{tabularx}{\textwidth}{ c c c c c X }
\hline
{\bf Library} & {\bf DBSCAN} & {\bf OPTICS} & {\bf ExtractDBSCAN} & {\bf Extract-$\xi$} & \\
\hline \rule{0pt}{3ex}
\pkg{dbscan} & \cmark & \cmark & \cmark & \cmark & \\
ELKI & \cmark & \cmark & \cmark & \cmark & \\
SPMF & \cmark & \cmark & \cmark & & \\
PyClustering & \cmark & \cmark & \cmark & & \\
WEKA & \cmark & \cmark & \cmark & & \\
scikit-learn & \cmark & & & & \\
fpc & \cmark & & & & \\
\hline
\end{tabularx}
\vspace{2mm}
\begin{tabularx}{\textwidth}{ c c c c X }
\hline
{\bf Library} & {\bf Index Acceleration} & {\bf Dendrogram for OPTICS} & {\bf Language} & \\
\hline \rule{0pt}{3ex}
\pkg{dbscan} & \cmark & \cmark & \proglang{R} & \\
ELKI & \cmark & \cmark & \proglang{Java} & \\
SPMF & \cmark & & \proglang{Java} & \\
PyClustering & \cmark & & \proglang{Python} & \\
WEKA & & & \proglang{Java} & \\
scikit-learn & \cmark & & \proglang{Python} & \\
fpc & & & \proglang{R} & \\
\hline
\end{tabularx}
\caption{A comparison of DBSCAN and OPTICS implementations in various open-source statistical software libraries. A \cmark \ symbol denotes availability.}
\label{tab:comp}
\end{table}

Table~\ref{tab:comp} presents a comparison of the features offered by these packages.
All packages support DBSCAN, and most use index acceleration to speed up the $\epsilon$-neighborhood queries involved in both the DBSCAN and OPTICS algorithms.
These queries are the known bottleneck that typically dominates the runtime, and accelerating them is essential for processing larger data sets.
\pkg{dbscan} is the first \proglang{R} implementation offering this improvement.
OPTICS with ExtractDBSCAN is also widely implemented, but the Extract-$\xi$ method, as well as the use of dendrograms with OPTICS, is only available in \pkg{dbscan} and ELKI.
A small experimental runtime comparison is provided in Section~\ref{sec:eval}.

\section{The dbscan package}\label{sec:dbscan}

The package \pkg{dbscan} provides high-performance code for DBSCAN and OPTICS through a \proglang{C++} implementation (interfaced via the \pkg{Rcpp} package by \cite{eddelbuettel2011rcpp}) using the $k$-d tree data structure implemented in the \proglang{C++} library ANN~\citep{mount1998ann} to improve the speed of $k$ nearest neighbor (kNN) and fixed-radius nearest neighbor search.
DBSCAN and OPTICS share a similar interface.

\begin{Schunk}
\begin{Sinput}
dbscan(x, eps, minPts = 5, weights = NULL, borderPoints = TRUE, ...)
optics(x, eps, minPts = 5, ...)
\end{Sinput}
\end{Schunk}

The first argument \code{x} is the data set in form of a \code{data.frame} or a \code{matrix}.
The implementations use Euclidean distance by default for the neighborhood computation.
Alternatively, a precomputed set of pairwise distances between data points stored in a \code{dist} object can be supplied.
With precomputed distances, arbitrary distance metrics can be used; note, however, that $k$-d trees cannot be used for distance data, so lists of nearest neighbors are precomputed instead.
For \code{dbscan()} and \code{optics()}, the parameter \code{eps} represents the radius of the $\epsilon$-neighborhood considered for density estimation and \code{minPts} represents the density threshold to identify core points.
Note that \code{eps} is not strictly necessary for OPTICS; it is only used as an upper limit for the considered neighborhood size to reduce computational complexity.
\code{dbscan()} can also use weights for the data points in \code{x}.
The density in a neighborhood is then the sum of the weights of the points inside the neighborhood.
By default, each data point has a weight of one, so the density estimate for a neighborhood is just the number of data points inside it.
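The interface options just described can be sketched as follows (this example is not from the original text; the metric, the weights, and the parameter values are purely illustrative, and \code{x} is an assumed data matrix):

<<eval=FALSE>>=
## precomputed distances allow an arbitrary metric
d <- dist(x, method = "manhattan")
res_d <- dbscan(d, eps = 0.1, minPts = 3)

## point weights: the density is the sum of weights in the neighborhood
w <- rep(1, nrow(x))
w[1:10] <- 0.5                       # hypothetically down-weight some points
res_w <- dbscan(x, eps = 0.06, minPts = 3, weights = w)
@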
Using weights, the importance of individual points can be changed.
The original DBSCAN implementation assigns each border point to the first cluster it is found to be density-reachable from.
Since this may result in different clustering results if the data points are processed in a different order, \cite{campello2015hierarchical} suggest for DBSCAN* to consider border points as noise.
This behavior can be selected with \code{borderPoints = FALSE}.
All functions accept additional arguments in~\code{...}.
These arguments are passed on to the fixed-radius nearest neighbor search.
More details about the implementation of the nearest neighbor search are presented in Section~\ref{sec:nn} below.
Clusters can be extracted from the linear order produced by OPTICS.
The \pkg{dbscan} implementations of the cluster extraction methods ExtractDBSCAN and Extract-$\xi$ are:

\begin{Schunk}
\begin{Sinput}
extractDBSCAN(object, eps_cl)
extractXi(object, xi, minimum = FALSE, correctPredecessor = TRUE)
\end{Sinput}
\end{Schunk}

\code{extractDBSCAN()} extracts a clustering from an OPTICS ordering that is similar to what DBSCAN would produce with a single global $\epsilon$ set to \code{eps_cl}.
\code{extractXi()} extracts clusters hierarchically based on the steepness of the reachability plot.
\code{minimum} controls whether only the minimal (non-overlapping) clusters are extracted.
\code{correctPredecessor} corrects a known artifact of the original $\xi$ method presented in~\cite{ankerst1999optics} by pruning the steep up area for points that have predecessors not in the cluster (see the Technical Note in Appendix~\ref{sec:technote} for details).

\subsection{Nearest Neighbor Search}\label{sec:nn}

The density-based algorithms in \pkg{dbscan} rely heavily on forming neighborhoods, i.e., finding all points belonging to an $\epsilon$-neighborhood.
A simple approach is to perform a linear search, i.e., to always calculate the distances to all other points to find the closest ones.
This requires $O(n)$ operations, with $n$ being the number of data points, each time a neighborhood is needed.
Since DBSCAN and OPTICS process each data point once, this results in an $O(n^2)$ runtime complexity.
A convenient way in \proglang{R} is to compute a distance matrix with all pairwise distances between points and to sort the distances for each point (row in the distance matrix) to precompute the nearest neighbors of each point.
However, this method has the drawback that the size of the full distance matrix is $O(n^2)$, and it becomes very large and slow to compute for medium to large data sets.
In order to avoid computing the complete distance matrix, \pkg{dbscan} relies on a space-partitioning data structure called a $k$-d tree~\citep{bentley1975multidimensional}.
This data structure allows \pkg{dbscan} to identify the kNN or all neighbors within a fixed radius \code{eps} more efficiently, using on average only $O(\log{n})$ operations per query.
This results in a reduced runtime complexity of $O(n \log{n})$.
However, note that $k$-d trees are known to degenerate for high-dimensional data, requiring $O(n)$ operations per query and leading to a performance no better than linear search.
Fast kNN search and fixed-radius nearest neighbor search are used in DBSCAN and OPTICS, but we also provide a direct interface in \pkg{dbscan}, since they are useful in their own right.
\begin{Schunk}
\begin{Sinput}
kNN(x, k, sort = TRUE, search = "kdtree", bucketSize = 10,
  splitRule = "suggest", approx = 0)
frNN(x, eps, sort = TRUE, search = "kdtree", bucketSize = 10,
  splitRule = "suggest", approx = 0)
\end{Sinput}
\end{Schunk}

The interfaces only differ in that \code{kNN()} requires specifying the number of neighbors \code{k}, while \code{frNN()} needs the radius \code{eps}.
All other arguments are the same.
\code{x} is the data, and the result will be a list of the neighbors in \code{x} for each point in \code{x}.
\code{sort} controls whether the returned points are sorted by distance.
\code{search} controls which search method is used.
Available search methods are \code{"kdtree"}, \code{"linear"} and \code{"dist"}.
The linear search method does not build a search data structure but performs a complete linear search to find the nearest neighbors.
The \code{"dist"} method precomputes a dissimilarity matrix, which is very fast for small data sets but problematic for large ones.
The default method is to build a $k$-d tree.
$k$-d trees are implemented in \proglang{C++} using a modified version of the ANN library~\citep{mount1998ann} compiled for Euclidean distances.
The parameters \code{bucketSize}, \code{splitRule} and \code{approx} are algorithmic parameters which control the way the $k$-d tree is built.
\code{bucketSize} controls the maximal size of the $k$-d tree's leaf nodes.
\code{splitRule} specifies how the $k$-d tree partitions the data space.
We use \code{"suggest"}, which uses the best guess of the ANN library given the data.
Setting \code{approx} to a value greater than zero uses approximate NN search: only nearest neighbors up to a distance of a factor of $(1+\mathrm{approx})\,\mathrm{eps}$ will be returned, and some actual neighbors may be omitted, potentially leading to spurious clusters and noise points.
However, the algorithm will enjoy a significant speedup.
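For illustration, the nearest neighbor interface can be used directly as in the following sketch (the data matrix \code{x} and the parameter values are assumed for illustration). \code{kNN()} returns the neighbor indices and distances as matrices, while \code{frNN()} returns lists, since neighborhood sizes vary between points:

<<eval=FALSE>>=
nn <- kNN(x, k = 3)       # 3 nearest neighbors of every point
nn$id[1, ]                # neighbor indices of the first point
nn$dist[1, ]              # the corresponding distances

fr <- frNN(x, eps = 0.06) # fixed-radius neighborhoods
fr$id[[1]]                # neighbors of the first point within eps
@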
For more details, we refer the reader to the documentation of the ANN library~\citep{mount1998ann}. \code{dbscan()} and \code{optics()} use \code{frNN()} internally, and the additional arguments in~\code{...} are passed on to the nearest neighbor search method.

%
\section{Using the dbscan package}

\subsection{Clustering with DBSCAN}

We use a very simple artificial data set of four slightly overlapping Gaussians in two-dimensional space with 100 points each. We load \pkg{dbscan}, set the seed of the random number generator to make the results reproducible, and create the data set.

<>=
options(width = 75)
@

<<>>=
library("dbscan")
set.seed(2)
n <- 400
x <- cbind(
  x = runif(4, 0, 1) + rnorm(n, sd = 0.1),
  y = runif(4, 0, 1) + rnorm(n, sd = 0.1)
  )
true_clusters <- rep(1:4, times = 100)
@

<>=
plot(x, col = true_clusters, pch = true_clusters)
@

\begin{figure}
\centering
\includegraphics[width=8cm]{dbscan-sampleData}
\caption{The sample dataset, consisting of 4 noisy Gaussian distributions with slight overlap.}
\label{fig:sampleData}
\end{figure}

The resulting data set is shown in Figure~\ref{fig:sampleData}. To apply DBSCAN, we need to decide on the neighborhood radius~\code{eps} and the density threshold~\code{minPts}. A rule of thumb for \code{minPts} is to use at least the number of dimensions of the data set plus one, which in our case is 3. For \code{eps}, we can plot the points' kNN distances (i.e., the distance to the $k$th nearest neighbor) in decreasing order and look for a knee in the plot. The idea behind this heuristic is that points located inside clusters will have a small $k$-nearest neighbor distance, because they are close to other points in the same cluster, while noise points are isolated and will have a rather large kNN distance. \pkg{dbscan} provides the function \code{kNNdistplot()} to make this easier. For $k$ we use \code{minPts} - 1 since DBSCAN's \code{minPts} includes the data point itself, while the $k$th nearest neighbor distance does not.
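The heuristic can also be examined numerically before plotting. As a sketch, \code{kNNdist()} (the function underlying \code{kNNdistplot()}) returns the kNN distances, which can be sorted to inspect candidate knee locations:

\begin{Schunk}
\begin{Sinput}
d <- sort(kNNdist(x, k = 2), decreasing = TRUE)
head(d)    # the largest 2-NN distances belong to isolated noise points
\end{Sinput}
\end{Schunk}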
<>=
kNNdistplot(x, k = 2)
abline(h = .06, col = "red", lty = 2)
@

\begin{figure}
\centering
\includegraphics{dbscan-kNNdistplot}
\caption{$k$-Nearest Neighbor Distance plot.}
\label{fig:kNNdistplot}
\end{figure}

The kNN distance plot is shown in Figure~\ref{fig:kNNdistplot}. A knee is visible at around a 2-NN distance of 0.06. We have manually added a horizontal line for reference. Now we can perform the clustering with the chosen parameters.

<<>>=
res <- dbscan(x, eps = 0.06, minPts = 3)
res
@

The resulting clustering identified one large cluster with 191 member points, two medium clusters with around 90 points each, several very small clusters, and 15 noise points (represented by cluster id 0). The available fields can be directly accessed using the list extraction operator \code{\$}. For example, the cluster assignment information can be used to plot the data with the clusters identified by different labels and colors.

<>=
plot(x, col = res$cluster + 1L, pch = res$cluster + 1L)
@

\begin{figure}
\centering
\includegraphics[width=9cm]{dbscan-dbscanPlot}
\caption{Result of clustering with DBSCAN. Noise is represented as black circles.}
\label{fig:dbscanPlot}
\end{figure}

The scatter plot in Figure~\ref{fig:dbscanPlot} shows that the clustering algorithm correctly identified the upper two clusters, but merged the lower two clusters because the region between them has a high enough density. The small clusters are isolated groups of three points (just satisfying $\mathit{minPts}$), and the noise points are isolated points. These small clusters can be suppressed by using a larger value for \code{minPts}. \pkg{dbscan} also provides a plot that adds convex cluster hulls to the scatter plot, shown in Figure~\ref{fig:dbscanHullPlot}.

<>=
hullplot(x, res)
@

\begin{figure}
\centering
\includegraphics[width=9cm]{dbscan-dbscanHullPlot}
\caption{Convex hull plot of the DBSCAN clustering. Noise points are black.
Note that noise points and points of another cluster may lie within the convex hull of a different cluster.}
\label{fig:dbscanHullPlot}
\vspace{0.1cm}
\end{figure}

A clustering can also be used to find out to which clusters new data points would be assigned using \code{predict(object, newdata = NULL, data, ...)}. The predict method uses nearest neighbor assignment to core points and needs the original dataset. Additional parameters (\code{...}) are passed on to the nearest neighbor search method. Here we obtain the cluster assignment for the first 25 data points. Note that an assignment to cluster~0 means that the data point is considered noise because it is not close enough to a core point.

<<>>=
predict(res, x[1:25,], data = x)
@

\subsection{Clustering with OPTICS}

Unless OPTICS is used purely to extract a DBSCAN clustering, its parameters have a different effect than for DBSCAN: \code{eps} is typically chosen rather large (we use 10 here), and \code{minPts} mostly affects the core and reachability-distance calculations, where larger values have a smoothing effect. We also use 10, i.e., the core-distance is defined as the distance to the 9th nearest neighbor (spanning a neighborhood of 10 points).

<<>>=
res <- optics(x, eps = 10, minPts = 10)
res
@

OPTICS is an augmented ordering algorithm, which stores the computed order of the points in the \code{order} element of the returned object.

<<>>=
head(res$order, n = 15)
@

This means that data point 1 in the data set is the first in the order, data point 363 is the second, and so forth. The density-based order produced by OPTICS can be directly plotted as a reachability plot.

<>=
plot(res)
@

\begin{figure}
\centering
\includegraphics{dbscan-opticsReachPlot}
\caption{OPTICS reachability plot. Note that the first reachability value is always UNDEFINED.}
\label{fig:opticsReachPlot}
\end{figure}

The reachability plot in Figure~\ref{fig:opticsReachPlot} shows the reachability distance for the points ordered by OPTICS.
Valleys represent potential clusters separated by peaks. Very high peaks may indicate noise points. To visualize the order on the original data set, we can plot a line connecting the points in order.

<>=
plot(x, col = "grey")
polygon(x[res$order, ])
@

\begin{figure}
\centering
\includegraphics[width=8cm]{dbscan-opticsOrder}
\caption{OPTICS order of data points represented as a line.}
\label{fig:opticsOrder}
\end{figure}

Figure~\ref{fig:opticsOrder} shows that points in each cluster are visited in consecutive order, starting with the points in the center (the densest region) and then the points in the surrounding area.

As noted in Section~\ref{sec:optics}, OPTICS has two primary cluster extraction methods that use the ordered reachability structure it produces. A DBSCAN-type clustering can be extracted using \code{extractDBSCAN()} by specifying the global \code{eps} parameter. The reachability plot in Figure~\ref{fig:opticsReachPlot} shows four peaks, i.e., points with a high reachability-distance. These points indicate the boundaries between the four clusters. An \code{eps} threshold that separates the four clusters can be determined visually. In this case we use an \code{eps\_cl} of 0.065.

<>=
res <- extractDBSCAN(res, eps_cl = .065)
plot(res)
@

<>=
hullplot(x, res)
@

\begin{figure}
\centering
\includegraphics{dbscan-extractDBSCANReachPlot2}
\caption{Reachability plot for a DBSCAN-type clustering extracted at a global $\epsilon = 0.065$, resulting in four clusters.}
\label{fig:extractDBSCANReachPlot2}
\centering
\includegraphics[width=9cm]{dbscan-extractDBSCANHullPlot2}
\caption{Convex hull plot for a DBSCAN-type clustering extracted at a global $\epsilon = 0.065$, resulting in four clusters.}
\label{fig:extractDBSCANHullPlot2}
\end{figure}

The resulting reachability plot and corresponding clusters are shown in Figures~\ref{fig:extractDBSCANReachPlot2} and \ref{fig:extractDBSCANHullPlot2}.
The clustering closely resembles the original structure of the four clusters with which the data were generated, with the only difference being that points on the boundary of the clusters are marked as noise points.

\pkg{dbscan} also provides \code{extractXi()} to extract a hierarchical cluster structure. We use here a \code{xi} value of 0.05.

<<>>=
res <- extractXi(res, xi = 0.05)
res
@

The $\xi$ method results in a hierarchical clustering structure, and thus points can be members of several nested clusters. Clusters are represented as contiguous ranges in the reachability plot and are available in the field \code{clusters\_xi}.

<<>>=
res$clusters_xi
@

Here we have seven clusters. The clusters are also visible in the reachability plot.

<>=
plot(res)
@

<>=
hullplot(x, res)
@

\begin{figure}
\centering
\includegraphics{dbscan-extractXiReachPlot}
\caption{Reachability plot of a hierarchical clustering extracted with Extract-$\xi$.}
\label{fig:extractXiReachPlot}
%\end{figure}
%\begin{figure}[htb]
\centering
\includegraphics[width=9cm]{dbscan-extractXiHullPlot}
\caption{Convex hull plot of a hierarchical clustering extracted with Extract-$\xi$.}
\label{fig:extractXiHullPlot}
\end{figure}

Figure~\ref{fig:extractXiReachPlot} shows the reachability plot with clusters represented using colors and vertical bars below the plot. The clusters themselves can also be plotted with the convex hull plot function, shown in Figure~\ref{fig:extractXiHullPlot}. Note how the nested structure is shown by clusters inside of clusters. Also note that the convex hull, while useful for visualization, may contain points that are not considered part of the cluster.
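The ranges in \code{clusters\_xi} can be processed further. For example, the span of each $\xi$-cluster, i.e., the number of points between its \code{start} and \code{end} position in the OPTICS order, can be computed directly (a sketch, assuming the \code{start} and \code{end} columns of the data frame shown above):

\begin{Schunk}
\begin{Sinput}
cl <- res$clusters_xi
with(cl, end - start + 1)    # number of points spanned by each cluster
\end{Sinput}
\end{Schunk}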
%\subsection{LOF} %The Local Outlier Factor score can be computed as follows %\ifdefined\USESWEAVE %<<>>= %lof <- lof(x, k=3) %summary(lof) %@ %The distribution of outlier factors can be view simply using the specialized hist function: %<>= %hist(lof, breaks=20) %@ %\begin{figure} % \centering % \includegraphics{dbscan-LOF_hist} % \caption{LOF outlier histogram.} % \label{fig:LOF_hist} %\end{figure} % %The outlier factor can be visualized in a scatter plot through the following: %<>= %plot(x, pch = ".", main = "LOF (k=3)") %points(x, cex = (lof-1)*3, pch = 1, col="red") %text(x[lof>2,], labels = round(lof, 1)[lof>2], pos = 3) %@ %\begin{figure} % \centering % \includegraphics[width=9cm]{dbscan-LOF_plot} % \caption{Visualization of the local outlier factor of each point in the data set.} % \label{fig:LOF_plot} %\end{figure} %\else %\fi \subsection{Reachability and Dendrograms} %The \pkg{dbscan} package contains a variety of visualization options. Reachability plots can be converted into equivalent dendrograms \citep{sander2003automatic}. \pkg{dbscan} contains a fast implementation of the reachability-to-dendrogram conversion algorithm through the use of a disjoint-set data structure~\citep{cormen2001introduction, patwary2010experiments}, allowing the user to choose which hierarchical representation they prefer. The conversion algorithm can be directly called for OPTICS objects using the coercion method \code{as.dendrogram()}. <<>>= dend <- as.dendrogram(res) dend @ The dendrogram can be plotted using the standard plot method. <>= plot(dend, ylab = "Reachability dist.", leaflab = "none") @ \begin{figure}[t] \centering \includegraphics{dbscan-opticsDendrogram} \caption{Dendrogram structure of OPTICS reordering.} \label{fig:opticsDendrogram} \end{figure} Note how the dendrogram in Figure~\ref{fig:opticsDendrogram} closely resembles the reachability plots with added binary splits. 
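Because the result is a base \proglang{R} dendrogram, standard \pkg{stats} functionality applies. For example, \code{cut()} splits the dendrogram at a given reachability height (a sketch; the height 0.065 reuses the threshold from the DBSCAN-type extraction above):

\begin{Schunk}
\begin{Sinput}
parts <- cut(dend, h = 0.065)
length(parts$lower)    # number of branches below the cut height
\end{Sinput}
\end{Schunk}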
Since the object is a standard dendrogram (from package \pkg{stats}), it can be used like any other dendrogram created with hierarchical clustering.

\section{Performance Comparison}\label{sec:eval}

\begin{table}
\begin{center}
\begin{tabular}{ c c c }
\hline
{\bf Data set} & \bf{Size} & \bf{Dimension}\\
\hline
Aggregation & 788 & 2\\
Compound & 399 & 2\\
D31 & 3,100 & 2 \\
flame & 240 & 2 \\
jain & 373 & 2 \\
pathbased & 300 & 2 \\
R15 & 600 & 2 \\
s1 & 5,000 & 2 \\
s4 & 5,000 & 2 \\
spiral & 312 & 2\\
t4.8k & 8,000 & 2 \\
synth1 & 1,000 & 3 \\
synth2 & 1,000 & 10 \\
synth3 & 1,000 & 100 \\
\hline
\end{tabular}
\end{center}
\caption{Datasets used for comparison.}
\label{tab:dsizes}
\end{table}

Finally, we evaluate the performance of \pkg{dbscan}'s implementation of DBSCAN and OPTICS against other open-source implementations. This is not a comprehensive evaluation study, but demonstrates the performance of \pkg{dbscan}'s DBSCAN and OPTICS implementations on datasets of varying sizes as compared to other software packages. A comparative test was performed using both the DBSCAN and OPTICS algorithms, where supported, for the libraries listed in Table~\ref{tab:comp} on page~\pageref{tab:comp}. The datasets used and their sizes are listed in Table~\ref{tab:dsizes}. The data sets tested include s1 and s4, randomly generated but moderately separated Gaussian clusters often used for agglomerative cluster analysis~\citep{Ssets}, the R15 validation data set used for the maximum variance based clustering approach by \cite{veenman2002maximum}, and the well-known spatial data set t4.8k used for validation of the CHAMELEON algorithm~\citep{karypis1999chameleon}, along with a variety of shape data sets commonly found in clustering validation papers~\citep{gionis2007clustering, zahn1971graph, chang2008robust, jain2005law, fu2007flame}.
In 2019, we performed a comparison between \pkg{dbscan} 0.9-8, \pkg{fpc} 2.1-10, ELKI version 0.7, PyClustering 0.6.6, SPMF v2.10, WEKA 3.8.0, and SciKit-Learn 0.17.1 on a MacBook Pro equipped with a 2.5 GHz Intel Core i7 processor, running OS X El Capitan 10.11.6. Note that newer versions of all mentioned software packages have been released since then; changes in data structures and added optimizations may have resulted in significant runtime improvements for individual packages. All data sets were normalized to the unit interval, $[0, 1]$, per dimension to standardize neighbor queries. For all data sets we used $\mathit{minPts} = 2$ and $\epsilon = 0.10$ for DBSCAN. For OPTICS, $\mathit{minPts} = 2$ with a large $\epsilon = 1$ was used. We replicated each run for each data set 15 times and report the average runtime here.

Figures~\ref{fig:dbscan_bench} and \ref{fig:optics_bench} show the runtimes. The datasets are sorted from easiest to hardest, and the algorithms in the legend are sorted from fastest to slowest on average. Dimensionality, the distance function used, data set size, and other data characteristics have a substantial impact on runtime performance. The results show that the implementation in \pkg{dbscan} compares very favorably to the other implementations (but note that we did not enable data indexing in ELKI, and used a very small $\mathit{minPts}$).

\begin{figure}
\centering
\includegraphics[width=0.80\textwidth]{figures/dbscan_benchmark}
\caption{Runtime of DBSCAN in milliseconds (y-axis, logarithmic scale) vs. the name of the data set tested (x-axis).}
\label{fig:dbscan_bench}
\end{figure}

\begin{figure}
\centering
\includegraphics[width=0.80\textwidth]{figures/optics_benchmark}
\caption{Runtime of OPTICS in milliseconds (y-axis, logarithmic scale) vs.
the name of the data set tested (x-axis).}
\label{fig:optics_bench}
\end{figure}

%\clearpage
\section{Concluding Remarks}\label{sec:conc}

The \pkg{dbscan} package offers a set of scalable, robust, and complete implementations of popular density-based clustering algorithms from the DBSCAN family. The main features of \pkg{dbscan} are a simple interface to fast clustering and cluster extraction algorithms, as well as extensible data structures and methods for both density-based clustering visualization and representation, including efficient conversion algorithms between the OPTICS ordering and dendrograms. In addition to DBSCAN and OPTICS discussed in this paper, \pkg{dbscan} also contains a fast version of the local outlier factor (LOF) algorithm~\citep{breunig2000lof}, and an implementation of HDBSCAN~\citep{campello2015hierarchical} is under development.

\section{Acknowledgments}

This work is partially supported by industrial and government partners at the Center for Surveillance Research, a National Science Foundation I/UCRC.

%\clearpage
\bibliography{dbscan}

\clearpage
\appendix

\section{Technical Note on OPTICS cluster extraction}\label{sec:technote}

Of the two cluster extraction methods outlined in the original publication, the flat DBSCAN-type extraction method seems to remain the de facto clustering method implemented for OPTICS across most statistical software. However, this method does not provide any advantage over the original DBSCAN method. To the best of the authors' knowledge, the only (other) library that has implemented the Extract-$\xi$ method for finding $\xi$-clusters is the Environment for Developing KDD-Applications Supported by Index Structures (ELKI) \citep{DBLP:journals/pvldb/SchubertKEZSZ15}.
Much of the reason why nearly every statistical computing framework has neglected the Extract-$\xi$ cluster method may stem from the fact that the original specification (Figure~19 in~\cite{ankerst1999optics}), while mostly complete, lacks important corrections that otherwise produce artifacts when clustering data~\citep{DBLP:conf/lwa/SchubertG18}. In the original specification of the algorithm, the `dents' of the ordering structure OPTICS produces are scanned for significant changes in reachability (hence the $\xi$ threshold), where clusters are represented by contiguous ranges of points that are distinguished by $1 - \xi$ density-reachability changes in the reachability plot. It is possible, however, after the recursive completion of the \code{update} algorithm (Figure~7 in~\cite{ankerst1999optics}), that the next point processed in the ordering is not actually within the reachability distance of the other members of the cluster currently being processed.

To account for the missing details described above, Erich Schubert introduced a small postprocessing step, first added in the ELKI framework and published much later~\citep{DBLP:conf/lwa/SchubertG18}. This filter corrects the artifacts based on the predecessor of each point~\citep{DBLP:conf/lwa/SchubertG18}, thus improving on the $\xi$-cluster method as described in the original OPTICS paper. This correction was not introduced until version 0.7.0 of the ELKI framework, released in 2015, 16 years after the original publication of OPTICS and the Extract-$\xi$ method, and was not published in written form until 2018. \pkg{dbscan} has incorporated these important changes in \code{extractXi()} via the option \code{correctPredecessors}, which is enabled by default.
%% Not included to keep things simple % To further complicate the status of the \opxi algorithm's existing % implementations, the current ELKI implementation, aside from the predecessor % correction, does not match the original specification of the OPTICS algorithm. % Mentioned by~\cite{ankerst1999optics}, \opxi should not include the last % point of a steep-up area inside of each cluster range\footnote{We alerted the % authors of ELKI to our correction, which is to be included in the next major % release.}. The differences on even a small, randomly generated dataset % are shown on Figures~\ref{fig:dbscan_xi} and \ref{fig:elki_xi} using the % \pkg{dbscan} package result. Thus, \pkg{dbscan} offers complete, a correct % \opxi implementation, true to the original specification. % % \begin{figure} % \centering % \begin{minipage}[t]{0.48\textwidth} % \includegraphics[width=\textwidth]{figures/dbscan_xi_bare} % \caption{Excluding the last point in the steep-up area.} % \label{fig:dbscan_xi} % \end{minipage} % \hfill % \begin{minipage}[t]{0.48\textwidth} % \includegraphics[width=\textwidth]{figures/elki_xi_bare} % \caption{Including the last point in the steep-up area. Note the sharp edges caused by points that are clearly not density-connected to their respective clusters.} % \label{fig:elki_xi} % \end{minipage} % \end{figure} % % Much of the complication stems from the fact that the original specification of the \opxi extraction method defined in the paper (Figure 19 of~\cite{}), while mostly complete, lacks important corrections that otherwise produces many artifacts when clustering data. In the original specification of the \opxi algorithm, points within the ``dents'' of the ordering structure represent collections of spatially dense neighborhoods. Its possible, however, after OPTICS finishes ordering a spatially close cluster, that the next point included in the ordering may not be a member of current cluster (there are no more points in the current cluster to add). 
This can be remedied by pruning an area of each cluster known as the steep-up area (see Figure 19 in \citep{ankerst1999optics} for details) of points that do not contain predecessors within the same cluster. \end{document} ================================================ FILE: vignettes/dbscan.bib ================================================ @Article{hahsler2019dbscan, title = {{dbscan}: Fast Density-Based Clustering with {R}}, author = {Michael Hahsler and Matthew Piekenbrock and Derek Doran}, journal = {Journal of Statistical Software}, year = {2019}, volume = {91}, number = {1}, pages = {1--30}, doi = {10.18637/jss.v091.i01}, } @inproceedings{ester1996density, title={A density-based algorithm for discovering clusters in large spatial databases with noise.}, author={Ester, Martin and Kriegel, Hans-Peter and Sander, J{\"o}rg and Xu, Xiaowei and others}, booktitle={Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)}, pages={226--231}, year={1996}, url = {https://dl.acm.org/doi/10.5555/3001460.3001507} } @Manual{dbscan-R, title = {dbscan: Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms}, author = {Michael Hahsler and Matthew Piekenbrock}, note = {R package version 0.9-8.2}, year={2016} } %% Original OPTICS paper %% ----------------------------------------------------------------------------- @inproceedings{ankerst1999optics, title={OPTICS: ordering points to identify the clustering structure}, author={Ankerst, Mihael and Breunig, Markus M and Kriegel, Hans-Peter and Sander, J{\"o}rg}, booktitle={ACM Sigmod Record}, volume={28}, number={2}, pages={49--60}, year={1999}, organization={ACM}, doi = {10.1145/304181.304187} } % OPTICS cluster extraction improvements % ----------------------------------------------------------------------------- @inproceedings{DBLP:conf/lwa/SchubertG18, author = {Erich Schubert and Michael Gertz}, title = {Improving the Cluster Structure Extracted from {OPTICS} 
Plots}, booktitle = {Lernen, Wissen, Daten, Analysen (LWDA 2018)}, series = {{CEUR} Workshop Proceedings}, volume = {2191}, pages = {318--329}, publisher = {CEUR-WS.org}, year = {2018} } % Original LOF paper % ----------------------------------------------------------------------------- @inproceedings{breunig2000lof, title={LOF: identifying density-based local outliers}, author={Breunig, Markus M and Kriegel, Hans-Peter and Ng, Raymond T and Sander, J{\"o}rg}, booktitle={ACM Int. Conf. on Management of Data}, volume={29}, number={2}, pages={93--104}, year={2000}, organization={ACM}, doi = {10.1145/335191.335388} } % 2003 Reachability <--> Dendrograms Conversions Paper % ----------------------------------------------------------------------------- @inproceedings{sander2003automatic, title={Automatic extraction of clusters from hierarchical clustering representations}, author={Sander, J{\"o}rg and Qin, Xuejie and Lu, Zhiyong and Niu, Nan and Kovarsky, Alex}, booktitle={Pacific-Asia Conference on Knowledge Discovery and Data Mining}, pages={75--87}, year={2003}, organization={Springer} } % Original BIRCH paper % ----------------------------------------------------------------------------- @inproceedings{zhang96, title={BIRCH: an efficient data clustering method for very large databases}, author={Zhang, Tian and Ramakrishnan, Raghu and Livny, Miron}, booktitle={ACM Sigmod Record}, volume={25}, number={2}, pages={103--114}, year={1996}, organization={ACM} } % GDBSCAN Paper (Generalized DBSCAN, by Sanders) % ----------------------------------------------------------------------------- @article{sander1998density, title={Density-based clustering in spatial databases: The algorithm gdbscan and its applications}, author={Sander, J{\"o}rg and Ester, Martin and Kriegel, Hans-Peter and Xu, Xiaowei}, journal={Data mining and knowledge discovery}, volume={2}, number={2}, pages={169--194}, year={1998}, publisher={Springer} } % HDBSCAN* Newest Paper % 
----------------------------------------------------------------------------- @article{campello2015hierarchical, title={Hierarchical density estimates for data clustering, visualization, and outlier detection}, author={Campello, Ricardo JGB and Moulavi, Davoud and Zimek, Arthur and Sander, Joerg}, journal={ACM Transactions on Knowledge Discovery from Data (TKDD)}, volume={10}, number={1}, pages={5}, year={2015}, publisher={ACM}, doi = {10.1145/2733381} } % First HDBSCAN* introduction paper, later revised in 2015. The newer one is better. % ----------------------------------------------------------------------------- @inproceedings{campello2013density, title={Density-based clustering based on hierarchical density estimates}, author={Campello, Ricardo JGB and Moulavi, Davoud and Sander, J{\"o}rg}, booktitle={Pacific-Asia Conference on Knowledge Discovery and Data Mining}, pages={160--172}, year={2013}, organization={Springer}, doi = {10.1007/978-3-642-37456-2_14} } % The new-ish 'Standard Methodology' paper of that 'tackles the methodological drawbacks' % of internal clustering validation % ----------------------------------------------------------------------------- @article{gurrutxaga2011towards, title={Towards a standard methodology to evaluate internal cluster validity indices}, author={Gurrutxaga, Ibai and Muguerza, Javier and Arbelaitz, Olatz and P{\'e}rez, Jes{\'u}s M and Mart{\'\i}n, Jos{\'e} I}, journal={Pattern Recognition Letters}, volume={32}, number={3}, pages={505--515}, year={2011}, publisher={Elsevier} } % Original ABACUS - Workaround implementation of mixture modeling for finding % arbitrary shapes % ----------------------------------------------------------------------------- @article{gegick2011abacus, title={ABACUS: mining arbitrary shaped clusters from large datasets based on backbone identification}, author={Gegick, M}, year={2011}, publisher={SIAM} } % Original Silhouette Index Paper % 
----------------------------------------------------------------------------- @article{rousseeuw1987silhouettes, title={Silhouettes: a graphical aid to the interpretation and validation of cluster analysis}, author={Rousseeuw, Peter J}, journal={Journal of computational and applied mathematics}, volume={20}, pages={53--65}, year={1987}, publisher={Elsevier} } % Extensive Comparative Study of IVMS % ----------------------------------------------------------------------------- @article{arbelaitz2013extensive, title={An extensive comparative study of cluster validity indices}, author={Arbelaitz, Olatz and Gurrutxaga, Ibai and Muguerza, Javier and P{\'e}rez, Jes{\'u}S M and Perona, I{\~n}Igo}, journal={Pattern Recognition}, volume={46}, number={1}, pages={243--256}, year={2013}, publisher={Elsevier} } % Graph Theory measures for Internal Cluster Validation % ----------------------------------------------------------------------------- @article{pal1997cluster, title={Cluster validation using graph theoretic concepts}, author={Pal, Nikhil R and Biswas, J}, journal={Pattern Recognition}, volume={30}, number={6}, pages={847--857}, year={1997}, publisher={Elsevier} } % Rankings of research papers by citation count; used for showing DBSCAN % popularity % ----------------------------------------------------------------------------- @misc{acade96:online, author = {{Microsoft Academic Search}}, title = {Top publications in data mining}, month = {}, year = {2016}, note = {(Accessed on 08/29/2016)} } @article{PyCluste54:online, doi = {10.21105/joss.01230}, url = {https://doi.org/10.21105/joss.01230}, year = {2019}, publisher = {The Open Journal}, volume = {4}, number = {36}, pages = {1230}, author = {Novikov, Andrei V.}, title = {PyClustering: Data Mining Library}, journal = {Journal of Open Source Software} } % Hartigans convex density estimation model % ----------------------------------------------------------------------------- @article{hartigan1987estimation, 
title={Estimation of a convex density contour in two dimensions}, author={Hartigan, JA}, journal={Journal of the American Statistical Association}, volume={82}, number={397}, pages={267--270}, year={1987}, publisher={Taylor \& Francis} } % Bentleys Original KDTree Paper % ----------------------------------------------------------------------------- @article{bentley1975multidimensional, title={Multidimensional binary search trees used for associative searching}, author={Bentley, Jon Louis}, journal={Communications of the ACM}, volume={18}, number={9}, pages={509--517}, year={1975}, publisher={ACM} } % Original CLARANS paper % ----------------------------------------------------------------------------- @article{ng2002clarans, title={CLARANS: A method for clustering objects for spatial data mining}, author={Ng, Raymond T. and Han, Jiawei}, journal={IEEE transactions on knowledge and data engineering}, volume={14}, number={5}, pages={1003--1016}, year={2002}, publisher={IEEE} } % Original DENCLUE paper % ----------------------------------------------------------------------------- @inproceedings{hinneburg1998efficient, title={An efficient approach to clustering in large multimedia databases with noise}, author={Hinneburg, Alexander and Keim, Daniel A}, booktitle={KDD}, volume={98}, pages={58--65}, year={1998} } % Original Chameleon Paper % ----------------------------------------------------------------------------- @article{karypis1999chameleon, title={Chameleon: Hierarchical clustering using dynamic modeling}, author={Karypis, George and Han, Eui-Hong and Kumar, Vipin}, journal={Computer}, volume={32}, number={8}, pages={68--75}, year={1999}, publisher={IEEE} } % Original CURE algorithm % ----------------------------------------------------------------------------- @inproceedings{guha1998cure, title={CURE: an efficient clustering algorithm for large databases}, author={Guha, Sudipto and Rastogi, Rajeev and Shim, Kyuseok}, booktitle={ACM SIGMOD Record}, volume={27}, 
number={2}, pages={73--84}, year={1998}, organization={ACM} } % R statistical computing language citation % ----------------------------------------------------------------------------- @article{team2013r, title={R: A language and environment for statistical computing}, author={Team, R Core and others}, year={2013}, publisher={Vienna, Austria} } % WEKA % ----------------------------------------------------------------------------- @article{hall2009weka, title={The WEKA data mining software: an update}, author={Hall, Mark and Frank, Eibe and Holmes, Geoffrey and Pfahringer, Bernhard and Reutemann, Peter and Witten, Ian H}, journal={ACM SIGKDD explorations newsletter}, volume={11}, number={1}, pages={10--18}, year={2009}, publisher={ACM} } % SPMF Java Machine Learning Library % ----------------------------------------------------------------------------- @article{fournier2014spmf, title={SPMF: a Java open-source pattern mining library.}, author={Fournier-Viger, Philippe and Gomariz, Antonio and Gueniche, Ted and Soltani, Azadeh and Wu, Cheng-Wei and Tseng, Vincent S and others}, journal={Journal of Machine Learning Research}, volume={15}, number={1}, pages={3389--3393}, year={2014} } % Python Scikit Learn % ----------------------------------------------------------------------------- @article{pedregosa2011scikit, title={Scikit-learn: Machine learning in Python}, author={Pedregosa, Fabian and Varoquaux, Ga{\"e}l and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and others}, journal={Journal of Machine Learning Research}, volume={12}, number={Oct}, pages={2825--2830}, year={2011} } % MATLAB TOMCAT Toolkit % ----------------------------------------------------------------------------- @article{daszykowski2007tomcat, title={TOMCAT: A MATLAB toolbox for multivariate calibration techniques}, author={Daszykowski, Micha{\l} and Serneels, Sven and Kaczmarek, 
Krzysztof and Van Espen, Piet and Croux, Christophe and Walczak, Beata}, journal={Chemometrics and intelligent laboratory systems}, volume={85}, number={2}, pages={269--277}, year={2007}, publisher={Elsevier} }

% OPTICS code for TOMCAT
% -----------------------------------------------------------------------------
@article{daszykowski2002looking, title={Looking for natural patterns in analytical data. 2. Tracing local density with OPTICS}, author={Daszykowski, Michael and Walczak, Beata and Massart, Desire L}, journal={Journal of chemical information and computer sciences}, volume={42}, number={3}, pages={500--507}, year={2002}, publisher={ACS Publications} }

% Java ML library
% -----------------------------------------------------------------------------
@article{abeel2009journal,
  title = {Java-ML: A Machine Learning Library},
  author = {Abeel, Thomas and Van de Peer, Yves and Saeys, Yvan},
  journal = {Journal of Machine Learning Research},
  volume = {10},
  pages = {931--934},
  year = {2009}
}

% ELKI
% -----------------------------------------------------------------------------
@article{DBLP:journals/pvldb/SchubertKEZSZ15, author = {Erich Schubert and Alexander Koos and Tobias Emrich and Andreas Z{\"{u}}fle and Klaus Arthur Schmid and Arthur Zimek}, title = {A Framework for Clustering Uncertain Data}, journal = {{PVLDB}}, volume = {8}, number = {12}, pages = {1976--1979}, year = {2015}, url = {http://www.vldb.org/pvldb/vol8/p1976-schubert.pdf}, timestamp = {Mon, 30 May 2016 12:01:10 +0200}, biburl = {http://dblp.uni-trier.de/rec/bib/journals/pvldb/SchubertKEZSZ15}, bibsource = {dblp computer science bibliography, http://dblp.org} }

% BIRCH CRAN records
% -----------------------------------------------------------------------------
@misc{CRANPack84:online, author={CRAN}, title = {CRAN - Package birch}, howpublished =
{\url{https://cran.r-project.org/web/packages/birch/index.html}}, month = {}, year = {2016}, note = {(Accessed on 09/16/2016)} } % Spectral Clustering % ---------------------------------------------------------------------------- @inproceedings{dhillon2004kernel, title={Kernel k-means: spectral clustering and normalized cuts}, author={Dhillon, Inderjit S and Guan, Yuqiang and Kulis, Brian}, booktitle={Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining}, pages={551--556}, year={2004}, organization={ACM} } % Disjoint-set data structure (2 citations) % ----------------------------------------------------------------------------- @misc{cormen2001introduction, title={Introduction to algorithms second edition}, author={Cormen, Thomas H and Leiserson, Charles E and Rivest, Ronald L and Stein, Clifford}, year={2001}, publisher={The MIT Press} } @inproceedings{patwary2010experiments, title={Experiments on union-find algorithms for the disjoint-set data structure}, author={Patwary, Md Mostofa Ali and Blair, Jean and Manne, Fredrik}, booktitle={International Symposium on Experimental Algorithms}, pages={411--423}, year={2010}, organization={Springer} } % SUBCLU high-dimensional density based clustering % ----------------------------------------------------------------------------- @inproceedings{kailing2004density, title={Density-connected subspace clustering for high-dimensional data}, author={Kailing, Karin and Kriegel, Hans-Peter and Kr{\"o}ger, Peer}, booktitle={Proc. 
SDM}, volume={4}, year={2004}, organization={SIAM} } % DBSCAN KDD Test of Time award % ----------------------------------------------------------------------------- @misc{SIGKDDNe30:online, author = {SIGKDD}, title = {SIGKDD News : 2014 SIGKDD Test of Time Award}, howpublished = {\url{https://www.kdd.org/News/view/2014-sigkdd-test-of-time-award}}, month = {}, year = {2014}, note = {(Accessed on 10/10/2016)} } % Raftery and Fraley's model-based clustering paper % ----------------------------------------------------------------------------- @article{fraley2002model, title={Model-based clustering, discriminant analysis, and density estimation}, author={Fraley, Chris and Raftery, Adrian E}, journal={Journal of the American statistical Association}, volume={97}, number={458}, pages={611--631}, year={2002}, publisher={Taylor \& Francis} } % FPC: Flexible Procedures for Clustering % ----------------------------------------------------------------------------- @Manual{fpc, title = {fpc: Flexible Procedures for Clustering}, author = {Christian Hennig}, year = {2015}, note = {R package version 2.1-10}, url = {https://CRAN.R-project.org/package=fpc}, } % From the ELKI Benchmarking page % ----------------------------------------------------------------------------- @article{kriegel2016black, title={The (black) art of runtime evaluation: Are we comparing algorithms or implementations?}, author={Kriegel, Hans-Peter and Schubert, Erich and Zimek, Arthur}, journal={Knowledge and Information Systems}, pages={1--38}, year={2016}, publisher={Springer} } % ANN Library % ----------------------------------------------------------------------------- @manual{mount1998ann, title={ANN: library for approximate nearest neighbour searching}, author={Mount, David M and Arya, Sunil}, year={2010}, url = {http://www.cs.umd.edu/~mount/ANN/}, } % Rcpp % ----------------------------------------------------------------------------- @article{eddelbuettel2011rcpp, title={Rcpp: Seamless R and C++ 
integration}, author={Eddelbuettel, Dirk and Fran{\c{c}}ois, Romain and Allaire, J and Chambers, John and Bates, Douglas and Ushey, Kevin}, journal={Journal of Statistical Software}, volume={40}, number={8}, pages={1--18}, year={2011} } % ST-DBCAN: SpatioTemporal DBSCAN % ----------------------------------------------------------------------------- @article{birant2007st, title={ST-DBSCAN: An algorithm for clustering spatial--temporal data}, author={Birant, Derya and Kut, Alp}, journal={Data \& Knowledge Engineering}, volume={60}, number={1}, pages={208--221}, year={2007}, publisher={Elsevier} } % DBSCAN History (small relative to actual number of extensions) % ----------------------------------------------------------------------------- @inproceedings{rehman2014dbscan, title={DBSCAN: Past, present and future}, author={Rehman, Saif Ur and Asghar, Sohail and Fong, Simon and Sarasvady, S}, booktitle={Applications of Digital Information and Web Technologies (ICADIWT), 2014 Fifth International Conference on the}, pages={232--238}, year={2014}, organization={IEEE} } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Miscellaneous % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% @article{Gupta2010, abstract = {A key application of clustering data obtained from sources such as microarrays, protein mass spectroscopy, and phylogenetic profiles is the detection of functionally related genes. Typically, only a small number of functionally related genes cluster into one or more groups, and the rest need to be ignored. For such situations, we present Automated Hierarchical Density Shaving (Auto-HDS), a framework that consists of a fast hierarchical density-based clustering algorithm and an unsupervised model selection strategy. Auto-HDS can automatically select clusters of different densities, present them in a compact hierarchy, and rank individual clusters using an innovative stability criteria. 
Our framework also provides a simple yet powerful 2D visualization of the hierarchy of clusters that is useful for further interactive exploration. We present results on Gasch and Lee microarray data sets to show the effectiveness of our methods. Additional results on other biological data are included in the supplemental material.}, author = {Gupta, Gunjan and Liu, Alexander and Ghosh, Joydeep}, doi = {10.1109/TCBB.2008.32}, file = {:Users/mpiekenbrock/ResearchLibrary/Automated Hierarchical Density Shaving- A Robust Automated Clustering and Visualization Framework for Large Biological Data Sets.pdf:pdf}, isbn = {1557-9964}, issn = {15455963}, journal = {IEEE/ACM Transactions on Computational Biology and Bioinformatics}, keywords = {Bioinformatics,Clustering,Data and knowledge visualization,Mining methods and algorithms}, number = {2}, pages = {223--237}, pmid = {20431143}, title = {{Automated hierarchical density shaving: A robust automated clustering and visualization framework for large biological data sets}}, volume = {7}, year = {2010} } @article{Ssets, author = {P. Fr\"anti and O. 
Virmajoki}, title = {Iterative shrinking method for clustering problems}, journal = {Pattern Recognition}, year = {2006}, volume = {39}, number = {5}, pages = {761--765} }

% Path and Spiral based
@article{chang2008robust, title={Robust path-based spectral clustering}, author={Chang, Hong and Yeung, Dit-Yan}, journal={Pattern Recognition}, volume={41}, number={1}, pages={191--203}, year={2008}, publisher={Elsevier} }

% Compound dataset
@article{zahn1971graph, title={Graph-theoretical methods for detecting and describing gestalt clusters}, author={Zahn, Charles T}, journal={IEEE Transactions on computers}, volume={100}, number={1}, pages={68--86}, year={1971}, publisher={IEEE} }

% Aggregation dataset
@article{gionis2007clustering, title={Clustering aggregation}, author={Gionis, Aristides and Mannila, Heikki and Tsaparas, Panayiotis}, journal={ACM Transactions on Knowledge Discovery from Data (TKDD)}, volume={1}, number={1}, pages={4}, year={2007}, publisher={ACM} }

% R15 dataset
@article{veenman2002maximum, title={A maximum variance cluster algorithm}, author={Veenman, Cor J. and Reinders, Marcel J. T. and Backer, Eric}, journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, volume={24}, number={9}, pages={1273--1280}, year={2002}, publisher={IEEE} }

@inproceedings{reilly2010detection, title={Detection and tracking of large number of targets in wide area surveillance}, author={Reilly, Vladimir and Idrees, Haroon and Shah, Mubarak}, booktitle={European Conference on Computer Vision}, pages={186--199}, year={2010}, organization={Springer} }

@inproceedings{jain2005law, title={Data clustering: a user’s dilemma}, author={Jain, Anil K and Law, Martin H. C.}, booktitle={Proceedings of the First international conference on Pattern Recognition and Machine Intelligence}, year={2005} }

@article{jain1999review, author = {Jain, A. K. and Murty, M. N. and Flynn, P. J.}, title = {Data Clustering: A Review}, journal = {ACM Computing Surveys}, issue_date = {Sept.
1999}, volume = {31}, number = {3}, month = sep, year = {1999}, issn = {0360-0300}, pages = {264--323}, numpages = {60}, url = {http://doi.acm.org/10.1145/331499.331504}, doi = {10.1145/331499.331504}, acmid = {331504}, publisher = {ACM}, address = {New York, NY, USA}, } % Flame data set @article{fu2007flame, title={FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data}, author={Fu, Limin and Medico, Enzo}, journal={BMC Bioinformatics}, volume={8}, number={1}, pages={1}, year={2007}, publisher={BioMed Central} } % Birch dataset @article{Birchsets, author = {T. Zhang and R. Ramakrishnan and M. Livny}, title = {BIRCH: A new data clustering algorithm and its applications}, journal = {Data Mining and Knowledge Discovery}, year = {1997}, volume = {1}, number = {2}, pages = {141--182} } @inproceedings{kisilevich2010p, title={P-DBSCAN: a density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos}, author={Kisilevich, Slava and Mansmann, Florian and Keim, Daniel}, booktitle={Proceedings of the 1st international conference and exhibition on computing for geospatial research \& application}, pages={38}, year={2010}, organization={ACM} } @inproceedings{celebi2005mining, title={Mining biomedical images with density-based clustering}, author={Celebi, M Emre and Aslandogan, Y Alp and Bergstresser, Paul R}, booktitle={International Conference on Information Technology: Coding and Computing (ITCC'05)-Volume II}, volume={1}, pages={163--168}, year={2005}, organization={IEEE} } @inproceedings{ertoz2003finding, title={Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data.}, author={Ert{\"o}z, Levent and Steinbach, Michael and Kumar, Vipin}, booktitle={SDM}, pages={47--58}, year={2003}, organization={SIAM} } @article{Chen2014, author = {Chen, W and Ji, M H and Wang, J M}, doi = {10.3991/ijoe.v10i6.3881}, file = 
{:Users/mpiekenbrock/ResearchLibrary/TDBSCAN.pdf:pdf}, issn = {18612121}, journal = {International Journal of Online Engineering}, keywords = {Density-based clustering,Personal travel trajectory,T-DBSCAN,Trip segmentation}, number = {6}, pages = {19--24}, title = {{T-DBSCAN: A spatiotemporal density clustering for GPS trajectory segmentation}}, volume = {10}, year = {2014} } @incollection{sander2011density, title={Density-based clustering}, author={Sander, Joerg}, booktitle={Encyclopedia of Machine Learning}, pages={270--273}, year={2011}, publisher={Springer} } % 88 citations @article{verma2012comparative, title={A comparative study of various clustering algorithms in data mining}, author={Verma, Manish and Srivastava, Mauly and Chack, Neha and Diswar, Atul Kumar and Gupta, Nidhi}, journal={International Journal of Engineering Research and Applications (IJERA)}, volume={2}, number={3}, pages={1379--1384}, year={2012} } @inproceedings{roy2005approach, title={An approach to find embedded clusters using density based techniques}, author={Roy, Swarup and Bhattacharyya, DK}, booktitle={International Conference on Distributed Computing and Internet Technology}, pages={523--535}, year={2005}, organization={Springer} } @inproceedings{chowdhury2010efficient, title={An efficient method for subjectively choosing parameter ‘k’automatically in VDBSCAN (Varied Density Based Spatial Clustering of Applications with Noise) algorithm}, author={Chowdhury, AK M Rasheduzzaman and Mollah, Md Elias and Rahman, Md Asikur}, booktitle={Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference on}, volume={1}, pages={38--41}, year={2010}, organization={IEEE} } @inproceedings{ghanbarpour2014exdbscan, title={EXDBSCAN: An extension of DBSCAN to detect clusters in multi-density datasets}, author={Ghanbarpour, Asieh and Minaei, Behrooz}, booktitle={Intelligent Systems (ICIS), 2014 Iranian Conference on}, pages={1--5}, year={2014}, organization={IEEE} } 
@inproceedings{vijayalakshmi2010improved, title={Improved varied density based spatial clustering algorithm with noise}, author={Vijayalakshmi, S and Punithavalli, M}, booktitle={Computational Intelligence and Computing Research (ICCIC), 2010 IEEE International Conference on}, pages={1--4}, year={2010}, organization={IEEE} } @article{Wang2013, author = {Wang, Wei}, file = {:Users/mpiekenbrock/Downloads/905067f5314e6073d4779c11572bd8c5.pdf:pdf}, isbn = {978-0-9891305-0-9}, keywords = {clustering algorithm,clustering techniques,data mining,derivative,global optimum k,similarity,similarity and minimizes intergroup,there are four basic,vdbscan}, pages = {225--228}, title = {{Improved VDBSCAN With Global Optimum K}}, year = {2013} } @article{parvez2012data, title={Data set property based ‘K’in VDBSCAN Clustering Algorithm}, author={Parvez, Abu Wahid Md Masud}, journal={World of Computer Science and Information Technology Journal (WCSIT)}, volume={2}, number={3}, pages={115--119}, year={2012} } @inproceedings{liu2007vdbscan, title={VDBSCAN: varied density based spatial clustering of applications with noise}, author={Liu, Peng and Zhou, Dong and Wu, Naijun}, booktitle={2007 International conference on service systems and service management}, pages={1--4}, year={2007}, organization={IEEE} } @article{pei2009decode, title={DECODE: a new method for discovering clusters of different densities in spatial data}, author={Pei, Tao and Jasra, Ajay and Hand, David J and Zhu, A-Xing and Zhou, Chenghu}, journal={Data Mining and Knowledge Discovery}, volume={18}, number={3}, pages={337--369}, year={2009}, publisher={Springer} } @article{duan2007local, title={A local-density based spatial clustering algorithm with noise}, author={Duan, Lian and Xu, Lida and Guo, Feng and Lee, Jun and Yan, Baopin}, journal={Information Systems}, volume={32}, number={7}, pages={978--986}, year={2007}, publisher={Elsevier} } @inproceedings{li2007traffic, title={Traffic density-based discovery of hot routes 
in road networks}, author={Li, Xiaolei and Han, Jiawei and Lee, Jae-Gil and Gonzalez, Hector}, booktitle={International Symposium on Spatial and Temporal Databases}, pages={441--459}, year={2007}, organization={Springer} } @article{tran2006knn, title={KNN-kernel density-based clustering for high-dimensional multivariate data}, author={Tran, Thanh N and Wehrens, Ron and Buydens, Lutgarde MC}, journal={Computational Statistics \& Data Analysis}, volume={51}, number={2}, pages={513--525}, year={2006}, publisher={Elsevier} } @inproceedings{jiang2003dhc, title={DHC: a density-based hierarchical clustering method for time series gene expression data}, author={Jiang, Daxin and Pei, Jian and Zhang, Aidong}, booktitle={Bioinformatics and Bioengineering, 2003. Proceedings. Third IEEE Symposium on}, pages={393--400}, year={2003}, organization={IEEE} } @inproceedings{kriegel2005density, title={Density-based clustering of uncertain data}, author={Kriegel, Hans-Peter and Pfeifle, Martin}, booktitle={Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining}, pages={672--677}, year={2005}, organization={ACM} } @book{agrawal1998automatic, title={Automatic subspace clustering of high dimensional data for data mining applications}, author={Agrawal, Rakesh and Gehrke, Johannes and Gunopulos, Dimitrios and Raghavan, Prabhakar}, volume={27}, number={2}, year={1998}, publisher={ACM} } @inproceedings{cao2006density, title={Density-Based Clustering over an Evolving Data Stream with Noise.}, author={Cao, Feng and Ester, Martin and Qian, Weining and Zhou, Aoying}, booktitle={SDM}, volume={6}, pages={328--339}, year={2006}, organization={SIAM} } @inproceedings{chen2007density, title={Density-based clustering for real-time stream data}, author={Chen, Yixin and Tu, Li}, booktitle={Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining}, pages={133--142}, year={2007}, organization={ACM} } 
@article{kriegel:2011, title={Density-based clustering}, author={Kriegel, Hans-Peter and Kr{\"o}ger, Peer and Sander, J{\"o}rg and Zimek, Arthur}, journal={WIREs Data Mining and Knowledge Discovery}, volume={1}, pages={231--240}, year={2011}, publisher={John Wiley \& Sons} }

@book{Aggarwal:2013, author = {Aggarwal, Charu C. and Reddy, Chandan K.}, title = {Data Clustering: Algorithms and Applications}, year = {2013}, isbn = {1466558210, 9781466558212}, edition = {1st}, publisher = {Chapman \& Hall/CRC}, }

@book{Kaufman:1990, title = "Finding groups in data : an introduction to cluster analysis", author = "Kaufman, Leonard and Rousseeuw, Peter J.", series = "Wiley series in probability and mathematical statistics", publisher = "Wiley", address = "New York", isbn = "0-471-87876-6", year = 1990 }

@ARTICLE{jarvis1973, author={Jarvis, R.A. and Patrick, E.A.}, journal={IEEE Transactions on Computers}, title={Clustering Using a Similarity Measure Based on Shared Near Neighbors}, year={1973}, volume={C-22}, number={11}, pages={1025-1034}, keywords={Clustering, nonparametric, pattern recognition, shared near neighbors, similarity measure.}, doi={10.1109/T-C.1973.223640} }

@inbook{erdoz2003, author = {Levent Ertöz and Michael Steinbach and Vipin Kumar}, title = {Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data}, booktitle = {Proceedings of the 2003 SIAM International Conference on Data Mining (SDM)}, year = {2003}, pages = {47-58}, doi = {10.1137/1.9781611972733.5} }

@inbook{moulavi2014, author = {Davoud Moulavi and Pablo A. Jaskowiak and Ricardo J. G. B.
Campello and Arthur Zimek and Jörg Sander}, title = {Density-Based Clustering Validation}, booktitle = {Proceedings of the 2014 SIAM International Conference on Data Mining (SDM)}, year = {2014}, pages = {839-847}, doi = {10.1137/1.9781611973440.96}, }


================================================
FILE: vignettes/hdbscan.Rmd
================================================
---
title: "HDBSCAN with the dbscan package"
author: "Matt Piekenbrock, Michael Hahsler"
vignette: >
  %\VignetteIndexEntry{Hierarchical DBSCAN (HDBSCAN) with the dbscan package}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
header-includes: \usepackage{animation}
output: html_document
---

The dbscan package [6] includes a fast implementation of Hierarchical DBSCAN (HDBSCAN) and its related algorithms for the R platform. This vignette introduces how to interface with these features. To understand how HDBSCAN works, we refer to an excellent Python notebook resource that goes over the basic concepts of the algorithm (see [the SciKit-learn docs](http://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html)). For the sake of simplicity, consider the same sample dataset from the notebook:

```{r}
library("dbscan")
data("moons")
plot(moons, pch=20)
```

To run the HDBSCAN algorithm, simply pass the dataset and the (single) parameter 'minPts' to the hdbscan function.

```{r}
cl <- hdbscan(moons, minPts = 5)
cl
```

The 'flat' results are stored in the 'cluster' member. Noise points are assigned the value 0, so we increment the labels by 1 for plotting.

```{r}
plot(moons, col=cl$cluster+1, pch=20)
```

The results match intuitive notions of what 'similar' clusters look like when they manifest in arbitrary shapes.

## Hierarchical DBSCAN

The resulting HDBSCAN object contains a hierarchical representation of every possible DBSCAN* clustering.
This hierarchical representation is compactly stored in the familiar 'hc' member of the resulting HDBSCAN object, in the same format as traditional hierarchical clustering objects created with the 'hclust' method from the stats package.

```{r}
cl$hc
```

Note that although this object can be used with any of the methods that work with 'hclust' objects, the distance HDBSCAN uses (the mutual reachability distance, see [2]) is _not_ an available method of the hclust function. This hierarchy, denoted the "HDBSCAN* hierarchy" in [3], can be visualized using the built-in plotting method from the stats package.

```{r}
plot(cl$hc, main="HDBSCAN* Hierarchy")
```

## DBSCAN\* vs. cutting the HDBSCAN\* tree

As the name implies, the fascinating thing about the HDBSCAN\* hierarchy is that any global 'cut' is equivalent to running DBSCAN\* (DBSCAN without border points) at the tree's cutting threshold $eps$ (assuming the same $minPts$ parameter setting was used). This can be verified manually: using a modified cut function that marks points whose core distance exceeds $eps$ as noise (label 0), since the stats cutree method _does not_ assign 0 to singletons, the results can be shown to be identical.
```{r}
cl <- hdbscan(moons, minPts = 5)
check <- rep(FALSE, nrow(moons)-1)
core_dist <- kNNdist(moons, k=5-1)

## cutree doesn't distinguish noise as 0, so we make a new method to do it manually
cut_tree <- function(hcl, eps, core_dist){
  cuts <- unname(cutree(hcl, h=eps))
  cuts[which(core_dist > eps)] <- 0 # Use core distance to distinguish noise
  cuts
}

eps_values <- sort(cl$hc$height, decreasing = TRUE)+.Machine$double.eps ## Machine eps for consistency between cuts
for (i in 1:length(eps_values)) {
  cut_cl <- cut_tree(cl$hc, eps_values[i], core_dist)
  dbscan_cl <- dbscan(moons, eps = eps_values[i], minPts = 5, borderPoints = FALSE) # DBSCAN* doesn't include border points

  ## Use run length encoding as an ID-independent way to check ordering
  check[i] <- (all.equal(rle(cut_cl)$lengths, rle(dbscan_cl$cluster)$lengths) == "TRUE")
}
print(all(check == TRUE))
```

## Simplified Tree

The HDBSCAN\* hierarchy is useful, but for larger datasets it can become overly cumbersome, since every data point is represented as a leaf somewhere in the hierarchy. The hdbscan object comes with a powerful visualization tool that plots the 'simplified' hierarchy (see [2] for more details), which shows __cluster-wide__ changes over an infinite number of $eps$ thresholds. It is the default visualization dispatched by the 'plot' method.

```{r}
plot(cl)
```

You can change the colors

```{r}
plot(cl, gradient = c("yellow", "orange", "red", "blue"))
```

... scale the widths for individual devices appropriately

```{r}
plot(cl, gradient = c("purple", "blue", "green", "yellow"), scale=1.5)
```

... and even outline the most 'stable' clusters reported in the flat solution

```{r}
plot(cl, gradient = c("purple", "blue", "green", "yellow"), show_flat = TRUE)
```

## Cluster Stability Scores

Note that the stability scores correspond to the labels on the condensed tree, but the cluster assignments in the 'cluster' member element do not correspond to the labels in the condensed tree.
Also, note that these scores represent the stability scores _before_ the traversal up the tree that updates the scores based on the children.

```{r}
print(cl$cluster_scores)
```

The individual point membership 'probabilities' are in the 'membership_prob' member element

```{r}
head(cl$membership_prob)
```

These can be used to show the 'degree of cluster membership' by, for example, plotting points with transparencies that correspond to their membership degrees.

```{r}
plot(moons, col=cl$cluster+1, pch=21)
colors <- mapply(function(col, i) adjustcolor(col, alpha.f = cl$membership_prob[i]),
                 palette()[cl$cluster+1], seq_along(cl$cluster))
points(moons, col=colors, pch=20)
```

## Global-Local Outlier Score from Hierarchies

A recent journal publication on HDBSCAN comes with a new outlier measure that computes an outlier score for each point in the data based on local _and_ global properties of the hierarchy, called the Global-Local Outlier Score from Hierarchies (GLOSH) [4]. An example is shown below where, unlike the membership probabilities, a point's opacity represents its degree of 'outlierness'. Traditionally, outliers are considered to be observations that deviate from the expected value of their presumed underlying distribution, where the deviation considered significant is determined by some statistical threshold value.

__Note:__ Because noise points (points that _are not_ assigned to any cluster) are explicitly considered in the definition of an outlier, the computed outlier scores are not simply inversely proportional to the membership probabilities.
```{r}
top_outliers <- order(cl$outlier_scores, decreasing = TRUE)[1:10]
colors <- mapply(function(col, i) adjustcolor(col, alpha.f = cl$outlier_scores[i]),
                 palette()[cl$cluster+1], seq_along(cl$cluster))
plot(moons, col=colors, pch=20)
text(moons[top_outliers, ], labels = top_outliers, pos=3)
```

## A Larger Clustering Example

A larger dataset can reveal the usefulness of HDBSCAN more explicitly. Consider the 'DS3' dataset, originally published as part of a benchmark test dataset for the Chameleon clustering algorithm [5]. The shapes in this dataset can be distinguished sufficiently well by a human; however, it is well known that many clustering algorithms fail to capture the intuitive structure.

```{r}
data("DS3")
plot(DS3, pch=20, cex=0.25)
```

With the single parameter minPts set to, say, 25, HDBSCAN finds 6 clusters.

```{r}
cl2 <- hdbscan(DS3, minPts = 25)
cl2
```

Marking the noise appropriately and highlighting points based on their 'membership probabilities' as before, a visualization of the cluster structure can be easily crafted.

```{r}
plot(DS3, col=cl2$cluster+1,
     pch=ifelse(cl2$cluster == 0, 8, 1),       # Mark noise as star
     cex=ifelse(cl2$cluster == 0, 0.5, 0.75),  # Decrease size of noise
     xlab=NA, ylab=NA)
colors <- sapply(1:length(cl2$cluster), function(i)
  adjustcolor(palette()[(cl2$cluster+1)[i]], alpha.f = cl2$membership_prob[i]))
points(DS3, col=colors, pch=20)
```

The simplified tree can be particularly useful for larger datasets

```{r}
plot(cl2, scale = 3, gradient = c("purple", "orange", "red"), show_flat = TRUE)
```

## Performance

All of the computationally and memory intensive tasks required by HDBSCAN were written in C++ using the Rcpp package. With DBSCAN, the performance depends on the parameter settings, primarily on the radius at which points are considered candidates for clustering ('eps'), and generally less so on the 'minPts' parameter. Intuitively, larger values of eps increase the computation time.
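Since supplying precomputed distances avoids recomputing them across runs, a dist object can be passed to hdbscan in place of the raw data. A minimal sketch (reusing the 'moons' data from above; the two calls should produce the same clustering):

```r
library("dbscan")

data("moons")
d <- dist(moons)                    # precompute the pairwise Euclidean distances once

cl_x <- hdbscan(moons, minPts = 5)  # computes the distances internally
cl_d <- hdbscan(d, minPts = 5)      # reuses the precomputed dist object

identical(cl_x$cluster, cl_d$cluster)
```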
One of the primary computational bottlenecks of HDBSCAN is the computation of the full (Euclidean) pairwise distances between all points, for which HDBSCAN currently relies on base R's 'dist' method. If a precomputed distance matrix is available, the running time of HDBSCAN can be moderately reduced.

## References

1. Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
2. Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Jörg Sander. "A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies." Data Mining and Knowledge Discovery 27, no. 3 (2013): 344-371.
3. Campello, Ricardo JGB, Davoud Moulavi, and Joerg Sander. "Density-based clustering based on hierarchical density estimates." In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 160-172. Springer Berlin Heidelberg, 2013.
4. Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Jörg Sander. "Hierarchical density estimates for data clustering, visualization, and outlier detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 10, no. 1 (2015): 5.
5. Karypis, George, Eui-Hong Han, and Vipin Kumar. "Chameleon: Hierarchical clustering using dynamic modeling." Computer 32, no. 8 (1999): 68-75.
6. Hahsler M, Piekenbrock M, Doran D (2019). "dbscan: Fast Density-Based Clustering with R." Journal of Statistical Software, 91(1), 1-30. doi: [10.18637/jss.v091.i01](https://doi.org/10.18637/jss.v091.i01)