Full Code of mhahsler/dbscan for AI

Note: this is a truncated preview (the full export is 1,008K characters); the file contents below cut off partway through the LICENSE file.
Repository: mhahsler/dbscan
Branch: master
Commit: 111f9bc6a376
Files: 154
Total size: 962.1 KB

Directory structure:
gitextract_jkl9o70t/

├── .Rbuildignore
├── .github/
│   └── .gitignore
├── .gitignore
├── DESCRIPTION
├── LICENSE
├── NAMESPACE
├── NEWS.md
├── R/
│   ├── AAA_dbscan-package.R
│   ├── AAA_definitions.R
│   ├── DBCV_datasets.R
│   ├── DS3.R
│   ├── GLOSH.R
│   ├── LOF.R
│   ├── NN.R
│   ├── RcppExports.R
│   ├── broom-dbscan-tidiers.R
│   ├── comps.R
│   ├── dbcv.R
│   ├── dbscan.R
│   ├── dendrogram.R
│   ├── extractFOSC.R
│   ├── frNN.R
│   ├── hdbscan.R
│   ├── hullplot.R
│   ├── jpclust.R
│   ├── kNN.R
│   ├── kNNdist.R
│   ├── moons.R
│   ├── ncluster.R
│   ├── nobs.R
│   ├── optics.R
│   ├── pointdensity.R
│   ├── predict.R
│   ├── reachability.R
│   ├── sNN.R
│   ├── sNNclust.R
│   ├── utils.R
│   └── zzz.R
├── README.Rmd
├── README.md
├── data/
│   ├── DS3.rdata
│   ├── Dataset_1.rda
│   ├── Dataset_2.rda
│   ├── Dataset_3.rda
│   ├── Dataset_4.rda
│   └── moons.rdata
├── data_src/
│   ├── data_DBCV/
│   │   ├── dataset_1.txt
│   │   ├── dataset_2.txt
│   │   ├── dataset_3.txt
│   │   ├── dataset_4.txt
│   │   ├── read_data.R
│   │   └── test_DBCV.R
│   └── data_chameleon/
│       └── read.R
├── dbscan.Rproj
├── inst/
│   └── CITATION
├── man/
│   ├── DBCV_datasets.Rd
│   ├── DS3.Rd
│   ├── NN.Rd
│   ├── comps.Rd
│   ├── dbcv.Rd
│   ├── dbscan-package.Rd
│   ├── dbscan.Rd
│   ├── dbscan_tidiers.Rd
│   ├── dendrogram.Rd
│   ├── extractFOSC.Rd
│   ├── frNN.Rd
│   ├── glosh.Rd
│   ├── hdbscan.Rd
│   ├── hullplot.Rd
│   ├── jpclust.Rd
│   ├── kNN.Rd
│   ├── kNNdist.Rd
│   ├── lof.Rd
│   ├── moons.Rd
│   ├── ncluster.Rd
│   ├── optics.Rd
│   ├── pointdensity.Rd
│   ├── reachability.Rd
│   ├── sNN.Rd
│   └── sNNclust.Rd
├── src/
│   ├── ANN/
│   │   ├── ANN.cpp
│   │   ├── ANN.h
│   │   ├── ANNperf.h
│   │   ├── ANNx.h
│   │   ├── Copyright.txt
│   │   ├── License.txt
│   │   ├── ReadMe.txt
│   │   ├── bd_fix_rad_search.cpp
│   │   ├── bd_pr_search.cpp
│   │   ├── bd_search.cpp
│   │   ├── bd_tree.cpp
│   │   ├── bd_tree.h
│   │   ├── brute.cpp
│   │   ├── kd_dump.cpp
│   │   ├── kd_fix_rad_search.cpp
│   │   ├── kd_fix_rad_search.h
│   │   ├── kd_pr_search.cpp
│   │   ├── kd_pr_search.h
│   │   ├── kd_search.cpp
│   │   ├── kd_search.h
│   │   ├── kd_split.cpp
│   │   ├── kd_split.h
│   │   ├── kd_tree.cpp
│   │   ├── kd_tree.h
│   │   ├── kd_util.cpp
│   │   ├── kd_util.h
│   │   ├── perf.cpp
│   │   ├── pr_queue.h
│   │   └── pr_queue_k.h
│   ├── JP.cpp
│   ├── Makevars
│   ├── RcppExports.cpp
│   ├── UnionFind.cpp
│   ├── UnionFind.h
│   ├── cleanup.cpp
│   ├── connectedComps.cpp
│   ├── dbcv.cpp
│   ├── dbscan.cpp
│   ├── dendrogram.cpp
│   ├── density.cpp
│   ├── frNN.cpp
│   ├── hdbscan.cpp
│   ├── kNN.cpp
│   ├── kNN.h
│   ├── lof.cpp
│   ├── lt.h
│   ├── mrd.cpp
│   ├── mst.cpp
│   ├── mst.h
│   ├── optics.cpp
│   ├── regionQuery.cpp
│   ├── regionQuery.h
│   ├── utilities.cpp
│   └── utilities.h
├── tests/
│   ├── testthat/
│   │   ├── fixtures/
│   │   │   ├── elki_optics.rda
│   │   │   ├── elki_optics_xi.rda
│   │   │   └── test_data.rda
│   │   ├── test-dbcv.R
│   │   ├── test-dbscan.R
│   │   ├── test-fosc.R
│   │   ├── test-frNN.R
│   │   ├── test-hdbscan.R
│   │   ├── test-kNN.R
│   │   ├── test-kNNdist.R
│   │   ├── test-lof.R
│   │   ├── test-mst.R
│   │   ├── test-optics.R
│   │   ├── test-opticsXi.R
│   │   ├── test-predict.R
│   │   └── test-sNN.R
│   └── testthat.R
└── vignettes/
    ├── dbscan.Rnw
    ├── dbscan.bib
    └── hdbscan.Rmd

================================================
FILE CONTENTS
================================================

================================================
FILE: .Rbuildignore
================================================
proj$
^\.Rproj\.user$
^cran-comments\.md$
^appveyor\.yml$
^revdep$
^.*\.o$
^.*\.Rproj$
^LICENSE
README.Rmd
data_src
ignore
^\.github$


================================================
FILE: .github/.gitignore
================================================
*.html


================================================
FILE: .gitignore
================================================
# Generated files 
*.o
*.so

# History files
.Rhistory
.Rapp.history
.RData
*.Rcheck


# Example code in package build process
*-Ex.R

# RStudio files
.Rproj.user/

# produced vignettes
vignettes/*.html
vignettes/*.pdf
.Rproj.user

# OS stuff 
.DS*

# Personal work directories 
Work
ignore
jss


================================================
FILE: DESCRIPTION
================================================
Package: dbscan
Title: Density-Based Spatial Clustering of Applications with Noise
    (DBSCAN) and Related Algorithms
Version: 1.2.4
Date: 2025-12-18
Authors@R: c(
    person("Michael", "Hahsler", email = "mhahsler@lyle.smu.edu", 
           role = c("aut", "cre", "cph"),
           comment = c(ORCID = "0000-0003-2716-1405")),
    person("Matthew", "Piekenbrock", role = c("aut", "cph")),
    person("Sunil", "Arya", role = c("ctb", "cph")),
    person("David", "Mount", role = c("ctb", "cph")),
    person("Claudia", "Malzer", role = "ctb")
  )
Description: A fast reimplementation of several density-based algorithms
    of the DBSCAN family. Includes the clustering algorithms DBSCAN
    (density-based spatial clustering of applications with noise) and
    HDBSCAN (hierarchical DBSCAN), the ordering algorithm OPTICS (ordering
    points to identify the clustering structure), shared nearest neighbor
    clustering, and the outlier detection algorithms LOF (local outlier
    factor) and GLOSH (global-local outlier score from hierarchies). The
    implementations use the kd-tree data structure (from library ANN) for
    faster k-nearest neighbor search. An R interface to fast kNN and
    fixed-radius NN search is also provided.  Hahsler, Piekenbrock and
    Doran (2019) <doi:10.18637/jss.v091.i01>.
License: GPL (>= 2)
URL: https://github.com/mhahsler/dbscan
BugReports: https://github.com/mhahsler/dbscan/issues
Depends:
    R (>= 3.2.0)
Imports:
    generics,
    graphics,
    Rcpp (>= 1.0.0),
    stats
Suggests:
    dendextend,
    fpc,
    igraph,
    knitr,
    microbenchmark,
    rmarkdown,
    testthat (>= 3.0.0),
    tibble
LinkingTo: 
    Rcpp
VignetteBuilder: 
    knitr
Config/testthat/edition: 3
Copyright: ANN library is copyright by University of Maryland, Sunil Arya
    and David Mount. All other code is copyright by Michael Hahsler and
    Matthew Piekenbrock.
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.3


================================================
FILE: LICENSE
================================================
                    GNU GENERAL PUBLIC LICENSE
                       Version 3, 29 June 2007

 Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The GNU General Public License is a free, copyleft license for
software and other kinds of works.

  The licenses for most software and other practical works are designed
to take away your freedom to share and change the works.  By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.  We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors.  You can apply it to
your programs, too.

  When we speak of free software, we are referring to freedom, not
price.  Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.

  To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights.  Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.

  For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received.  You must make sure that they, too, receive
or can get the source code.  And you must show them these terms so they
know their rights.

  Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.

  For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software.  For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.

  Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so.  This is fundamentally incompatible with the aim of
protecting users' freedom to change the software.  The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable.  Therefore, we
have designed this version of the GPL to prohibit the practice for those
products.  If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.

  Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary.  To prevent this, the GPL assures that
patents cannot be used to render the program non-free.

  The precise terms and conditions for copying, distribution and
modification follow.

                       TERMS AND CONDITIONS

  0. Definitions.

  "This License" refers to version 3 of the GNU General Public License.

  "Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.

  "The Program" refers to any copyrightable work licensed under this
License.  Each licensee is addressed as "you".  "Licensees" and
"recipients" may be individuals or organizations.

  To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy.  The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.

  A "covered work" means either the unmodified Program or a work based
on the Program.

  To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy.  Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.

  To "convey" a work means any kind of propagation that enables other
parties to make or receive copies.  Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.

  An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License.  If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.

  1. Source Code.

  The "source code" for a work means the preferred form of the work
for making modifications to it.  "Object code" means any non-source
form of a work.

  A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.

  The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form.  A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.

  The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities.  However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work.  For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.

  The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.

  The Corresponding Source for a work in source code form is that
same work.

  2. Basic Permissions.

  All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met.  This License explicitly affirms your unlimited
permission to run the unmodified Program.  The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work.  This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.

  You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force.  You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright.  Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.

  Conveying under any other circumstances is permitted solely under
the conditions stated below.  Sublicensing is not allowed; section 10
makes it unnecessary.

  3. Protecting Users' Legal Rights From Anti-Circumvention Law.

  No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.

  When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.

  4. Conveying Verbatim Copies.

  You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.

  You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.

  5. Conveying Modified Source Versions.

  You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:

    a) The work must carry prominent notices stating that you modified
    it, and giving a relevant date.

    b) The work must carry prominent notices stating that it is
    released under this License and any conditions added under section
    7.  This requirement modifies the requirement in section 4 to
    "keep intact all notices".

    c) You must license the entire work, as a whole, under this
    License to anyone who comes into possession of a copy.  This
    License will therefore apply, along with any applicable section 7
    additional terms, to the whole of the work, and all its parts,
    regardless of how they are packaged.  This License gives no
    permission to license the work in any other way, but it does not
    invalidate such permission if you have separately received it.

    d) If the work has interactive user interfaces, each must display
    Appropriate Legal Notices; however, if the Program has interactive
    interfaces that do not display Appropriate Legal Notices, your
    work need not make them do so.

  A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit.  Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.

  6. Conveying Non-Source Forms.

  You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:

    a) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by the
    Corresponding Source fixed on a durable physical medium
    customarily used for software interchange.

    b) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by a
    written offer, valid for at least three years and valid for as
    long as you offer spare parts or customer support for that product
    model, to give anyone who possesses the object code either (1) a
    copy of the Corresponding Source for all the software in the
    product that is covered by this License, on a durable physical
    medium customarily used for software interchange, for a price no
    more than your reasonable cost of physically performing this
    conveying of source, or (2) access to copy the
    Corresponding Source from a network server at no charge.

    c) Convey individual copies of the object code with a copy of the
    written offer to provide the Corresponding Source.  This
    alternative is allowed only occasionally and noncommercially, and
    only if you received the object code with such an offer, in accord
    with subsection 6b.

    d) Convey the object code by offering access from a designated
    place (gratis or for a charge), and offer equivalent access to the
    Corresponding Source in the same way through the same place at no
    further charge.  You need not require recipients to copy the
    Corresponding Source along with the object code.  If the place to
    copy the object code is a network server, the Corresponding Source
    may be on a different server (operated by you or a third party)
    that supports equivalent copying facilities, provided you maintain
    clear directions next to the object code saying where to find the
    Corresponding Source.  Regardless of what server hosts the
    Corresponding Source, you remain obligated to ensure that it is
    available for as long as needed to satisfy these requirements.

    e) Convey the object code using peer-to-peer transmission, provided
    you inform other peers where the object code and Corresponding
    Source of the work are being offered to the general public at no
    charge under subsection 6d.

  A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.

  A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling.  In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage.  For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product.  A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.

  "Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source.  The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.

  If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information.  But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).

  The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed.  Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.

  Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.

  7. Additional Terms.

  "Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law.  If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.

  When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it.  (Additional permissions may be written to require their own
removal in certain cases when you modify the work.)  You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.

  Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:

    a) Disclaiming warranty or limiting liability differently from the
    terms of sections 15 and 16 of this License; or

    b) Requiring preservation of specified reasonable legal notices or
    author attributions in that material or in the Appropriate Legal
    Notices displayed by works containing it; or

    c) Prohibiting misrepresentation of the origin of that material, or
    requiring that modified versions of such material be marked in
    reasonable ways as different from the original version; or

    d) Limiting the use for publicity purposes of names of licensors or
    authors of the material; or

    e) Declining to grant rights under trademark law for use of some
    trade names, trademarks, or service marks; or

    f) Requiring indemnification of licensors and authors of that
    material by anyone who conveys the material (or modified versions of
    it) with contractual assumptions of liability to the recipient, for
    any liability that these contractual assumptions directly impose on
    those licensors and authors.

  All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10.  If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term.  If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.

  If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.

  Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.

  8. Termination.

  You may not propagate or modify a covered work except as expressly
provided under this License.  Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).

  However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.

  Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

  Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License.  If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.

  9. Acceptance Not Required for Having Copies.

  You are not required to accept this License in order to receive or
run a copy of the Program.  Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance.  However,
nothing other than this License grants you permission to propagate or
modify any covered work.  These actions infringe copyright if you do
not accept this License.  Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.

  10. Automatic Licensing of Downstream Recipients.

  Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License.  You are not responsible
for enforcing compliance by third parties with this License.

  An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations.  If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.

  You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License.  For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.

  11. Patents.

  A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based.  The
work thus licensed is called the contributor's "contributor version".

  A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version.  For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.

  Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.

  In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement).  To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.

  If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients.  "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.

  If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.

  A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License.  You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.

  Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.

  12. No Surrender of Others' Freedom.

  If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.  If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all.  For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.

  13. Use with the GNU Affero General Public License.

  Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work.  The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.

  14. Revised Versions of this License.

  The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time.  Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

  Each version is given a distinguishing version number.  If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation.  If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.

  If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.

  Later license versions may give you additional or different
permissions.  However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.

  15. Disclaimer of Warranty.

  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

  16. Limitation of Liability.

  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.

  17. Interpretation of Sections 15 and 16.

  If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program.  It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    {one line to give the program's name and a brief idea of what it does.}
    Copyright (C) {year}  {name of author}

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

  If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:

    {project}  Copyright (C) {year}  {fullname}
    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License.  Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".

  You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<http://www.gnu.org/licenses/>.

  The GNU General Public License does not permit incorporating your program
into proprietary programs.  If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library.  If this is what you want to do, use the GNU Lesser General
Public License instead of this License.  But first, please read
<http://www.gnu.org/philosophy/why-not-lgpl.html>.



================================================
FILE: NAMESPACE
================================================
# Generated by roxygen2: do not edit by hand

S3method(adjacencylist,NN)
S3method(adjacencylist,frNN)
S3method(adjacencylist,kNN)
S3method(as.dendrogram,default)
S3method(as.dendrogram,hclust)
S3method(as.dendrogram,hdbscan)
S3method(as.dendrogram,optics)
S3method(as.dendrogram,reachability)
S3method(as.reachability,dendrogram)
S3method(as.reachability,optics)
S3method(augment,dbscan)
S3method(augment,general_clustering)
S3method(augment,hdbscan)
S3method(comps,dist)
S3method(comps,frNN)
S3method(comps,kNN)
S3method(comps,sNN)
S3method(glance,dbscan)
S3method(glance,general_clustering)
S3method(glance,hdbscan)
S3method(ncluster,default)
S3method(nnoise,default)
S3method(nobs,dbscan)
S3method(nobs,general_clustering)
S3method(nobs,hdbscan)
S3method(plot,NN)
S3method(plot,hdbscan)
S3method(plot,optics)
S3method(plot,reachability)
S3method(predict,dbscan_fast)
S3method(predict,hdbscan)
S3method(predict,optics)
S3method(print,dbscan_fast)
S3method(print,frNN)
S3method(print,general_clustering)
S3method(print,hdbscan)
S3method(print,kNN)
S3method(print,optics)
S3method(print,reachability)
S3method(print,sNN)
S3method(sort,NN)
S3method(sort,frNN)
S3method(sort,kNN)
S3method(sort,sNN)
S3method(tidy,dbscan)
S3method(tidy,general_clustering)
S3method(tidy,hdbscan)
export(adjacencylist)
export(as.dendrogram)
export(as.reachability)
export(augment)
export(clplot)
export(comps)
export(coredist)
export(dbcv)
export(dbscan)
export(extractDBSCAN)
export(extractFOSC)
export(extractXi)
export(frNN)
export(glance)
export(glosh)
export(hdbscan)
export(hullplot)
export(is.corepoint)
export(jpclust)
export(kNN)
export(kNNdist)
export(kNNdistplot)
export(lof)
export(mrdist)
export(ncluster)
export(nnoise)
export(optics)
export(pointdensity)
export(sNN)
export(sNNclust)
export(tidy)
import(Rcpp)
importFrom(generics,augment)
importFrom(generics,glance)
importFrom(generics,tidy)
importFrom(grDevices,adjustcolor)
importFrom(grDevices,chull)
importFrom(grDevices,palette)
importFrom(graphics,abline)
importFrom(graphics,lines)
importFrom(graphics,matplot)
importFrom(graphics,par)
importFrom(graphics,plot)
importFrom(graphics,points)
importFrom(graphics,polygon)
importFrom(graphics,segments)
importFrom(graphics,text)
importFrom(stats,as.dendrogram)
importFrom(stats,dendrapply)
importFrom(stats,dist)
importFrom(stats,hclust)
importFrom(stats,is.leaf)
importFrom(stats,nobs)
importFrom(stats,prcomp)
importFrom(stats,predict)
importFrom(utils,tail)
useDynLib(dbscan, .registration=TRUE)


================================================
FILE: NEWS.md
================================================
# dbscan 1.2.4 (2025-12-18)

## Bugfixes
* dbscan now checks for matrices with 0 rows or 0 columns
  (reported by maldridgeepa).
* Fixed license information for the ANN library header files (reported by 
  Charles Plessy).

# dbscan 1.2.3 (2025-08-20)

## Bugfixes
* plot.hdbscan gained parameters main, ylab, and leaflab (reported by nhward).

## Changes
* Fixed partial argument matches.

# dbscan 1.2.2 (2025-01-24)

## Changes
* Removed dependence on the bits/stdc++.h header. 

# dbscan 1.2.1 (2025-01-23)

## Changes
* Various refactoring by m-muecke

## New Features
* HDBSCAN gained parameter cluster_selection_epsilon to implement the 
  cluster selection method of Malzer and Baum (2020).
* Functions ncluster() and nnoise() were added.
* hullplot() now marks noise as x.
* Added clplot().
* pointdensity now also accepts a dist object as input and has the new type
  "gaussian" to calculate a Gaussian kernel estimate.
* Added the DBCV index.

## Bugfixes
* extractFOSC: Fixed total_score.
* Rewrote minimal spanning tree code.

# dbscan 1.2-0 (2024-06-28)

## New Features
* dbscan now has tidymodels tidiers (glance, tidy, augment).
* kNNdistplot can now plot a range of k/minPts values.
* added stats::nobs methods for the clusterings.
* kNN and frNN now contain the used distance metric.

## Changes
* dbscan component dist was renamed to metric. 
* Removed redundant sort in kNNdistplot (reported by Natasza Szczypien).
* Refactoring: use anyNA(x) instead of any(is.na(x)),
  and many more changes (by m-muecke).
* Reorganized the C++ source code.
* README now uses bibtex.
* Tests now use testthat edition 3 (m-muecke).

# dbscan 1.1-12 (2023-11-28)

## Bugfixes
* pointdensity now checks for missing values (reported by soelderer).
* Removed C++11 specification.
* ANN.cpp: fixed Rprintf warning.

# dbscan 1.1-11 (2022-10-26)

## New Features
* kNNdistplot gained parameter minPts.
* dbscan now retains information on distance method and border points.
* HDBSCAN now supports long vectors to work with larger distance matrices. 
* conversion from dist to kNN and frNN is now more memory efficient. It no longer 
  coerces the dist object into a matrix of double the size, but extracts the distances
  directly from the dist object.
* Better description of how predict uses only Euclidean distances and more error checking.
* The package now exports a new generic for as.dendrogram().

## Bugfixes
* is.corepoint() now uses the correct epsilon value (reported by Eng Aun).
* functions now check for cluster::dissimilarity objects, which have class dist 
  but are missing attributes.

# dbscan 1.1-10 (2022-01-14)

## New Features
* is.corepoint() for DBSCAN.
* coredist() and mrdist() for HDBSCAN.
* find connected components with comps().

## Changes
* reachability plot now shows all undefined distances as a dashed line.

## Bugfixes
* memory leak in mrd calculation fixed.

# dbscan 1.1-9 (2022-01-10)

## Changes
* We now use roxygen2.

## New Features
* Added predict for hdbscan (as suggested by moredatapls)

# dbscan 1.1-8 (2021-04-26)

## Bugfixes
* LOF: fixed numerical issues with k-nearest neighbor distance on Solaris.

# dbscan 1.1-7 (2021-04-21)

## Bugfixes
* Fixed description of k in kNNdistplot and added minPts argument.
* Fixed bug for tied distances in lof (reported by sverchkov).

## Changes
* lof: the density parameter was changed to minPts to be consistent with the original paper and dbscan. Note that minPts = k + 1.

# dbscan 1.1-6 (2021-02-24)

## Improvements 
* Improved speed of LOF for large ks (following suggestions by eduardokapp). 
* kNN: results are no longer re-sorted for kd-tree queries, which is much faster (by a factor of 10).
* ANN library: annclose() is now only called once when the package is unloaded. This is in preparation to support persistent kd-trees using external pointers.
* hdbscan lost parameter xdist.

## Bugfixes
* removed dependence on methods.
* fixed problem in hullplot for singleton clusters (reported by Fernando Archuby).
* GLOSH now also accepts data.frames.
* GLOSH now returns 0 instead of NaN if we have k duplicate points in the data.

# dbscan 1.1-5 (2019-10-22)

## New Features
* kNN and frNN gained parameter query to query neighbors for points not in the data.
* sNN gained parameter jp to decide if the shared NN should be counted using the definition by Jarvis and Patrick.


# dbscan 1.1-4 (2019-08-05)

## New Features
* kNNdist gained parameter all to indicate if a matrix with the distance to all 
  nearest neighbors up to k should be returned.

## Bugfixes
* kNNdist now correctly returns the distances to the kth neighbor 
  (reported by zschuster).
* dbscan: check eps and minPts parameters to avoid undefined results (reported by ArthurPERE).


# dbscan 1.1-3 (2018-11-12)

## Bugfixes
* pointdensity was double counting the query point (reported by Marius Hofert).

# dbscan 1.1-2 (2018-05-18)

## New Features
* OPTICS now calculates eps if it is omitted.

## Bugfixes
* Example now only uses igraph conditionally since it is unavailable 
  on Solaris (reported by B. Ripley).

# dbscan 1.1-1 (2017-03-19)

## Bugfixes

* Fixed problem with constant name on Solaris in ANN code (reported by B. Ripley).

# dbscan 1.1-0 (2017-03-18)

## New Features

* HDBSCAN was added.
* extractFOSC (optimal selection of clusters for HDBSCAN) was added.
* GLOSH outlier score was added.
* hullplot now uses filled polygons as the default.
* hullplot now uses PCA if the data has more than 2 dimensions.
* Added NN superclass for kNN and frNN with plot() and adjacencylist() methods.
* Added shared nearest neighbor clustering as sNNclust() and sNN to calculate
  the number of shared nearest neighbors.
* Added pointdensity function.
* Unsorted kNN and frNN can now be sorted using sort().
* kNN and frNN now also accept kNN and frNN objects, respectively. This can 
  be used to create a new kNN (frNN) with a reduced k or eps.
* Datasets added: DS3 and moons.

## Interface Changes

* Improved interface for dbscan() and optics(): ... is now passed on to frNN.
* OPTICS clustering extraction methods are now called extractDBSCAN and 
  extractXi.
* kNN and frNN are now objects with a print function.
* dbscan now also accepts a frNN object as input.
* jpclust and sNNclust now return a list instead of just the 
  cluster assignments.

# dbscan 1.0-0 (2017-02-02)

## New Features

* The package has now a vignette.
* Jarvis-Patrick clustering is now available as jpclust().
* Improved interface for dbscan() and optics(): ... is now passed on to frNN.
* OPTICS clustering extraction methods are now called extractDBSCAN and 
  extractXi.
* hullplot now uses filled polygons as the default.
* hullplot now uses PCA if the data has more than 2 dimensions.
* kNN and frNN are now objects with a print function.
* dbscan now also accepts a frNN object as input.


# dbscan 0.9-8 (2016-08-05)

## New Features

* Added hullplot to plot a scatter plot with added convex cluster hulls.
* OPTICS: added a predecessor correction step that is used by 
    the ELKI implementation (Matt Piekenbrock).  

## Bugfixes

* Fixed a memory problem in frNN (reported by Yilei He).

# dbscan 0.9-7 (2016-04-14)

* OPTICSXi is now implemented (thanks to Matt Piekenbrock).
* DBSCAN now also accepts MinPts (with a capital M) to be
    compatible with the fpc version.
* DBSCAN objects are now also of class dbscan_fast to avoid clashes with fpc.
* DBSCAN and OPTICS have now predict functions.
* Added test for unhandled NAs.
* Fixed LOF for more than k duplicate points (reported by Samneet Singh).

# dbscan 0.9-6 (2015-12-14)

* OPTICS: fixed second bug reported by Di Pang
* all methods now also accept dist objects and have a search
    method "dist" which precomputes distances.

# dbscan 0.9-5 (2015-10-04)

* OPTICS: fixed bug with first observation reported by Di Pang
* OPTICS: clusterings can now be extracted using optics_cut

# dbscan 0.9-4 (2015-09-17)

* added tests (testthat).
* input data is now checked to ensure it can safely be coerced into a
    numeric matrix (storage.mode double).
* fixed self matches in kNN and frNN (now returns the first NN correctly).

# dbscan 0.9-3 (2015-09-02)

* Added weights to DBSCAN.

# dbscan 0.9-2 (2015-08-11)

* Added kNN interface.
* Added frNN (fixed radius NN) interface.
* Added LOF.
* Added OPTICS.
* All algorithms now check for interrupts (CTRL-C/Esc).
* DBSCAN now returns a list instead of a numeric vector.

# dbscan 0.9-1 (2015-07-21)

* DBSCAN: Improved speed by avoiding repeated sorting of point ids.
* Added linear NN search option.
* Added fast calculation for kNN distance.
* fpc and microbenchmark are now used conditionally in the examples.

# dbscan 0.9-0 (2015-07-15)

* initial release


================================================
FILE: R/AAA_dbscan-package.R
================================================
#' @keywords internal
#'
#' @section Key functions:
#' - Clustering: [dbscan()], [hdbscan()], [optics()], [jpclust()], [sNNclust()]
#' - Outliers: [lof()], [glosh()], [pointdensity()]
#' - Nearest Neighbors: [kNN()], [frNN()], [sNN()]
#'
#' @references
#' Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software, 91(1), 1-30. \doi{10.18637/jss.v091.i01}
#'
#' @import Rcpp
#' @importFrom graphics plot points lines text abline polygon par segments matplot
#' @importFrom grDevices palette chull adjustcolor
#' @importFrom stats dist hclust dendrapply as.dendrogram is.leaf prcomp
#' @importFrom utils tail
#'
#' @useDynLib dbscan, .registration=TRUE
"_PACKAGE"


================================================
FILE: R/AAA_definitions.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

.ANNsplitRule <- c("STD", "MIDPT", "FAIR", "SL_MIDPT", "SL_FAIR", "SUGGEST")

.matrixlike <- function(x) {
  if (is.null(dim(x)))
    return(FALSE)

  # check that there is at least one row and one column!
  if (nrow(x) < 1L) stop("the provided data has 0 rows!")
  if (ncol(x) < 1L) stop("the provided data has 0 columns!")

  TRUE
}
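
# As a quick sketch of how this helper behaves (.matrixlike is internal, so
# outside the package it would have to be reached via dbscan:::.matrixlike;
# the calls below are illustrative):
#
#   .matrixlike(1:10)                          # FALSE: a plain vector has no dim()
#   .matrixlike(matrix(0, nrow = 2, ncol = 3)) # TRUE
#   .matrixlike(data.frame(x = 1:3, y = 4:6))  # TRUE: data.frames have dim()
#   try(.matrixlike(matrix(numeric(0), nrow = 0, ncol = 3))) # error: 0 rows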


================================================
FILE: R/DBCV_datasets.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' DBCV Paper Datasets
#'
#' The four synthetic 2D datasets used in Moulavi et al (2014).
#'
#' @name DBCV_datasets
#' @aliases Dataset_1 Dataset_2 Dataset_3 Dataset_4
#' @docType data
#' @format Four data frames with the following 3 variables.
#' \describe{
#' \item{x}{a numeric vector}
#' \item{y}{a numeric vector}
#' \item{class}{an integer vector indicating the class label. 0 means noise.} }
#' @references Davoud Moulavi and Pablo A. Jaskowiak and
#' Ricardo J. G. B. Campello and Arthur Zimek and Jörg Sander (2014).
#' Density-Based Clustering Validation. In
#' _Proceedings of the 2014 SIAM International Conference on Data Mining,_
#' pages 839-847
#' \doi{10.1137/1.9781611973440.96}
#' @source \url{https://github.com/pajaskowiak/dbcv}
#' @keywords datasets
#' @examples
#' data("Dataset_1")
#' clplot(Dataset_1[, c("x", "y")], cl = Dataset_1$class)
#'
#' data("Dataset_2")
#' clplot(Dataset_2[, c("x", "y")], cl = Dataset_2$class)
#'
#' data("Dataset_3")
#' clplot(Dataset_3[, c("x", "y")], cl = Dataset_3$class)
#'
#' data("Dataset_4")
#' clplot(Dataset_4[, c("x", "y")], cl = Dataset_4$class)
NULL





================================================
FILE: R/DS3.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.


#' DS3: Spatial data with arbitrary shapes
#'
#' Contains 8000 2-d points, with 6 "natural" looking shapes and a
#' sinusoid-like curve that intersects each cluster.
#' The data set was originally used as a benchmark data set for the Chameleon clustering
#' algorithm (Karypis, Han and Kumar, 1999) to
#' illustrate a data set containing arbitrarily shaped
#' spatial data surrounded by both noise and artifacts.
#'
#' @name DS3
#' @docType data
#' @format A data.frame with 8000 observations on the following 2 columns:
#' \describe{
#'   \item{X}{a numeric vector}
#'   \item{Y}{a numeric vector}
#' }
#'
#' @references Karypis, George, Eui-Hong Han, and Vipin Kumar (1999).
#' Chameleon: Hierarchical clustering using dynamic modeling. _Computer_
#' 32(8): 68-75.
#' @source Obtained from \url{http://cs.joensuu.fi/sipu/datasets/}
#' @keywords datasets
#' @examples
#' data(DS3)
#' plot(DS3, pch = 20, cex = 0.25)
NULL


================================================
FILE: R/GLOSH.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler, Matthew Piekenbrock

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Global-Local Outlier Score from Hierarchies
#'
#' Calculate the Global-Local Outlier Score from Hierarchies (GLOSH) score for
#' each data point using a kd-tree to speed up kNN search.
#'
#' GLOSH compares the density of a point to the densities of points associated
#' with the current and child clusters (if any). Points that have a substantially
#' lower density than the density mode (cluster) they most associate with are
#' considered outliers. GLOSH is computed from a hierarchy of clusters.
#'
#' Specifically, consider a point \emph{x} and a density or distance threshold
#' \emph{lambda}. GLOSH is calculated by taking 1 minus the ratio of how long
#' any of the child clusters of the cluster \emph{x} belongs to "survives"
#' changes in \emph{lambda} to the highest \emph{lambda} threshold of x, above
#' which x becomes a noise point.
#'
#' Scores close to 1 indicate outliers. For more details on the motivation for
#' this calculation, see Campello et al (2015).
#'
#' @aliases glosh GLOSH
#' @family Outlier Detection Functions
#'
#' @param x an [hclust] object, data matrix, or [dist] object.
#' @param k size of the neighborhood.
#' @param ... further arguments are passed on to [kNN()].
#' @return A numeric vector of length equal to the size of the original data
#' set containing GLOSH values for all data points.
#' @author Matt Piekenbrock
#'
#' @references Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Joerg
#' Sander. Hierarchical density estimates for data clustering, visualization,
#' and outlier detection. _ACM Transactions on Knowledge Discovery from Data
#' (TKDD)_ 10, no. 1 (2015).
#' \doi{10.1145/2733381}
#' @keywords model
#' @examples
#' set.seed(665544)
#' n <- 100
#' x <- cbind(
#'   x=runif(10, 0, 5) + rnorm(n, sd = 0.4),
#'   y=runif(10, 0, 5) + rnorm(n, sd = 0.4)
#'   )
#'
#' ### calculate GLOSH score
#' glosh <- glosh(x, k = 3)
#'
#' ### distribution of outlier scores
#' summary(glosh)
#' hist(glosh, breaks = 10)
#'
#' ### simple function to plot points with size proportional to the GLOSH score
#' plot_glosh <- function(x, glosh){
#'   plot(x, pch = ".", main = "GLOSH (k = 3)")
#'   points(x, cex = glosh*3, pch = 1, col = "red")
#'   text(x[glosh > 0.80, ], labels = round(glosh, 3)[glosh > 0.80], pos = 3)
#' }
#' plot_glosh(x, glosh)
#'
#' ### GLOSH with any hierarchy
#' x_dist <- dist(x)
#' x_sl <- hclust(x_dist, method = "single")
#' x_upgma <- hclust(x_dist, method = "average")
#' x_ward <- hclust(x_dist, method = "ward.D2")
#'
#' ## Compare what different linkage criterion consider as outliers
#' glosh_sl <- glosh(x_sl, k = 3)
#' plot_glosh(x, glosh_sl)
#'
#' glosh_upgma <- glosh(x_upgma, k = 3)
#' plot_glosh(x, glosh_upgma)
#'
#' glosh_ward <- glosh(x_ward, k = 3)
#' plot_glosh(x, glosh_ward)
#'
#' ## GLOSH is automatically computed with HDBSCAN
#' all(hdbscan(x, minPts = 3)$outlier_scores == glosh(x, k = 3))
#' @export
glosh <- function(x, k = 4, ...) {
  if (inherits(x, "data.frame"))
    x <- as.matrix(x)

  # get n
  if (inherits(x, "dist") || inherits(x, "matrix")) {
    if (inherits(x, "dist"))
      n <- attr(x, "Size")
    else
      n <- nrow(x)
    # get k nearest neighbors + distances
    d <- kNN(x, k - 1, ...)
    x_dist <-
      if (inherits(x, "dist"))
        x
      else
        dist(x, method = "euclidean") # copy since mrd changes by reference!

    .check_dist(x_dist)
    mrd <- mrd(x_dist, d$dist[, k - 1])

    # need to assemble hclust object manually
    mst <- mst(mrd, n)
    hc <- hclustMergeOrder(mst, order(mst[, 3]))
  } else if (inherits(x, "hclust")) {
    hc <- x
    n <- nrow(hc$merge) + 1
  }
  else
    stop("x needs to be a matrix, dist, or hclust object!")

  if (k < 2 || k >= n)
    stop("k has to be larger than 1 and smaller than the number of points")

  res <- computeStability(hc, k, compute_glosh = TRUE)

  # return
  attr(res, "glosh")
}
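
`glosh()` above builds its hierarchy from mutual reachability distances (`mrd()` and `mst()` are compiled C++ helpers). For readers unfamiliar with the transform, here is a minimal Python sketch of the mutual reachability distance (hypothetical helper names, 0-based indices; not the package's implementation):

```python
import math

def core_dists(points, k):
    """Distance from each point to its k-th nearest neighbor,
    counting the point itself (minPts-style semantics)."""
    cd = []
    for p in points:
        d = sorted(math.dist(p, q) for q in points)
        cd.append(d[k - 1])  # d[0] == 0.0 is the self-distance
    return cd

def mutual_reachability(points, k):
    """Matrix of mrd(a, b) = max(core(a), core(b), d(a, b))."""
    cd = core_dists(points, k)
    n = len(points)
    return [[max(cd[i], cd[j], math.dist(points[i], points[j]))
             for j in range(n)] for i in range(n)]

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
m = mutual_reachability(pts, k=2)
```

A single-linkage hierarchy over these distances (as built by `mst()` and `hclustMergeOrder()` above) is what `computeStability()` then consumes.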


================================================
FILE: R/LOF.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.


#' Local Outlier Factor Score
#'
#' Calculate the Local Outlier Factor (LOF) score for each data point using a
#' kd-tree to speed up kNN search.
#'
#' LOF compares the local reachability density (lrd) of a point to the lrd of
#' its neighbors. A LOF score of approximately 1 indicates that the lrd around
#' the point is comparable to the lrd of its neighbors and that the point is
#' not an outlier. Points that have a substantially lower lrd than their
#' neighbors are considered outliers and produce scores significantly larger
#' than 1.
#'
#' If a data matrix is specified, then Euclidean distances and fast nearest
#' neighbor search using a kd-tree is used.
#'
#' **Note on duplicate points:** If there are more than `minPts`
#' duplicates of a point in the data, then the local reachability distance
#' will be 0, resulting in an undefined LOF score of 0/0. We set LOF in this
#' case to 1 since there is already enough density from the points in the same
#' location to make them not outliers. The original paper by Breunig et al.
#' (2000) assumes that the points are real duplicates and suggests removing
#' the duplicates before computing LOF. If duplicate points are removed first,
#' then this LOF implementation in \pkg{dbscan} behaves like the one described
#' by Breunig et al.
#'
#' @aliases lof LOF
#' @family Outlier Detection Functions
#'
#' @param x a data matrix or a [dist] object.
#' @param minPts number of nearest neighbors used in defining the local
#' neighborhood of a point (includes the point itself).
#' @param ... further arguments are passed on to [kNN()].
#' Note: `sort` cannot be specified here since `lof()`
#' always uses `sort = TRUE`.
#'
#' @return A numeric vector of length `nrow(x)` containing LOF values for
#' all data points.
#'
#' @author Michael Hahsler
#' @references Breunig, M., Kriegel, H., Ng, R., and Sander, J. (2000). LOF:
#' identifying density-based local outliers. In _ACM Int. Conf. on
#' Management of Data,_ pages 93-104.
#' \doi{10.1145/335191.335388}
#' @keywords model
#' @examples
#' set.seed(665544)
#' n <- 100
#' x <- cbind(
#'   x=runif(10, 0, 5) + rnorm(n, sd = 0.4),
#'   y=runif(10, 0, 5) + rnorm(n, sd = 0.4)
#'   )
#'
#' ### calculate LOF score with a neighborhood of 3 points
#' lof <- lof(x, minPts = 3)
#'
#' ### distribution of outlier factors
#' summary(lof)
#' hist(lof, breaks = 10, main = "LOF (minPts = 3)")
#'
#' ### plot sorted lof. Looks like outliers start around a LOF of 2.
#' plot(sort(lof), type = "l",  main = "LOF (minPts = 3)",
#'   xlab = "Points sorted by LOF", ylab = "LOF")
#'
#' ### point size is proportional to LOF and mark points with a LOF > 2
#' plot(x, pch = ".", main = "LOF (minPts = 3)", asp = 1)
#' points(x, cex = (lof - 1) * 2, pch = 1, col = "red")
#' text(x[lof > 2,], labels = round(lof, 1)[lof > 2], pos = 3)
#' @export
lof <- function(x, minPts = 5, ...) {
  ### parse extra parameters
  extra <- list(...)

  # check for deprecated k
  if (!is.null(extra[["k"]])) {
    minPts <- extra[["k"]] + 1
    extra[["k"]] <- NULL
    warning("lof: k is now deprecated. Use minPts = ", minPts, " instead.")
  }

  args <- c("search", "bucketSize", "splitRule", "approx")
  m <- pmatch(names(extra), args)
  if (anyNA(m))
    stop("Unknown parameter: ",
      toString(names(extra)[is.na(m)]))
  names(extra) <- args[m]

  search <- extra$search %||% "kdtree"
  search <- .parse_search(search)
  splitRule <- extra$splitRule %||% "suggest"
  splitRule <- .parse_splitRule(splitRule)
  bucketSize <- if (is.null(extra$bucketSize))
    10L
  else
    as.integer(extra$bucketSize)
  approx <- if (is.null(extra$approx))
    0
  else
    as.double(extra$approx)

  ### precompute distance matrix for dist search
  if (search == 3 && !inherits(x, "dist")) {
    if (.matrixlike(x))
      x <- dist(x)
    else
      stop("x needs to be a matrix to calculate distances")
  }

  # get and check n
  if (inherits(x, "dist"))
    n <- attr(x, "Size")
  else
    n <- nrow(x)
  if (is.null(n))
    stop("x needs to be a matrix or a dist object!")
  if (minPts < 2 || minPts > n)
    stop("minPts has to be at least 2 and not larger than the number of points")


  ### get LOF from a dist object
  if (inherits(x, "dist")) {
    if (anyNA(x))
      stop("NAs not allowed in dist for LOF!")

    # find k-NN distance, ids and distances
    x <- as.matrix(x)
    diag(x) <- Inf ### no self-matches
    o <- t(apply(x, 1, order, decreasing = FALSE))
    k_dist <- x[cbind(o[, minPts - 1], seq_len(n))]
    ids <-
      lapply(
        seq_len(n),
        FUN = function(i)
          which(x[i,] <= k_dist[i])
      )
    dist <-
      lapply(
        seq_len(n),
        FUN = function(i)
          x[i, x[i,] <= k_dist[i]]
      )

    ret <- list(k_dist = k_dist,
      ids = ids,
      dist = dist)

  } else{
    ### Use kd-tree

    if (anyNA(x))
      stop("NAs not allowed for LOF using kdtree!")

    ret <- lof_kNN(
      as.matrix(x),
      as.integer(minPts),
      as.integer(search),
      as.integer(bucketSize),
      as.integer(splitRule),
      as.double(approx)
    )
  }

  # calculate local reachability density (LRD)
  # reachability-distance_k(A,B) = max{k-distance(B), d(A,B)}
  # lrdk(A) = 1/(sum_B \in N_k(A) reachability-distance_k(A, B) / |N_k(A)|)
  lrd <- numeric(n)
  for (A in seq_len(n)) {
    Bs <- ret$ids[[A]]
    lrd[A] <-
      1 / (sum(pmax.int(ret$k_dist[Bs], ret$dist[[A]])) / length(Bs))
  }

  # calculate local outlier factor (LOF)
  # LOF_k(A) = sum_B \in N_k(A) lrd_k(B)/(|N_k(A)| lrdk(A))
  lof <- numeric(n)
  for (A in seq_len(n)) {
    Bs <- ret$ids[[A]]
    lof[A] <- sum(lrd[Bs]) / length(Bs) / lrd[A]
  }

  # with more than k duplicates lrd can become infinity
  # we define them not to be outliers
  lof[is.nan(lof)] <- 1

  lof
}
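
The two loops above implement the standard lrd/LOF formulas from Breunig et al. (2000). As a cross-check, the same computation can be sketched in a few lines of Python (illustrative only: it builds a full distance matrix instead of using a kd-tree, uses 0-based indices, and does not handle the duplicate-point case where lrd becomes infinite; `lof_scores` is a made-up name, not part of the package):

```python
import math

def lof_scores(points, minPts):
    """Naive LOF over a full distance matrix."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]
    k = minPts - 1                      # neighbors excluding the point itself
    k_dist, nbrs = [], []
    for i in range(n):
        others = sorted(d for j, d in enumerate(dist[i]) if j != i)
        kd = others[k - 1]              # k-distance of point i
        k_dist.append(kd)
        # all points within the k-distance (ties handled like in lof())
        nbrs.append([j for j in range(n) if j != i and dist[i][j] <= kd])
    # local reachability density:
    # lrd(A) = |N(A)| / sum_{B in N(A)} max(k_dist(B), d(A, B))
    lrd = []
    for a in range(n):
        reach = [max(k_dist[b], dist[a][b]) for b in nbrs[a]]
        lrd.append(len(nbrs[a]) / sum(reach))
    # LOF(A) = mean_{B in N(A)} lrd(B) / lrd(A)
    return [sum(lrd[b] for b in nbrs[a]) / len(nbrs[a]) / lrd[a]
            for a in range(n)]
```

On a tight cluster with one far-away point, the cluster points score close to 1 and the isolated point scores well above 1, matching the interpretation described in the documentation above.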


================================================
FILE: R/NN.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' NN --- Nearest Neighbors Superclass
#'
#' NN is an abstract S3 superclass for the classes of the objects returned
#' by [kNN()], [frNN()] and [sNN()]. Methods for sorting, plotting and getting an
#' adjacency list are defined.
#'
#' @name NN
#' @aliases NN
#' @family NN functions
#'
#' @param x a `NN` object
#' @param pch plotting character.
#' @param col color used for the data points (nodes).
#' @param linecol color used for edges.
#' @param ... further parameters passed on to [plot()].
#' @param decreasing sort in decreasing order?
#' @param data the data that was used to create `x`
#' @param main title
#'
#' @section Subclasses:
#' [kNN], [frNN] and [sNN]
#'
#' @author Michael Hahsler
#' @keywords model
#' @examples
#' data(iris)
#' x <- iris[, -5]
#'
#' # finding kNN directly in data (using a kd-tree)
#' nn <- kNN(x, k=5)
#' nn
#'
#' # plot the kNN where NN are shown as lines connecting points.
#' plot(nn, x)
#'
#' # show the first few elements of the adjacency list
#' head(adjacencylist(nn))
#'
#' \dontrun{
#' # create a graph and find connected components (if igraph is installed)
#' library("igraph")
#' g <- graph_from_adj_list(adjacencylist(nn))
#' comp <- components(g)
#' plot(x, col = comp$membership)
#'
#' # detect clusters (communities) with the label propagation algorithm
#' cl <- membership(cluster_label_prop(g))
#' plot(x, col = cl)
#' }
NULL

#' @rdname NN
#' @export
adjacencylist <- function (x, ...)
  UseMethod("adjacencylist", x)

#' @rdname NN
#' @export
adjacencylist.NN <- function (x, ...) {
  stop("needs to be implemented by a subclass")
  }

#' @rdname NN
#' @export
sort.NN <- function(x, decreasing = FALSE, ...) {
  stop("needs to be implemented by a subclass")
  }


#' @rdname NN
#' @export
plot.NN <- function(x, data, main = NULL, pch = 16, col = NULL, linecol = "gray", ...) {
  if (is.null(main)) {
    if (inherits(x, "frNN"))
      main <- paste0("frNN graph (eps = ", x$eps, ")")
    if (inherits(x, "kNN"))
      main <- paste0(x$k, "-NN graph")
    if (inherits(x, "sNN"))
      main <- paste0("Shared NN graph (k=", x$k,
        ifelse(is.null(x$kt), "", paste0(", kt=", x$kt)), ")")
  }

  ## create an empty plot
  plot(data[, 1:2], main = main, type = "n", pch = pch, col = col, ...)

  id <- adjacencylist(x)

  ## use lines if it is from the same data
  ## FIXME: this test is not perfect, maybe we should have a parameter here or add the query points...
  if (length(id) == nrow(data)) {
    for (i in seq_along(id)) {
      for (j in seq_along(id[[i]]))
        lines(x = c(data[i, 1], data[id[[i]][j], 1]),
          y = c(data[i, 2], data[id[[i]][j], 2]), col = linecol,
          ...)
    }

    ## add vertices
    points(data[, 1:2], main = main, pch = pch, col = col, ...)

  } else {
    ## add vertices
    points(data[, 1:2], main = main, pch = pch, ...)
    ## use colors if it was from a query
    for (i in seq_along(id)) {
      points(data[id[[i]], ], pch = pch, col = i + 1L)
    }
  }
}


================================================
FILE: R/RcppExports.R
================================================
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

JP_int <- function(nn, kt) {
    .Call(`_dbscan_JP_int`, nn, kt)
}

SNN_sim_int <- function(nn, jp) {
    .Call(`_dbscan_SNN_sim_int`, nn, jp)
}

ANN_cleanup <- function() {
    invisible(.Call(`_dbscan_ANN_cleanup`))
}

comps_kNN <- function(nn, mutual) {
    .Call(`_dbscan_comps_kNN`, nn, mutual)
}

comps_frNN <- function(nn, mutual) {
    .Call(`_dbscan_comps_frNN`, nn, mutual)
}

intToStr <- function(iv) {
    .Call(`_dbscan_intToStr`, iv)
}

dist_subset <- function(dist, idx) {
    .Call(`_dbscan_dist_subset`, dist, idx)
}

XOR <- function(lhs, rhs) {
    .Call(`_dbscan_XOR`, lhs, rhs)
}

dspc <- function(cl_idx, internal_nodes, all_cl_ids, mrd_dist) {
    .Call(`_dbscan_dspc`, cl_idx, internal_nodes, all_cl_ids, mrd_dist)
}

dbscan_int <- function(data, eps, minPts, weights, borderPoints, type, bucketSize, splitRule, approx, frNN) {
    .Call(`_dbscan_dbscan_int`, data, eps, minPts, weights, borderPoints, type, bucketSize, splitRule, approx, frNN)
}

reach_to_dendrogram <- function(reachability, pl_order) {
    .Call(`_dbscan_reach_to_dendrogram`, reachability, pl_order)
}

dendrogram_to_reach <- function(x) {
    .Call(`_dbscan_dendrogram_to_reach`, x)
}

mst_to_dendrogram <- function(mst) {
    .Call(`_dbscan_mst_to_dendrogram`, mst)
}

dbscan_density_int <- function(data, eps, type, bucketSize, splitRule, approx) {
    .Call(`_dbscan_dbscan_density_int`, data, eps, type, bucketSize, splitRule, approx)
}

frNN_int <- function(data, eps, type, bucketSize, splitRule, approx) {
    .Call(`_dbscan_frNN_int`, data, eps, type, bucketSize, splitRule, approx)
}

frNN_query_int <- function(data, query, eps, type, bucketSize, splitRule, approx) {
    .Call(`_dbscan_frNN_query_int`, data, query, eps, type, bucketSize, splitRule, approx)
}

distToAdjacency <- function(constraints, N) {
    .Call(`_dbscan_distToAdjacency`, constraints, N)
}

buildDendrogram <- function(hcl) {
    .Call(`_dbscan_buildDendrogram`, hcl)
}

all_children <- function(hier, key, leaves_only = FALSE) {
    .Call(`_dbscan_all_children`, hier, key, leaves_only)
}

node_xy <- function(cl_tree, cl_hierarchy, cid = 0L) {
    .Call(`_dbscan_node_xy`, cl_tree, cl_hierarchy, cid)
}

simplifiedTree <- function(cl_tree) {
    .Call(`_dbscan_simplifiedTree`, cl_tree)
}

computeStability <- function(hcl, minPts, compute_glosh = FALSE) {
    .Call(`_dbscan_computeStability`, hcl, minPts, compute_glosh)
}

validateConstraintList <- function(constraints, n) {
    .Call(`_dbscan_validateConstraintList`, constraints, n)
}

computeVirtualNode <- function(noise, constraints) {
    .Call(`_dbscan_computeVirtualNode`, noise, constraints)
}

fosc <- function(cl_tree, cid, sc, cl_hierarchy, prune_unstable_leaves = FALSE, cluster_selection_epsilon = 0.0, alpha = 0, useVirtual = FALSE, n_constraints = 0L, constraints = NULL) {
    .Call(`_dbscan_fosc`, cl_tree, cid, sc, cl_hierarchy, prune_unstable_leaves, cluster_selection_epsilon, alpha, useVirtual, n_constraints, constraints)
}

extractUnsupervised <- function(cl_tree, prune_unstable = FALSE, cluster_selection_epsilon = 0.0) {
    .Call(`_dbscan_extractUnsupervised`, cl_tree, prune_unstable, cluster_selection_epsilon)
}

extractSemiSupervised <- function(cl_tree, constraints, alpha = 0, prune_unstable_leaves = FALSE, cluster_selection_epsilon = 0.0) {
    .Call(`_dbscan_extractSemiSupervised`, cl_tree, constraints, alpha, prune_unstable_leaves, cluster_selection_epsilon)
}

kNN_query_int <- function(data, query, k, type, bucketSize, splitRule, approx) {
    .Call(`_dbscan_kNN_query_int`, data, query, k, type, bucketSize, splitRule, approx)
}

kNN_int <- function(data, k, type, bucketSize, splitRule, approx) {
    .Call(`_dbscan_kNN_int`, data, k, type, bucketSize, splitRule, approx)
}

lof_kNN <- function(data, minPts, type, bucketSize, splitRule, approx) {
    .Call(`_dbscan_lof_kNN`, data, minPts, type, bucketSize, splitRule, approx)
}

mrd <- function(dm, cd) {
    .Call(`_dbscan_mrd`, dm, cd)
}

mst <- function(x_dist, n) {
    .Call(`_dbscan_mst`, x_dist, n)
}

hclustMergeOrder <- function(mst, o) {
    .Call(`_dbscan_hclustMergeOrder`, mst, o)
}

optics_int <- function(data, eps, minPts, type, bucketSize, splitRule, approx, frNN) {
    .Call(`_dbscan_optics_int`, data, eps, minPts, type, bucketSize, splitRule, approx, frNN)
}

lowerTri <- function(m) {
    .Call(`_dbscan_lowerTri`, m)
}



================================================
FILE: R/broom-dbscan-tidiers.R
================================================
#' Turn a dbscan clustering object into a tidy tibble
#'
#' Provides [tidy()][generics::tidy()], [augment()][generics::augment()], and
#' [glance()][generics::glance()] verbs for clusterings created with algorithms
#' in package `dbscan` to work with [tidymodels](https://www.tidymodels.org/).
#'
#' @param x A `dbscan` object returned from [dbscan::dbscan()].
#' @param data The data used to create the clustering.
#' @param newdata New data to predict cluster labels for.
#' @param ... further arguments are ignored without a warning.
#'
#' @name dbscan_tidiers
#' @aliases dbscan_tidiers glance tidy augment
#' @family tidiers
#'
#' @seealso [generics::tidy()], [generics::augment()],
#'  [generics::glance()], [dbscan()]
#'
#' @examplesIf requireNamespace("tibble", quietly = TRUE) && identical(Sys.getenv("NOT_CRAN"), "true")
#'
#' data(iris)
#' x <- scale(iris[, 1:4])
#'
#' ## dbscan
#' db <- dbscan(x, eps = .9, minPts = 5)
#' db
#'
#' # summarize model fit with tidiers
#' tidy(db)
#' glance(db)
#'
#' # augment for this model needs the original data
#' augment(db, x)
#'
#' # to augment new data, the original data is also needed
#' augment(db, x, newdata = x[1:5, ])
#'
#' ## hdbscan
#' hdb <- hdbscan(x, minPts = 5)
#'
#' # summarize model fit with tidiers
#' tidy(hdb)
#' glance(hdb)
#'
#' # augment for this model needs the original data
#' augment(hdb, x)
#'
#' # to augment new data, the original data is also needed
#' augment(hdb, x, newdata = x[1:5, ])
#'
#' ## Jarvis-Patrick clustering
#' cl <- jpclust(x, k = 20, kt = 15)
#'
#' # summarize model fit with tidiers
#' tidy(cl)
#' glance(cl)
#'
#' # augment for this model needs the original data
#' augment(cl, x)
#'
#' ## Shared Nearest Neighbor clustering
#' cl <- sNNclust(x, k = 20, eps = 0.8, minPts = 15)
#'
#' # summarize model fit with tidiers
#' tidy(cl)
#' glance(cl)
#'
#' # augment for this model needs the original data
#' augment(cl, x)
#'
NULL

#' @rdname dbscan_tidiers
#' @importFrom generics tidy
#' @export
generics::tidy


#' @rdname dbscan_tidiers
#' @export
tidy.dbscan <- function(x, ...) {
  n_cl <- max(x$cluster)
  size <- table(factor(x$cluster, levels = 0:n_cl))

  tb <- tibble::tibble(cluster = as.factor(0:n_cl),
         size = as.integer(size))

  tb$noise <- tb$cluster == 0L
  tb
}

#' @rdname dbscan_tidiers
#' @export
tidy.hdbscan <- function(x, ...) {
  n_cl <- max(x$cluster)
  size <- table(factor(x$cluster, levels = 0:n_cl))

  tb <- tibble::tibble(cluster = as.factor(0:n_cl),
         size = as.integer(size))
  tb$cluster_score <- as.numeric(x$cluster_scores[as.character(tb$cluster)])
  tb$noise <- tb$cluster == 0L

  tb
}

#' @rdname dbscan_tidiers
#' @export
tidy.general_clustering <- function(x, ...) {
  n_cl <- max(x$cluster)
  size <- table(factor(x$cluster, levels = 0:n_cl))

  tb <- tibble::tibble(cluster = as.factor(0:n_cl),
         size = as.integer(size))
  tb$noise <- tb$cluster == 0L

  tb
}


## augment

#' @importFrom generics augment
#' @rdname dbscan_tidiers
#' @export
generics::augment


#' @rdname dbscan_tidiers
#' @export
augment.dbscan <- function(x, data = NULL, newdata = NULL, ...) {
  n_cl <- max(x$cluster)

  if (is.null(data) && is.null(newdata))
    stop("Must specify either `data` or `newdata` argument.")

  if (is.null(data) || nrow(data) != length(x$cluster)) {
    stop("The original data needs to be passed as data.")
  }

  if (is.null(newdata)) {
    tb <- tibble::as_tibble(data)
    tb$.cluster <- factor(x$cluster, levels = 0:n_cl)
  } else {
    tb <- tibble::as_tibble(newdata)
    tb$.cluster <- factor(predict(x,
                                  newdata = newdata,
                                  data = data), levels = 0:n_cl)
  }

  tb$noise <- tb$.cluster == 0L

  tb
}

#' @rdname dbscan_tidiers
#' @export
augment.hdbscan <- function(x, data = NULL, newdata = NULL, ...) {
  n_cl <- max(x$cluster)

  if (is.null(data) || nrow(data) != length(x$cluster)) {
    stop("The original data needs to be passed as data.")
  }

  if (is.null(newdata)) {
    tb <- tibble::as_tibble(data)
    tb$.cluster <- factor(x$cluster, levels = 0:n_cl)
    tb$.coredist <- x$coredist
    tb$.membership_prob <- x$membership_prob
    tb$.outlier_scores <- x$outlier_scores
  } else {
    tb <- tibble::as_tibble(newdata)
    tb$.cluster <- factor(
        predict(x, newdata = newdata, data = data), levels = 0:n_cl)
    tb$.coredist <- NA_real_
    tb$.membership_prob <- NA_real_
    tb$.outlier_scores <- NA_real_
  }

  tb
}

#' @rdname dbscan_tidiers
#' @export
augment.general_clustering <- function(x, data = NULL, newdata = NULL, ...) {
  n_cl <- max(x$cluster)

  if (is.null(data) || nrow(data) != length(x$cluster)) {
    stop("The original data needs to be passed as data.")
  }

  if (is.null(newdata)) {
    tb <- tibble::as_tibble(data)
    tb$.cluster <- factor(x$cluster, levels = 0:n_cl)
  } else {
    stop("augmenting new data is not supported.")
  }

  tb
}



## glance
#' @importFrom generics glance
#' @rdname dbscan_tidiers
#' @export
generics::glance


#' @rdname dbscan_tidiers
#' @export
glance.dbscan <- function(x, ...) {
  tibble::tibble(
    nobs = length(x$cluster),
    n.clusters = length(table(x$cluster[x$cluster != 0L])),
    nexcluded = sum(x$cluster == 0L)
  )
}

#' @rdname dbscan_tidiers
#' @export
glance.hdbscan <- function(x, ...) {
  tibble::tibble(
    nobs = length(x$cluster),
    n.clusters = length(table(x$cluster[x$cluster != 0L])),
    nexcluded = sum(x$cluster == 0L)
  )
}

#' @rdname dbscan_tidiers
#' @export
glance.general_clustering <- function(x, ...) {
  tibble::tibble(
    nobs = length(x$cluster),
    n.clusters = length(table(x$cluster[x$cluster != 0L])),
    nexcluded = sum(x$cluster == 0L)
  )
}



================================================
FILE: R/comps.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2017 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Find Connected Components in a Nearest-neighbor Graph
#'
#' Generic function and methods to find connected components in nearest neighbor graphs.
#'
#' Note that for kNN graphs, one point may be in the kNN of the other but not vice versa.
#' `mutual = TRUE` requires that both points are in each other's kNN.
#'
#' @family NN functions
#' @aliases components
#'
#' @param x the [NN] object representing the graph or a [dist] object
#' @param eps threshold on the distance
#' @param mutual for a pair of points, do both have to be in each other's neighborhood?
#' @param ... further arguments are currently unused.
#'
#' @return an integer vector with component assignments.
#'
#' @author Michael Hahsler
#' @keywords model
#' @examples
#' set.seed(665544)
#' n <- 100
#' x <- cbind(
#'   x=runif(10, 0, 5) + rnorm(n, sd = 0.4),
#'   y=runif(10, 0, 5) + rnorm(n, sd = 0.4)
#'   )
#' plot(x, pch = 16)
#'
#' # Connected components on a graph where each pair of points
#' # with a distance less or equal to eps are connected
#' d <- dist(x)
#' components <- comps(d, eps = .8)
#' plot(x, col = components, pch = 16)
#'
#' # Connected components in a fixed radius nearest neighbor graph
#' # Gives the same result as the threshold on the distances above
#' frnn <- frNN(x, eps = .8)
#' components <- comps(frnn)
#' plot(frnn, data = x, col = components)
#'
#' # Connected components on a k nearest neighbors graph
#' knn <- kNN(x, 3)
#' components <- comps(knn, mutual = FALSE)
#' plot(knn, data = x, col = components)
#'
#' components <- comps(knn, mutual = TRUE)
#' plot(knn, data = x, col = components)
#'
#' # Connected components in a shared nearest neighbor graph
#' snn <- sNN(x, k = 10, kt = 5)
#' components <- comps(snn)
#' plot(snn, data = x, col = components)
#' @export
comps <- function(x, ...) UseMethod("comps", x)

#' @rdname comps
#' @export
comps.dist <- function(x, eps, ...)
  stats::cutree(stats::hclust(x, method = "single"), h = eps)

#' @rdname comps
#' @export
comps.kNN <- function(x, mutual = FALSE, ...)
  as.integer(factor(comps_kNN(x$id, as.logical(mutual))))

# sNN and frNN are symmetric so no need for mutual
#' @rdname comps
#' @export
comps.sNN <- function(x, ...) comps.kNN(x, mutual = FALSE)

#' @rdname comps
#' @export
comps.frNN <- function(x, ...) comps_frNN(x$id, mutual = FALSE)
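
The kNN component search done by the compiled `comps_kNN()` can be sketched with a union-find structure. The following Python sketch (hypothetical names, 0-based indices; not the package's code) shows the role of `mutual`: with `mutual = TRUE`, an edge only counts if both points list each other as neighbors:

```python
def knn_components(id_lists, mutual=False):
    """Connected components of a kNN graph given as adjacency lists
    (id_lists[i] holds the 0-based kNN indices of point i)."""
    n = len(id_lists)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    for i, nbrs in enumerate(id_lists):
        for j in nbrs:
            # with mutual=True keep only reciprocated kNN edges
            if not mutual or i in id_lists[j]:
                union(i, j)

    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(n)]
```

Since sNN and frNN graphs are symmetric, `comps.sNN()` and `comps.frNN()` above always use the non-mutual variant.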


================================================
FILE: R/dbcv.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2024 Michael Hahsler, Matt Piekenbrock

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.


#' Density-Based Clustering Validation Index (DBCV)
#'
#' Calculate the Density-Based Clustering Validation Index (DBCV)  for a
#' clustering.
#'
#' DBCV (Moulavi et al, 2014) computes a score based on the density sparseness of each cluster
#' and the density separation of each pair of clusters.
#'
#' The density sparseness of a cluster (DSC) is defined as the maximum edge weight of
#' a minimal spanning tree for the internal points of the cluster using the mutual
#' reachability distance based on the all-points-core-distance. Internal points
#' are connected to more than one other point in the cluster. Since clusters of
#' a size less than 3 cannot have internal points, they are ignored (considered
#' noise) in this implementation.
#'
#' The density separation of a pair of clusters (DSPC)
#' is defined as the minimum reachability distance between the internal nodes of
#' the spanning trees of the two clusters.
#'
#' The validity index for a cluster is calculated using these measures and aggregated
#' to a validity index for the whole clustering using a weighted average.
#'
#' The index is in the range \eqn{[-1,1]}. If the cluster density compactness is better
#' than the density separation, a positive value is returned. The actual value depends
#' on the separability of the data. In general, greater values
#' of the measure indicate a better density-based clustering solution.
#'
#' Noise points are included in the calculation only in the weighted average,
#' therefore a clustering with more noise points will get a lower index.
#'
#' **Performance note:** This implementation calculates a distance matrix and thus
#' can only be used for small or sampled datasets.
#'
#' @aliases dbcv DBCV
#' @family Evaluation Functions
#'
#' @param x a data matrix or a dist object.
#' @param cl a clustering (e.g., an integer vector)
#' @param d dimensionality of the original data if a dist object is provided.
#' @param metric distance metric used. The available metrics are the methods
#'        implemented by `dist()` plus `"sqeuclidean"` for the squared
#'        Euclidean distance used in the original DBCV implementation.
#' @param sample sample size used for large datasets.
#'
#' @return A list with the DBCV `score` for the clustering,
#'   the density sparseness of cluster (`dsc`) values,
#'   the density separation of pairs of clusters (`dspc`) distances,
#'   and the validity indices of clusters (`c_c`).
#'
#' @author Matt Piekenbrock and Michael Hahsler
#' @references Davoud Moulavi and Pablo A. Jaskowiak and
#' Ricardo J. G. B. Campello and Arthur Zimek and Jörg Sander (2014).
#' Density-Based Clustering Validation. In
#' _Proceedings of the 2014 SIAM International Conference on Data Mining,_
#' pages 839-847
#' \doi{10.1137/1.9781611973440.96}
#'
#' Pablo A. Jaskowiak (2022). MATLAB implementation of DBCV.
#' \url{https://github.com/pajaskowiak/dbcv}
#' @examples
#' # Load a test dataset
#' data(Dataset_1)
#' x <- Dataset_1[, c("x", "y")]
#' class <- Dataset_1$class
#'
#' clplot(x, class)
#'
#' # We use MinPts 3 and use the knee at eps = .1 for dbscan
#' kNNdistplot(x, minPts = 3)
#'
#' cl <- dbscan(x, eps = .1, minPts = 3)
#' clplot(x, cl)
#'
#' dbcv(x, cl)
#'
#' # compare to the DBCV index on the original class labels and
#' # with a random partitioning
#' dbcv(x, class)
#' dbcv(x, sample(1:4, replace = TRUE, size = nrow(x)))
#'
#' # find the best eps using dbcv
#' eps_grid <- seq(.05,.2, by = .01)
#' cls <- lapply(eps_grid, FUN = function(e) dbscan(x, eps = e, minPts = 3))
#' dbcvs <- sapply(cls, FUN = function(cl) dbcv(x, cl)$score)
#'
#' plot(eps_grid, dbcvs, type = "l")
#'
#' eps_opt <- eps_grid[which.max(dbcvs)]
#' eps_opt
#'
#' cl <- dbscan(x, eps = eps_opt, minPts = 3)
#' clplot(x, cl)
#' @export
dbcv <- function(x,
                 cl,
                 d,
                 metric = "euclidean",
                 sample = NULL) {
  # a clustering with a cluster element
  if (is.list(cl)) {
    cl <- cl$cluster
  }

  if (inherits(x, "dist")) {
    xdist <- x
    if (missing(d))
      stop("d needs to be specified if a distance matrix is supplied!")

  } else if (.matrixlike(x)) {
    if (!is.null(sample)) {
      take <- sample(nrow(x), size = sample)
      x <- x[take, ]
      cl <- cl[take]
    }

    x <- as.matrix(x)
    if (!missing(d) && d != ncol(x))
      stop("d does not match the number of columns in x!")
    d <- ncol(x)

    if (pmatch(metric, "sqeuclidean", nomatch = 0))
      xdist <- dist(x, method = "euclidean")^2
    else
      xdist <- dist(x, method = metric)

  } else
    stop("'dbcv' expects x to be a matrix to calculate distances.")

  .check_dist(xdist)
  n <- attr(xdist, "Size")

  # in case we get a factor
  cl <- as.integer(cl)

  if (length(cl) != n)
    stop("cl does not match the number of rows in x!")

  ## calculate everything for all non-noise points ordered by cluster
  ## getClusterIdList removes noise points and singleton clusters
  ## and returns indices reorder by cluster
  cl_idx_list <- getClusterIdList(cl)
  n_cl <- length(cl_idx_list)
  ## reordered distances w/o noise
  all_dist <- dist_subset(xdist, unlist(cl_idx_list))

  new_cl_idx_list <- list()
  i <- 1L
  start <- 1
  for(l in lengths(cl_idx_list)) {
    end <- start + l - 1
    new_cl_idx_list[[i]] <- seq(start, end)
    start <- end + 1
    i <- i + 1L
  }

  cl_idx_list <- new_cl_idx_list
  all_idx <- unlist(cl_idx_list)


  ## 1. Calculate all-points-core-distance
  ## Calculate the all-points-core-distance for each point, within each cluster
  ## Note: this needs the dimensionality of the data d
  all_pts_core_dist <- unlist(lapply(
    cl_idx_list,
    FUN = function(ids) {
      dists <- (rowSums(as.matrix((
        1 / dist_subset(all_dist, ids)
      )^d)) / (length(ids) - 1))^(-1 / d)
    }
  ))

  ## 2. Create for each cluster a mutual reachability MSTs
  all_mrd <- structure(mrd(all_dist, all_pts_core_dist),
                       class = "dist",
                       Size = length(all_idx))
  ## Noise points are removed, but the index is affected by dividing by the
  ## total number of objects including the noise points (n)!

  ## mst is a matrix with columns: from to and weight
  mrd_graphs <- lapply(cl_idx_list, function(idx) {
    mst(x_dist = dist_subset(all_mrd, idx), n = length(idx))
  })

  ## 3. Density Sparseness of a Cluster (DSC):
  ## The maximum edge weight of the internal edges in the cluster's
  ## mutual reachability MST.

  ## find internal nodes for DSC and DSPC. Internal nodes have a degree > 1
  internal_nodes <- lapply(mrd_graphs, function(mst) {
    node_deg <- table(c(mst[, 1], mst[, 2]))
    idx <- as.integer(names(node_deg)[node_deg > 1])
    idx
  })

  dsc <- mapply(function(mst, int_idx) {
    # find internal edges
    int_edge_idx <- which((mst[, 1L] %in% int_idx) &
                            (mst[, 2L] %in% int_idx))
    if (length(int_edge_idx) == 0L) {
      return(max(mst[, 3L]))
    }
    max(mst[int_edge_idx, 3L])
  }, mrd_graphs, internal_nodes)


  ## 4. Density Separation of a Pair of Clusters (DSPC):
  ## The minimum reachability distance between the internal nodes of
  ## a pair of MST_MRD's of clusters Ci and Cj
  dspc_dist <- dspc(cl_idx_list, internal_nodes, all_idx, all_mrd)
  # returns a matrix with Ci, Cj, dist

  # make it into a full distance matrix
  dspc_dist <- dspc_dist[, 3L]
  class(dspc_dist) <- "dist"
  attr(dspc_dist, "Size") <- n_cl
  attr(dspc_dist, "Diag") <- FALSE
  attr(dspc_dist, "Upper") <- FALSE

  dspc_mm <- as.matrix(dspc_dist)
  diag(dspc_mm) <- NA

  ## 5. Validity index of a cluster:
  min_separation <- apply(dspc_mm, MARGIN = 1, min, na.rm = TRUE)
  v_c <- (min_separation - dsc) / pmax(min_separation, dsc)


  ## 6. Validity index for the whole clustering
  res <- sum(lengths(cl_idx_list) / n * v_c)

  return(list(
    score = res,
    n = n,
    n_c = lengths(cl_idx_list),
    d = d,
    dsc = dsc,
    dspc = dspc_dist,
    v_c = v_c
  ))
}


getClusterIdList <- function(cl) {
  ## In DBCV, singletons are ambiguously defined. However, they cannot be
  ## considered valid clusters, for reasons listed in section 4 of the
  ## original paper.
  ## Clusters with less than 3 points cannot have internal nodes, so we need to
  ## ignore them as well.
  ## To ensure coverage, they are assigned into the noise category.
  cl_freq <- table(cl)
  cl[cl %in% as.integer(names(which(cl_freq < 3)))] <- 0L
  if (all(cl == 0)) {
    return(0)
  }

  cl_ids <- unique(cl)            # all cluster ids
  cl_valid <- cl_ids[cl_ids != 0] # valid cluster indices (non-noise)
  n_cl <- length(cl_valid)        # number of clusters

  ## 1 or 0 clusters results in worst score + a warning
  if (n_cl <= 1) {
    warning("DBCV is undefined for fewer than 2 non-noise clusters with more than 2 member points.")
    return(-1L)
  }

  ## Indexes
  cl_ids_idx <- lapply(cl_valid, function(id)
    sort(which(cl == id))) ## the sort is important for indexing purposes
  return(cl_ids_idx)
}
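The score assembled above weights each per-cluster validity index v_c by the cluster's share of the data. A minimal usage sketch (assuming the installed dbscan package and its bundled moons data set; the exact score and number of clusters found are data-dependent, so no specific values are claimed):

```r
# Hedged sketch: scoring a DBSCAN clustering with dbcv().
library(dbscan)
data(moons)
x <- as.matrix(moons)

cl <- dbscan(x, eps = 0.4, minPts = 5)

# cl may be passed directly; dbcv() extracts its $cluster element
v <- dbcv(x, cl)
v$score   # weighted sum of per-cluster validity indices v_c
v$v_c     # per-cluster indices; higher values indicate better separation
```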


================================================
FILE: R/dbscan.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.


#' Density-based Spatial Clustering of Applications with Noise (DBSCAN)
#'
#' Fast reimplementation of the DBSCAN (Density-based spatial clustering of
#' applications with noise) clustering algorithm using a kd-tree.
#'
#' The
#' implementation is significantly faster and can work with larger data sets
#' than [fpc::dbscan()] in \pkg{fpc}. Use `dbscan::dbscan()` (specifying the package) to
#' call this implementation when you also load package \pkg{fpc}.
#'
#' **The algorithm**
#'
#' This implementation of DBSCAN follows the original
#' algorithm as described by Ester et al (1996). DBSCAN performs the following steps:
#'
#' 1. Estimate the density
#'   around each data point by counting the number of points in a user-specified
#'   eps-neighborhood and apply a user-specified minPts threshold to identify
#'      - core points (points with more than minPts points in their neighborhood),
#'      - border points (non-core points with a core point in their neighborhood) and
#'      - noise points (all other points).
#' 2. Core points form the backbone of clusters: core points are joined into
#'   a cluster if they are density-reachable from each other (i.e., there is a chain of core
#'   points where one falls inside the eps-neighborhood of the next).
#' 3. Border points are assigned to clusters. The algorithm needs parameters
#'   `eps` (the radius of the epsilon neighborhood) and `minPts` (the
#'   density threshold).
#'
#' Border points are arbitrarily assigned to clusters in the original
#' algorithm. DBSCAN* (see Campello et al 2013) treats all border points as
#' noise points. This is implemented with `borderPoints = FALSE`.
#'
#' **Specifying the data**
#'
#' If `x` is a matrix or a data.frame, then fast fixed-radius nearest
#' neighbor computation using a kd-tree is performed using Euclidean distance.
#' See [frNN()] for more information on the parameters related to
#' nearest neighbor search. **Note** that only numerical values are allowed in `x`.
#'
#' Any precomputed distance matrix (dist object) can be specified as `x`.
#' You may run into memory issues since distance matrices are large.
#'
#' A precomputed frNN object can be supplied as `x`. In this case
#' `eps` does not need to be specified. This option is useful for large
#' data sets, where a sparse distance matrix is available. See
#' [frNN()] for how to create frNN objects.
#'
#' **Setting parameters for DBSCAN**
#'
#' The parameters `minPts` and `eps` define the minimum density required
#' in the area around core points which form the backbone of clusters.
#' `minPts` is the number of points
#' required in the neighborhood around the point defined by the parameter `eps`
#' (i.e., the radius around the point). Both parameters
#' depend on each other and changing one typically requires changing
#' the other one as well. The parameters also depend on the size of the data set with
#' larger datasets requiring a larger `minPts` or a smaller `eps`.
#'
#' * `minPts:` The original
#' DBSCAN paper (Ester et al, 1996) suggests starting with \eqn{\text{minPts} \ge d + 1},
#' the data dimensionality plus one or higher, with a minimum of 3. Larger values
#' are preferable since increasing the parameter suppresses more noise in the data
#' by requiring more points to form clusters.
#' Sander et al (1998) use two times the data dimensionality in their examples.
#' Note that setting \eqn{\text{minPts} \le 2} is equivalent to hierarchical clustering
#' with the single link metric and the dendrogram cut at height `eps`.
#'
#' * `eps:` A suitable neighborhood size
#' parameter `eps` given a fixed value for `minPts` can be found
#' visually by inspecting the [kNNdistplot()] of the data using
#' \eqn{k = \text{minPts} - 1} (`minPts` includes the point itself, while the
#' k-nearest neighbors distance does not). The k-nearest neighbor distance plot
#' sorts all data points by their k-nearest neighbor distance. A sudden
#' increase of the kNN distance (a knee) indicates that the points to the right
#' are most likely outliers. Choose `eps` for DBSCAN where the knee is.
#'
#' **Predict cluster memberships**
#'
#' [predict()] can be used to predict cluster memberships for new data
#' points. A point is considered a member of a cluster if it is within the eps
#' neighborhood of a core point of the cluster. Points
#' which cannot be assigned to a cluster will be reported as
#' noise points (i.e., cluster ID 0).
#' **Important note:** `predict()` currently can only use Euclidean distance to determine
#' the neighborhood of core points. If `dbscan()` was called using distances other than Euclidean,
#' then the neighborhood calculation will not be correct and only approximated by Euclidean
#' distances. If the data contain factor columns (e.g., using Gower's distance), then
#' the factors in `data` and `query` first need to be converted to numeric to use the
#' Euclidean approximation.
#'
#'
#' @aliases dbscan DBSCAN print.dbscan_fast
#' @family clustering functions
#'
#' @param x a data matrix, a data.frame, a [dist] object or a [frNN] object with
#' fixed-radius nearest neighbors.
#' @param eps size (radius) of the epsilon neighborhood. Can be omitted if
#' `x` is a frNN object.
#' @param minPts number of minimum points required in the eps neighborhood for
#' core points (including the point itself).
#' @param weights numeric; weights for the data points. Only needed to perform
#' weighted clustering.
#' @param borderPoints logical; should border points be assigned to clusters.
#' The default is `TRUE` for regular DBSCAN. If `FALSE` then border
#' points are considered noise (see DBSCAN* in Campello et al, 2013).
#' @param ...  additional arguments are passed on to the fixed-radius nearest
#' neighbor search algorithm. See [frNN()] for details on how to
#' control the search strategy.
#'
#' @return `dbscan()` returns an object of class `dbscan_fast` with the following components:
#'
#' \item{eps }{ value of the `eps` parameter.}
#' \item{minPts }{ value of the `minPts` parameter.}
#' \item{metric }{ the distance metric used.}
#' \item{cluster }{An integer vector with cluster assignments. Zero indicates noise points.}
#'
#' `is.corepoint()` returns a logical vector indicating for each data point if it is a
#'   core point.
#'
#' @author Michael Hahsler
#' @references Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast
#' Density-Based Clustering with R.  _Journal of Statistical Software,_
#' 91(1), 1-30.
#' \doi{10.18637/jss.v091.i01}
#'
#' Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A
#' Density-Based Algorithm for Discovering Clusters in Large Spatial Databases
#' with Noise. Institute for Computer Science, University of Munich.
#' _Proceedings of 2nd International Conference on Knowledge Discovery and
#' Data Mining (KDD-96),_ 226-231.
#' \url{https://dl.acm.org/doi/10.5555/3001460.3001507}
#'
#' Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). Density-Based
#' Clustering Based on Hierarchical Density Estimates. Proceedings of the
#' 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD
#' 2013, _Lecture Notes in Computer Science_ 7819, p. 160.
#' \doi{10.1007/978-3-642-37456-2_14}
#'
#' Sander, J., Ester, M., Kriegel, HP. et al. (1998). Density-Based
#' Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications.
#' _Data Mining and Knowledge Discovery_ 2, 169-194.
#' \doi{10.1023/A:1009745219419}
#'
#' @keywords model clustering
#' @examples
#' ## Example 1: use dbscan on the iris data set
#' data(iris)
#' iris <- as.matrix(iris[, 1:4])
#'
#' ## Find suitable DBSCAN parameters:
#' ## 1. We use minPts = dim + 1 = 5 for iris. A larger value can also be used.
#' ## 2. We inspect the k-NN distance plot for k = minPts - 1 = 4
#' kNNdistplot(iris, minPts = 5)
#'
#' ## Noise seems to start around a 4-NN distance of .7
#' abline(h=.7, col = "red", lty = 2)
#'
#' ## Cluster with the chosen parameters
#' res <- dbscan(iris, eps = .7, minPts = 5)
#' res
#'
#' pairs(iris, col = res$cluster + 1L)
#' clplot(iris, res)
#'
#' ## Use a precomputed frNN object
#' fr <- frNN(iris, eps = .7)
#' dbscan(fr, minPts = 5)
#'
#' ## Example 2: use data from fpc
#' set.seed(665544)
#' n <- 100
#' x <- cbind(
#'   x = runif(10, 0, 10) + rnorm(n, sd = 0.2),
#'   y = runif(10, 0, 10) + rnorm(n, sd = 0.2)
#'   )
#'
#' res <- dbscan(x, eps = .3, minPts = 3)
#' res
#'
#' ## plot clusters and add noise (cluster 0) as crosses.
#' plot(x, col = res$cluster)
#' points(x[res$cluster == 0, ], pch = 3, col = "grey")
#'
#' clplot(x, res)
#' hullplot(x, res)
#'
#' ## Predict cluster membership for new data points
#' ## (Note: 0 means it is predicted as noise)
#' newdata <- x[1:5,] + rnorm(10, 0, .3)
#' hullplot(x, res)
#' points(newdata, pch = 3 , col = "red", lwd = 3)
#' text(newdata, pos = 1)
#'
#' pred_label <- predict(res, newdata, data = x)
#' pred_label
#' points(newdata, col = pred_label + 1L,  cex = 2, lwd = 2)
#'
#' ## Compare speed against fpc version (if microbenchmark is installed)
#' ## Note: we use dbscan::dbscan to make sure that we do not run the
#' ## implementation in fpc.
#' \dontrun{
#' if (requireNamespace("fpc", quietly = TRUE) &&
#'     requireNamespace("microbenchmark", quietly = TRUE)) {
#'   t_dbscan <- microbenchmark::microbenchmark(
#'     dbscan::dbscan(x, .3, 3), times = 10, unit = "ms")
#'   t_dbscan_linear <- microbenchmark::microbenchmark(
#'     dbscan::dbscan(x, .3, 3, search = "linear"), times = 10, unit = "ms")
#'   t_dbscan_dist <- microbenchmark::microbenchmark(
#'     dbscan::dbscan(x, .3, 3, search = "dist"), times = 10, unit = "ms")
#'   t_fpc <- microbenchmark::microbenchmark(
#'     fpc::dbscan(x, .3, 3), times = 10, unit = "ms")
#'
#'   r <- rbind(t_fpc, t_dbscan_dist, t_dbscan_linear, t_dbscan)
#'   r
#'
#'   boxplot(r,
#'     names = c('fpc', 'dbscan (dist)', 'dbscan (linear)', 'dbscan (kdtree)'),
#'     main = "Runtime comparison in ms")
#'
#'   ## speedup of the kd-tree-based version compared to the fpc implementation
#'   median(t_fpc$time) / median(t_dbscan$time)
#' }}
#'
#' ## Example 3: manually create a frNN object for dbscan (dbscan only needs ids and eps)
#' nn <- structure(list(id = list(c(2,3), c(1,3), c(1,2,3), c(3,5), c(4,5)), eps = 1),
#'   class =  c("NN", "frNN"))
#' nn
#' dbscan(nn, minPts = 2)
#'
#' @export
dbscan <-
  function(x,
    eps,
    minPts = 5,
    weights = NULL,
    borderPoints = TRUE,
    ...) {
    if (inherits(x, "frNN") && missing(eps)) {
      eps <- x$eps
      dist_method <- x$metric
    }

    if (inherits(x, "dist")) {
      .check_dist(x)
      dist_method <- attr(x, "method")
    } else
      dist_method <- "euclidean"

    dist_method <- dist_method %||% "unknown"

    ### extra contains settings for frNN
    ### search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0
    ### also check for MinPts for fpc compatibility (does not work for
    ### search method dist)
    extra <- list(...)
    args <-
      c("MinPts", "search", "bucketSize", "splitRule", "approx")
    m <- pmatch(names(extra), args)
    if (anyNA(m))
      stop("Unknown parameter: ",
        toString(names(extra)[is.na(m)]))
    names(extra) <- args[m]

    # fpc compatibility
    if (!is.null(extra$MinPts)) {
      warning("converting argument MinPts (fpc) to minPts (dbscan)!")
      minPts <- extra$MinPts
      extra$MinPts <- NULL
    }

    search <- .parse_search(extra$search %||% "kdtree")
    splitRule <- .parse_splitRule(extra$splitRule %||% "suggest")
    bucketSize <- as.integer(extra$bucketSize %||% 10L)
    approx <- as.integer(extra$approx %||% 0L)

    ### do dist search
    if (search == 3L && !inherits(x, "dist")) {
      if (.matrixlike(x))
        x <- dist(x)
      else
        stop("x needs to be a matrix to calculate distances")
    }

    ## for dist we provide the R code with a frNN list and no x
    frNN <- list()
    if (inherits(x, "dist")) {
      frNN <- frNN(x, eps, ...)$id
      x <- matrix(0.0, nrow = 0, ncol = 0)
    } else if (inherits(x, "frNN")) {
      if (x$eps != eps) {
        eps <- x$eps
        warning("Using the eps of ",
          eps,
          " provided in the fixed-radius NN object.")
      }
      frNN <- x$id
      x <- matrix(0.0, nrow = 0, ncol = 0)

    } else {
      if (!.matrixlike(x))
        stop("x needs to be a matrix or data.frame.")
      ## make sure x is numeric
      x <- as.matrix(x)
      if (storage.mode(x) == "integer")
        storage.mode(x) <- "double"
      if (storage.mode(x) != "double")
        stop("all data in x has to be numeric.")
    }

    if (length(frNN) == 0 && anyNA(x))
      stop("data/distances cannot contain NAs for dbscan (with kd-tree)!")

    ## add self match and use C numbering if frNN is used
    if (length(frNN) > 0L)
      frNN <-
      lapply(
        seq_along(frNN),
        FUN = function(i)
          c(i - 1L, frNN[[i]] - 1L)
      )

    if (length(minPts) != 1L ||
        !is.finite(minPts) ||
        minPts < 0)
      stop("minPts needs to be a single integer >= 0.")

    if (is.null(eps) ||
        is.na(eps) || eps < 0)
      stop("eps needs to be >=0.")

    ret <- dbscan_int(
      x,
      as.double(eps),
      as.integer(minPts),
      as.double(weights),
      as.integer(borderPoints),
      as.integer(search),
      as.integer(bucketSize),
      as.integer(splitRule),
      as.double(approx),
      frNN
    )

    structure(
      list(
        cluster = ret,
        eps = eps,
        minPts = minPts,
        metric = dist_method,
        borderPoints = borderPoints
      ),
      class = c("dbscan_fast", "dbscan")
    )
  }

#' @export
print.dbscan_fast <- function(x, ...) {
  writeLines(c(
    paste0("DBSCAN clustering for ", nobs(x), " objects."),
    paste0("Parameters: eps = ", x$eps, ", minPts = ", x$minPts),
    paste0(
      "Using ",
      x$metric,
      " distances and borderpoints = ",
      x$borderPoints
    ),
    paste0(
      "The clustering contains ",
      ncluster(x),
      " cluster(s) and ",
      nnoise(x),
      " noise points."
    )
  ))

  print(table(x$cluster))
  cat("\n")

  writeLines(strwrap(paste0(
    "Available fields: ",
    toString(names(x))
  ), exdent = 18))
}

#' @rdname dbscan
#' @export
is.corepoint <- function(x, eps, minPts = 5, ...)
  lengths(frNN(x, eps = eps, ...)$id) >= (minPts - 1)
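As a quick illustration of `dbscan()` and `is.corepoint()` working together (a sketch, assuming the installed package; it follows the iris parameterization used in the examples above):

```r
# Hedged sketch: distinguishing core, border, and noise points.
library(dbscan)
data(iris)
x <- as.matrix(iris[, 1:4])

res  <- dbscan(x, eps = .7, minPts = 5)        # $cluster: 0 = noise
core <- is.corepoint(x, eps = .7, minPts = 5)  # TRUE for core points

# Border points are the non-core members of a cluster
table(cluster = res$cluster, core = core)
```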


================================================
FILE: R/dendrogram.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler, Matt Piekenbrock

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Coercions to Dendrogram
#'
#' Provides a new generic function to coerce objects to dendrograms with
#' [stats::as.dendrogram()] as the default. Additional methods for
#' [hclust], [hdbscan] and [reachability] objects are provided.
#'
#' Coercion methods for
#' [hclust], [hdbscan] and [reachability] objects to [dendrogram] are provided.
#'
#' The coercion from `hclust` is a faster C++ reimplementation of the coercion in
#' package `stats`. The original implementation can be called
#' using [stats::as.dendrogram()].
#'
#' The coercion from [hdbscan] builds the non-simplified HDBSCAN hierarchy as a
#' dendrogram object.
#'
#' @name dendrogram
#' @aliases dendrogram
#'
#' @param object the object
#' @param ... further arguments
NULL

#' @rdname dendrogram
#' @export
as.dendrogram <- function (object, ...) {
  UseMethod("as.dendrogram", object)
}

#' @rdname dendrogram
#' @export
as.dendrogram.default <- function (object, ...)
  stats::as.dendrogram(object, ...)

## this is a replacement for stats::as.dendrogram for hclust
#' @rdname dendrogram
#' @export
as.dendrogram.hclust <- function(object, ...) {
  return(buildDendrogram(object))
}

#' @rdname dendrogram
#' @export
as.dendrogram.hdbscan <- function(object, ...) {
  return(buildDendrogram(object$hc))
}

#' @rdname dendrogram
#' @export
as.dendrogram.reachability <- function(object, ...) {
  if (sum(is.infinite(object$reachdist)) > 1)
    stop(
      "Multiple Infinite reachability distances found. Reachability plots can only be converted if they contain enough information to fully represent the dendrogram structure. If using OPTICS, a larger eps value (such as Inf) may be needed in the parameterization."
    )
  #dup_x <- object
  c_order <- order(object$reachdist) - 1
  # dup_x$order <- dup_x$order - 1
  #q_order <- sapply(c_order, function(i) which(dup_x$order == i))
  res <- reach_to_dendrogram(object, c_order)
  # res <- dendrapply(res, function(leaf) { new_leaf <- leaf[[1]]; attributes(new_leaf) <- attributes(leaf); new_leaf })

  # add mid points for plotting
  res <- .midcache.dendrogram(res)

  res
}

# calculate midpoints for dendrogram
# from stats, but not exported
# see stats:::midcache.dendrogram

.midcache.dendrogram <- function(x, type = "hclust", quiet = FALSE) {
  type <- match.arg(type)
  stopifnot(inherits(x, "dendrogram"))
  verbose <- getOption("verbose", 0) >= 2
  setmid <- function(d, type) {
    depth <- 0L
    kk <- integer()
    jj <- integer()
    dd <- list()
    repeat {
      if (!is.leaf(d)) {
        k <- length(d)
        if (k < 1)
          stop("dendrogram node with non-positive #{branches}")
        depth <- depth + 1L
        if (verbose)
          cat(sprintf(" depth(+)=%4d, k=%d\n", depth,
            k))
        kk[depth] <- k
        if (storage.mode(jj) != storage.mode(kk))
          storage.mode(jj) <- storage.mode(kk)
        dd[[depth]] <- d
        d <- d[[jj[depth] <- 1L]]
        next
      }
      while (depth) {
        k <- kk[depth]
        j <- jj[depth]
        r <- dd[[depth]]
        r[[j]] <- unclass(d)
        if (j < k)
          break
        depth <- depth - 1L
        if (verbose)
          cat(sprintf(" depth(-)=%4d, k=%d\n", depth,
            k))
        midS <- sum(vapply(r, .midDend, 0))
        if (!quiet && type == "hclust" && k != 2)
          warning("midcache() of non-binary dendrograms only partly implemented")
        attr(r, "midpoint") <- (.memberDend(r[[1L]]) +
            midS) / 2
        d <- r
      }
      if (!depth)
        break
      dd[[depth]] <- r
      d <- r[[jj[depth] <- j + 1L]]
    }
    d
  }
  setmid(x, type = type)
}

.midDend <- function(x) {
  attr(x, "midpoint") %||% 0
}

.memberDend <- function(x) {
  attr(x, "x.member") %||% attr(x, "members") %||% 1
}
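The `as.dendrogram()` generic defined above can be exercised on an HDBSCAN result (a sketch, assuming the installed package and its bundled moons data; the hdbscan call mirrors the package's own examples):

```r
# Hedged sketch: converting an HDBSCAN hierarchy to a dendrogram.
library(dbscan)
data(moons)

cl <- hdbscan(moons, minPts = 5)

# dbscan's method builds the full (non-simplified) hierarchy from cl$hc
dend <- as.dendrogram(cl)
plot(dend, leaflab = "none")
```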


================================================
FILE: R/extractFOSC.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler, Matt Piekenbrock

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Framework for the Optimal Extraction of Clusters from Hierarchies
#'
#' Generic reimplementation of the _Framework for Optimal Selection of Clusters_
#' (FOSC; Campello et al, 2013) to extract clusterings from hierarchical clustering (i.e.,
#' [hclust] objects).
#' Can be parameterized to perform unsupervised
#' cluster extraction through a stability-based measure, or semisupervised
#' cluster extraction through either a constraint-based extraction (with a
#' stability-based tiebreaker) or a mixed (weighted) constraint and
#' stability-based objective extraction.
#'
#' Campello et al (2013) suggested a _Framework for Optimal Selection of
#' Clusters_ (FOSC) as a framework to make local (non-horizontal) cuts to any
#' cluster tree hierarchy. This function implements the original extraction
#' algorithms as described by the framework for hclust objects. Traditional
#' cluster extraction methods from hierarchical representations (such as
#' [hclust] objects) generally rely on global parameters or cutting values
#' which are used to partition a cluster hierarchy into a set of disjoint, flat
#' clusters. This is implemented in R in function [stats::cutree()].
#' Although such methods are widespread, using global parameter
#' settings are inherently limited in that they cannot capture patterns within
#' the cluster hierarchy at varying _local_ levels of granularity.
#'
#' Rather than partitioning a hierarchy based on the number of clusters one
#' expects to find (\eqn{k}) or based on some linkage distance threshold
#' (\eqn{H}), the FOSC proposes that the optimal clusters may exist at varying
#' distance thresholds in the hierarchy. To enable this idea, FOSC requires one
#' parameter (minPts) that represents _the minimum number of points that
#' constitute a valid cluster._ The first step of the FOSC algorithm is to
#' traverse the given cluster hierarchy divisively, recording new clusters at
#' each split if both branches contain at least minPts points. Splits where
#' one or both branches contain fewer than minPts points inherit the
#' parent cluster's identity. Note that using FOSC, due to the constraint that
#' minPts must be greater than or equal to 2, it is possible that the optimal
#' cluster solution chosen makes local cuts that render parent branches of
#' sizes less than minPts as noise, which are denoted as 0 in the final
#' solution.
#'
#' Traversing the original cluster tree using minPts creates a new, simplified
#' cluster tree that is then post-processed recursively to extract clusters
#' that maximize for each cluster \eqn{C_i}{Ci} the cost function
#'
#' \deqn{\max_{\delta_2, \dots, \delta_k} J = \sum\limits_{i=2}^{k} \delta_i
#' S(C_i)}{ J = \sum \delta S(Ci) for all i clusters, } where
#' \eqn{S(C_i)}{S(Ci)} is the stability-based measure as \deqn{ S(C_i) =
#' \sum_{x_j \in C_i}(\frac{1}{h_{min} (x_j, C_i)} - \frac{1}{h_{max} (C_i)})
#' }{ S(Ci) = \sum (1/Hmin(Xj, Ci) - 1/Hmax(Ci)) for all Xj in Ci.}
#'
#' \eqn{\delta_i}{\delta} represents an indicator function, which constrains
#' the solution space such that clusters must be disjoint (cannot assign more
#' than 1 label to each cluster). The measure \eqn{S(C_i)}{S(Ci)} used by FOSC
#' is an unsupervised validation measure based on the assumption that, if you
#' vary the linkage/distance threshold across all possible values, more
#' prominent clusters that survive over many threshold variations should be
#' considered as stronger candidates of the optimal solution. For this reason,
#' using this measure to detect clusters is referred to as an unsupervised,
#' _stability-based_ extraction approach. In some cases it may be useful
#' to enact _instance-level_ constraints that ensure the solution space
#' conforms to linkage expectations known _a priori_. This general idea of
#' using preliminary expectations to augment the clustering solution will be
#' referred to as _semisupervised clustering_. If constraints are given in
#' the call to `extractFOSC()`, the following alternative objective function
#' is maximized:
#'
#' \deqn{J = \frac{1}{2n_c}\sum\limits_{j=1}^n \gamma (x_j)}{J = 1/(2 * nc)
#' \sum \gamma(Xj)}
#'
#' \eqn{n_c}{nc} is the total number of constraints given and
#' \eqn{\gamma(x_j)}{\gamma(Xj)} represents the number of constraints involving
#' object \eqn{x_j}{Xj} that are satisfied. In the case of ties (such as
#' solutions where no constraints were given), the unsupervised solution is
#' used as a tiebreaker. See Campello et al (2013) for more details.
#'
#' As a third option, if one wishes to prioritize the degree at which the
#' unsupervised and semisupervised solutions contribute to the overall optimal
#' solution, the parameter \eqn{\alpha} can be set to enable the extraction of
#' clusters that maximize the `mixed` objective function
#'
#' \deqn{J = \alpha S(C_i) + (1 - \alpha) \gamma(C_i))}{J = \alpha S(Ci) + (1 -
#' \alpha) \gamma(Ci).}
#'
#' FOSC expects the pairwise constraints to be passed as either 1) an
#' \eqn{n(n-1)/2} vector of integers representing the constraints, where 1
#' represents should-link, -1 represents should-not-link, and 0 represents no
#' preference using the unsupervised solution (see below for examples).
#' Alternatively, if only a few constraints are needed, a named list
#' representing the (symmetric) adjacency list can be used, where the names
#' correspond to indices of the points in the original data, and the values
#' correspond to integer vectors of constraints (positive indices for
#' should-link, negative indices for should-not-link). Again, see the examples
#' section for a demonstration of this.
#'
#' The parameters to the input function correspond to the concepts discussed
#' above. The `minPts` parameter represents the minimum cluster size to
#' extract. The optional `constraints` parameter contains the pairwise,
#' instance-level constraints of the data. The optional `alpha` parameter
#' controls whether the mixed objective function is used (if `alpha` is
#' greater than 0). If the `validate_constraints` parameter is set to
#' true, the constraints are checked (and fixed) for symmetry (if point A has a
#' should-link constraint with point B, point B should also have the same
#' constraint). Asymmetric constraints are not supported.
#'
#' Unstable branch pruning was not discussed by Campello et al (2013); however,
#' in some data sets it may be the case that specific subbranches' scores are
#' significantly greater than those of sibling and parent branches, and thus sibling
#' branches should be considered as noise if their scores are cumulatively
#' lower than the parents'. This can happen in extremely nonhomogeneous data
#' sets, where there exist locally very stable branches surrounded by unstable
#' branches that contain more than `minPts` points.
#' `prune_unstable = TRUE` will remove the unstable branches.
#'
#' @family clustering functions
#'
#' @param x a valid [hclust] object created via [hclust()] or [hdbscan()].
#' @param constraints Either a list or matrix of pairwise constraints. If
#' missing, an unsupervised measure of stability is used to make local cuts and
#' extract the optimal clusters. See details.
#' @param alpha numeric; weight between \eqn{[0, 1]} for mixed-objective
#' semi-supervised extraction. Defaults to 0.
#' @param minPts numeric; Defaults to 2. Only needed if class-less noise is a
#' valid label in the model.
#' @param prune_unstable logical; should significantly unstable subtrees be
#' pruned? The default is `FALSE` for the original optimal extraction
#' framework (see Campello et al, 2013). See details for what `TRUE`
#' implies.
#' @param validate_constraints logical; should constraints be checked for
#' validity? See details for what are considered valid constraints.
#'
#' @returns A list with the elements:
#'
#' \item{cluster }{An integer vector with cluster assignments. Zero
#' indicates noise points (if any).}
#' \item{hc }{The original [hclust] object with additional list elements
#' `"stability"`, `"constraint"`, and `"total"`
#' for the \eqn{n - 1} cluster-wide objective scores from the extraction.}
#'
#' @author Matt Piekenbrock
#' @seealso [hclust()], [hdbscan()], [stats::cutree()]
#' @references Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Joerg
#' Sander (2013). A framework for semi-supervised and unsupervised optimal
#' extraction of clusters from hierarchies. _Data Mining and Knowledge
#' Discovery_ 27(3): 344-371.
#' \doi{10.1007/s10618-013-0311-4}
#' @keywords model clustering
#' @examples
#' data("moons")
#'
#' ## Regular HDBSCAN using stability-based extraction (unsupervised)
#' cl <- hdbscan(moons, minPts = 5)
#' cl$cluster
#'
#' ## Constraint-based extraction from the HDBSCAN hierarchy
#' ## (w/ stability-based tiebreaker (semisupervised))
#' cl_con <- extractFOSC(cl$hc, minPts = 5,
#'   constraints = list("12" = c(49, -47)))
#' cl_con$cluster
#'
#' ## Alternative formulation: Constraint-based extraction from the HDBSCAN hierarchy
#' ## (w/ stability-based tiebreaker (semisupervised)) using distance thresholds
#' dist_moons <- dist(moons)
#' cl_con2 <- extractFOSC(cl$hc, minPts = 5,
#'   constraints = ifelse(dist_moons < 0.1, 1L,
#'                 ifelse(dist_moons > 1, -1L, 0L)))
#'
#' cl_con2$cluster # same as the second example
#' @export
extractFOSC <-
  function(x,
    constraints,
    alpha = 0,
    minPts = 2L,
    prune_unstable = FALSE,
    validate_constraints = FALSE) {
    if (!inherits(x, "hclust"))
      stop("extractFOSC expects 'x' to be a valid hclust object.")

    # if constraints are given then they need to be a list, a matrix or a vector
    if (!(
      missing(constraints) ||
        is.list(constraints) ||
        is.matrix(constraints) ||
        is.numeric(constraints)
    ))
      stop("extractFOSC expects constraints to be either an adjacency list, an adjacency matrix, or a dist-style vector.")

    if (minPts < 2)
      stop("minPts must be at least 2.")
    if (alpha < 0 || alpha > 1)
      stop("alpha can only take values in [0, 1].")
    n <- nrow(x$merge) + 1L

    ## First step for both unsupervised and semisupervised - compute stability scores
    cl_tree <- computeStability(x, minPts)

    ## Unsupervised Extraction
    if (missing(constraints)) {
      cl_tree <- extractUnsupervised(cl_tree, prune_unstable)
    }
    ## Semi-supervised Extraction
    else {
      ## If given as adjacency-list form
      if (is.list(constraints)) {
        ## Checks for proper indexing, symmetry of constraints, etc.
        if (validate_constraints) {
          is_valid <- max(as.integer(names(constraints))) <= n
          is_valid <- is_valid &&
            all(vapply(constraints, function(ilc) all(abs(ilc) <= n), logical(1L)))
          if (!is_valid) {
            stop("Detected constraint indices not in the interval [1, n]")
          }
          constraints <- validateConstraintList(constraints, n)
        }
        cl_tree <-
          extractSemiSupervised(cl_tree, constraints, alpha, prune_unstable)
      }
      ## Flattened lower-triangular vector given (probably from a dist object), retrieve adjacency list form
      else if (is.vector(constraints)) {
        if (!all(constraints %in% c(-1, 0, 1))) {
          stop(
            "'extractFOSC' only accepts instance-level constraints. See ?extractFOSC for more details."
          )
        }
        ## Checks for proper integer labels, symmetry of constraints, length of vector, etc.
        if (validate_constraints) {
          if (length(constraints) != choose(n, 2))
            stop("Constraint vector must have length choose(n, 2).")
          constraints_list <-
            validateConstraintList(distToAdjacency(constraints, n), n)
        } else {
          constraints_list <- distToAdjacency(constraints, n)
        }
        cl_tree <-
          extractSemiSupervised(cl_tree, constraints_list, alpha, prune_unstable)
      }
      ## Full nxn adjacency-matrix given, give warning and retrieve adjacency list form
      else if (is.matrix(constraints)) {
        if (!all(constraints %in% c(-1, 0, 1))) {
          stop(
            "'extractFOSC' only accepts instance-level constraints. See ?extractFOSC for more details."
          )
        }
        if (!all(dim(constraints) == c(n, n))) {
          stop("Constraint matrix must be a square n x n matrix.")
        }
        warning(
          "Full nxn matrix given; extractFOSC does not support asymmetric relational constraints. Using lower triangular."
        )

        constraints <- constraints[lower.tri(constraints)]

        ## Checks for proper integer labels, symmetry of constraints, length of vector, etc.
        if (validate_constraints) {
          if (length(constraints) != choose(n, 2))
            stop("Constraint vector must have length choose(n, 2).")
          constraints_list <-
            validateConstraintList(distToAdjacency(constraints, n), n)
        } else {
          constraints_list <- distToAdjacency(constraints, n)
        }
        cl_tree <-
          extractSemiSupervised(cl_tree, constraints_list, alpha, prune_unstable)
      } else {
        stop(
          "'extractFOSC' doesn't know how to handle constraints of type ",
          class(constraints)
        )
      }
    }
    cl_track <- attr(cl_tree, "cl_tracker")
    stability_score <-
      vapply(cl_track, function(cid)
        cl_tree[[as.character(cid)]]$stability, numeric(1L))
    constraint_score <-
      vapply(cl_track, function(cid)
        cl_tree[[as.character(cid)]]$vscore %||% 0, numeric(1L))
    total_score <-
      vapply(cl_track, function(cid)
        cl_tree[[as.character(cid)]]$score %||% 0, numeric(1L))
    out <- append(
      x,
      list(
        cluster = cl_track,
        stability = stability_score,
        constraint = constraint_score,
        total = total_score
      )
    )
    extraction_type <-
      if (missing(constraints)) {
        "(w/ stability-based extraction)"
      } else if (alpha == 0) {
        "(w/ constraint-based extraction)"
      } else {
        "(w/ mixed-objective extraction)"
      }
    substrs <- strsplit(x$method, split = " \\(w\\/")[[1L]]
    out[["method"]] <-
      if (length(substrs) > 1)
        paste(substrs[[1]], extraction_type)
      else
        paste(out[["method"]], extraction_type)
    class(out) <- "hclust"
    return(list(cluster = attr(cl_tree, "cluster"), hc = out))
  }


================================================
FILE: R/frNN.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.


#' Find the Fixed Radius Nearest Neighbors
#'
#' This function uses a kd-tree to find the fixed radius nearest neighbors
#' (including distances) fast.
#'
#' If `x` is specified as a data matrix, then Euclidean distances and a fast
#' nearest neighbor lookup using a kd-tree are used.
#'
#' To create a frNN object from scratch, you need to supply at least the
#' elements `id` with a list of integer vectors with the nearest neighbor
#' ids for each point and `eps` (see below).
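#'
#' A minimal sketch of such a hand-built object (hypothetical neighbor ids for
#' three points with `eps = 0.5`):
#'
#' ```
#' nn <- structure(
#'   list(id = list(c(2L, 3L), c(1L), c(1L)), eps = 0.5),
#'   class = c("frNN", "NN"))
#' ```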
#'
#' **Self-matches:** Self-matches are not returned!
#'
#' @aliases frNN frnn print.frnn
#' @family NN functions
#'
#' @param x a data matrix, a dist object or a frNN object.
#' @param eps neighborhood radius.
#' @param query a data matrix with the points to query. If query is not
#' specified, the NN for all the points in `x` is returned. If query is
#' specified then `x` needs to be a data matrix.
#' @param sort sort the neighbors by distance? This is expensive and can be
#' done later using `sort()`.
#' @param search nearest neighbor search strategy (one of `"kdtree"`, `"linear"` or
#' `"dist"`).
#' @param bucketSize max size of the kd-tree leaves.
#' @param splitRule rule to split the kd-tree. One of `"STD"`, `"MIDPT"`, `"FAIR"`,
#' `"SL_MIDPT"`, `"SL_FAIR"` or `"SUGGEST"` (SL stands for sliding). `"SUGGEST"` uses
#' ANNs best guess.
#' @param approx use approximate nearest neighbors. All NN up to a distance of
#' `(1 + approx) * eps` may be used. Some actual NN may be omitted,
#' leading to spurious clusters and noise points. However, the algorithm will
#' enjoy a significant speedup.
#' @param decreasing sort in decreasing order?
#' @param ... further arguments
#'
#' @returns
#'
#' `frNN()` returns an object of class [frNN] (subclass of
#' [NN]) containing a list with the following components:
#' \item{id }{a list of
#' integer vectors. Each vector contains the ids (row numbers) of the fixed radius nearest
#' neighbors. }
#' \item{dist }{a list with distances (same structure as
#' `id`). }
#' \item{eps }{ neighborhood radius `eps` that was used. }
#' \item{metric }{ used distance metric. }
#'
#' `adjacencylist()` returns a list with one entry per data point in `x`. Each entry
#' contains the ids of the nearest neighbors.
#'
#' @author Michael Hahsler
#'
#' @references David M. Mount and Sunil Arya (2010). ANN: A Library for
#' Approximate Nearest Neighbor Searching,
#' \url{http://www.cs.umd.edu/~mount/ANN/}.
#' @keywords model
#' @examples
#' data(iris)
#' x <- iris[, -5]
#'
#' # Example 1: Find fixed radius nearest neighbors for each point
#' nn <- frNN(x, eps = .5)
#' nn
#'
#' # Number of neighbors
#' hist(lengths(adjacencylist(nn)),
#'   xlab = "k", main="Number of Neighbors",
#'   sub = paste("Neighborhood size eps =", nn$eps))
#'
#' # Explore neighbors of point i = 10
#' i <- 10
#' nn$id[[i]]
#' nn$dist[[i]]
#' plot(x, col = ifelse(seq_len(nrow(iris)) %in% nn$id[[i]], "red", "black"))
#'
#' # get an adjacency list
#' head(adjacencylist(nn))
#'
#' # plot the fixed radius neighbors (and then reduced to a radius of .3)
#' plot(nn, x)
#' plot(frNN(nn, eps = .3), x)
#'
#' ## Example 2: find fixed-radius NN for query points
#' q <- x[c(1,100),]
#' nn <- frNN(x, eps = .5, query = q)
#'
#' plot(nn, x, col = "grey")
#' points(q, pch = 3, lwd = 2)
#' @export frNN
frNN <-
  function(x,
    eps,
    query = NULL,
    sort = TRUE,
    search = "kdtree",
    bucketSize = 10,
    splitRule = "suggest",
    approx = 0) {
    if (is.null(eps) ||
        is.na(eps) || eps < 0)
      stop("eps needs to be >=0.")

    if (inherits(x, "frNN")) {
      if (x$eps < eps)
        stop("frNN in x does not have a sufficient eps radius.")

      for (i in seq_along(x$dist)) {
        take <- x$dist[[i]] <= eps
        x$dist[[i]] <- x$dist[[i]][take]
        x$id[[i]] <- x$id[[i]][take]
      }
      x$eps <- eps

      return(x)
    }

    search <- .parse_search(search)
    splitRule <- .parse_splitRule(splitRule)

    ### dist search
    if (search == 3 && !inherits(x, "dist")) {
      if (.matrixlike(x))
        x <- dist(x)
      else
        stop("x needs to be a matrix to calculate distances")
    }

    ### get frNN from a dist object in R
    if (inherits(x, "dist")) {
      if (!is.null(query))
        stop("query can only be used if x contains the data.")

      if (anyNA(x))
        stop("distances cannot contain NAs for frNN!")

      return(dist_to_frNN(x, eps = eps, sort = sort))
    }

    ## make sure x is numeric
    if (!.matrixlike(x))
      stop("x needs to be a matrix or a data.frame.")
    x <- as.matrix(x)
    if (storage.mode(x) == "integer")
      storage.mode(x) <- "double"
    if (storage.mode(x) != "double")
      stop("all data in x has to be numeric.")

    if (!is.null(query)) {
      if (!.matrixlike(query))
        stop("query needs to be a matrix or a data.frame.")
      query <- as.matrix(query)
      if (storage.mode(query) == "integer")
        storage.mode(query) <- "double"
      if (storage.mode(query) != "double")
        stop("query has to be NULL or a numeric matrix or data.frame.")
      if (ncol(x) != ncol(query))
        stop("x and query need to have the same number of columns!")
    }

    if (anyNA(x))
      stop("data/distances cannot contain NAs for frNN (with kd-tree)!")

    ## returns NO self matches
    if (!is.null(query)) {
      ret <-
        frNN_query_int(
          as.matrix(x),
          as.matrix(query),
          as.double(eps),
          as.integer(search),
          as.integer(bucketSize),
          as.integer(splitRule),
          as.double(approx)
        )
      names(ret$dist) <- rownames(query)
      names(ret$id) <- rownames(query)
      ret$metric <- "euclidean"
    } else {
      ret <- frNN_int(
        as.matrix(x),
        as.double(eps),
        as.integer(search),
        as.integer(bucketSize),
        as.integer(splitRule),
        as.double(approx)
      )
      names(ret$dist) <- rownames(x)
      names(ret$id) <- rownames(x)
      ret$metric <- "euclidean"
    }

    ret$eps <- eps
    ret$sort <- FALSE
    class(ret) <- c("frNN", "NN")

    if (sort)
      ret <- sort.frNN(ret)

    ret
  }

# extract a row from a distance matrix without doubling space requirements
dist_row <- function(x, i, self_val = 0) {
  n <- attr(x, "Size")

  # build index pairs (i, j) for all j and swap so that i < j, since a dist
  # object stores only the lower triangle column-wise
  i <- rep(i, times = n)
  j <- seq_len(n)
  swap_idx <- i > j
  tmp <- i[swap_idx]
  i[swap_idx] <- j[swap_idx]
  j[swap_idx] <- tmp

  # linear index of element (i, j) with i < j in the dist vector;
  # the diagonal is not stored, so mark it with NA
  diag_idx <- i == j
  idx <- n * (i - 1) - i * (i - 1) / 2 + j - i
  idx[diag_idx] <- NA

  val <- x[idx]
  val[diag_idx] <- self_val
  val
}

dist_to_frNN <- function(x, eps, sort = FALSE) {
  .check_dist(x)

  n <- attr(x, "Size")

  id <- list()
  d <- list()

  for (i in seq_len(n)) {
    ### Inf -> no self-matches
    y <- dist_row(x, i, self_val = Inf)
    o <- which(y <= eps)
    id[[i]] <- o
    d[[i]] <- y[o]
  }
  names(id) <- labels(x)
  names(d) <- labels(x)

  ret <-
    structure(list(
      dist = d,
      id = id,
      eps = eps,
      metric = attr(x, "method"),
      sort = FALSE
    ),
      class = c("frNN", "NN"))

  if (sort)
    ret <- sort.frNN(ret)

  return(ret)
}

#' @rdname frNN
#' @export
sort.frNN <- function(x, decreasing = FALSE, ...) {
  if (isTRUE(x$sort))
    return(x)
  if (is.null(x$dist))
    stop("Unable to sort. Distances are missing.")

  ## FIXME: This is slow; do it in C++.
  n <- names(x$id)

  o <- lapply(
    seq_along(x$dist),
    FUN =
      function(i)
        order(x$dist[[i]], x$id[[i]], decreasing = decreasing)
  )
  x$dist <-
    lapply(
      seq_along(o),
      FUN = function(p)
        x$dist[[p]][o[[p]]]
    )
  x$id <- lapply(
    seq_along(o),
    FUN = function(p)
      x$id[[p]][o[[p]]]
  )

  names(x$dist) <- n
  names(x$id) <- n

  x$sort <- TRUE

  x
}

#' @rdname frNN
#' @export
adjacencylist.frNN <- function(x, ...)
  x$id

#' @rdname frNN
#' @export
print.frNN <- function(x, ...) {
  cat(
    "fixed radius nearest neighbors for ",
    length(x$id),
    " objects (eps=",
    x$eps,
    ").",
    "\n",
    sep = ""
  )

  cat("Distance metric:", x$metric, "\n")
  cat("\nAvailable fields: ", toString(names(x)), "\n", sep = "")
}


================================================
FILE: R/hdbscan.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler, Matt Piekenbrock

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Hierarchical DBSCAN (HDBSCAN)
#'
#' Fast C++ implementation of the HDBSCAN (Hierarchical DBSCAN) and its related
#' algorithms.
#'
#' This fast implementation of HDBSCAN (Campello et al., 2013) computes the
#' hierarchical cluster tree representing density estimates along with the
#' stability-based flat cluster extraction. HDBSCAN essentially computes the
#' hierarchy of all DBSCAN* clusterings, and
#' then uses a stability-based extraction method to find optimal cuts in the
#' hierarchy, thus producing a flat solution.
#'
#' HDBSCAN performs the following steps:
#'
#' 1. Compute mutual reachability distance mrd between points
#'    (based on distances and core distances).
#' 2. Use mrd as a distance measure to construct a minimum spanning tree.
#' 3. Prune the tree using stability.
#' 4. Extract the clusters.
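#'
#' The distances used in step 1 can be inspected directly with the exported
#' helpers `coredist()` and `mrdist()`; steps 2 to 4 happen internally in
#' `hdbscan()`. A minimal sketch:
#'
#' ```
#' data(moons)
#' cd  <- coredist(moons, minPts = 5)  # core distances (step 1)
#' mrd <- mrdist(moons, minPts = 5)    # mutual reachability distances (step 1)
#' res <- hdbscan(moons, minPts = 5)   # runs steps 1-4
#' ```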
#'
#' Additional related algorithms are supported: the "Global-Local Outlier Score
#' from Hierarchies" (GLOSH; see section 6 of Campello et al., 2015)
#' is available in function [glosh()],
#' and clustering based on instance-level constraints (see
#' section 5.3 of Campello et al., 2015) is also possible. These algorithms only
#' need the parameter `minPts`.
#'
#' Note that `minPts` not only acts as a minimum cluster size to detect,
#' but also as a "smoothing" factor of the density estimates implicitly
#' computed by HDBSCAN.
#'
#' When using the optional parameter `cluster_selection_epsilon`,
#' a combination between DBSCAN* and HDBSCAN* can be achieved
#' (see Malzer & Baum 2020). This means that part of the
#' tree is affected by `cluster_selection_epsilon` as if
#' running DBSCAN* with `eps` = `cluster_selection_epsilon`.
#' The remaining part (on levels above the threshold) is still
#' processed by HDBSCAN*'s stability-based selection algorithm
#' and can therefore return clusters of variable densities.
#' Note that there is not always a remaining part, especially if
#' the parameter value is chosen too large, or if there aren't
#' enough clusters of variable densities. In this case, the result
#' will be equal to DBSCAN*.
#' `cluster_selection_epsilon` is especially useful for cases
#' where HDBSCAN* produces too many small clusters that
#' need to be merged, while still being able to extract clusters
#' of variable densities at higher levels.
#'
#' `coredist()`: The core distance is defined for each point as
#' the distance to its `minPts - 1` nearest neighbor.
#' It is a density estimate equivalent to `kNNdist()` with `k = minPts - 1`.
#'
#' `mrdist()`: The mutual reachability distance between two points is defined as
#' `mrd(a, b) = max(coredist(a), coredist(b), dist(a, b))`. This distance measure is used by
#' HDBSCAN. It has the effect of increasing distances in low density areas.
#' For example, if `coredist(a) = 0.4`, `coredist(b) = 0.6`, and
#' `dist(a, b) = 0.5`, then `mrd(a, b) = max(0.4, 0.6, 0.5) = 0.6`.
#'
#' `predict()` assigns each new data point to the same cluster as the nearest point
#' if it is not more than that point's core distance away. Otherwise the new point
#' is classified as a noise point (i.e., cluster ID 0).
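#'
#' A rough sketch of this assignment rule for a single new point, using
#' [kNN()] to find the nearest training point (for illustration only; the
#' actual implementation is the package's `predict()` method):
#'
#' ```
#' res <- hdbscan(moons, minPts = 5)
#' cd <- coredist(moons, minPts = 5)
#' newpoint <- moons[1, , drop = FALSE]
#' nn <- kNN(moons, k = 1, query = newpoint)
#' ## same cluster as the nearest point if within its core distance, else noise (0)
#' cl <- if (nn$dist[1] <= cd[nn$id[1]]) res$cluster[nn$id[1]] else 0L
#' ```
#'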
#' @aliases hdbscan HDBSCAN print.hdbscan
#'
#' @family HDBSCAN functions
#' @family clustering functions
#'
#' @param x a data matrix (Euclidean distances are used) or a [dist] object
#' calculated with an arbitrary distance metric.
#' @param minPts integer; Minimum size of clusters. See details.
#' @param cluster_selection_epsilon double; a distance threshold below which
#' no clusters should be selected (see Malzer & Baum 2020).
#' @param gen_hdbscan_tree logical; should the robust single linkage tree be
#' explicitly computed (see cluster tree in Chaudhuri et al, 2010).
#' @param gen_simplified_tree logical; should the simplified hierarchy be
#' explicitly computed (see Campello et al, 2013).
#' @param verbose report progress.
#' @param ...  additional arguments are passed on.
#' @param scale integer; used to scale the condensed tree based on the graphics
#' device. A lower scale results in wider colored tree lines.
#' The default `'suggest'` sets scale to the number of clusters.
#' @param gradient character vector; the colors to build the condensed tree
#' coloring with.
#' @param show_flat logical; whether to draw boxes indicating the most stable
#' clusters.
#' @param coredist numeric vector with precomputed core distances (optional).
#'
#' @return `hdbscan()` returns an object of class `hdbscan` with the following components:
#' \item{cluster }{An integer vector with cluster assignments. Zero indicates
#' noise points.}
#' \item{minPts }{ value of the `minPts` parameter.}
#' \item{cluster_scores }{The sum of the stability scores for each salient
#' (flat) cluster. Corresponds to the cluster IDs given in the `"cluster"` element.
#' }
#' \item{membership_prob }{The probability or individual stability of a
#' point within its clusters. Between 0 and 1.}
#' \item{outlier_scores }{The GLOSH outlier score of each point. }
#' \item{hc }{An [hclust] object of the HDBSCAN hierarchy. }
#'
#' `coredist()` returns a vector with the core distance for each data point.
#'
#' `mrdist()` returns a [dist] object containing pairwise mutual reachability distances.
#'
#' @author Matt Piekenbrock
#' @author Claudia Malzer (added cluster_selection_epsilon)
#'
#' @references
#' Campello RJGB, Moulavi D, Sander J (2013). Density-Based Clustering Based on
#' Hierarchical Density Estimates. Proceedings of the 17th Pacific-Asia
#' Conference on Knowledge Discovery in Databases, PAKDD 2013, _Lecture Notes
#' in Computer Science_ 7819, p. 160.
#' \doi{10.1007/978-3-642-37456-2_14}
#'
#' Campello RJGB, Moulavi D, Zimek A, Sander J (2015). Hierarchical density
#' estimates for data clustering, visualization, and outlier detection.
#' _ACM Transactions on Knowledge Discovery from Data (TKDD),_ 10(5):1-51.
#' \doi{10.1145/2733381}
#'
#' Malzer, C., & Baum, M. (2020). A Hybrid Approach To Hierarchical
#' Density-based Cluster Selection.
#' In 2020 IEEE International Conference on Multisensor Fusion
#' and Integration for Intelligent Systems (MFI), pp. 223-228.
#' \doi{10.1109/MFI49285.2020.9235263}
#' @keywords model clustering hierarchical
#' @examples
#' ## cluster the moons data set with HDBSCAN
#' data(moons)
#'
#' res <- hdbscan(moons, minPts = 5)
#' res
#'
#' plot(res)
#' clplot(moons, res)
#'
#' ## cluster the moons data set with HDBSCAN using Manhattan distances
#' res <- hdbscan(dist(moons, method = "manhattan"), minPts = 5)
#' plot(res)
#' clplot(moons, res)
#'
#' ## Example for HDBSCAN(e) using cluster_selection_epsilon
#' # data with clusters of various densities.
#' X <- data.frame(
#'  x = c(
#'   0.08, 0.46, 0.46, 2.95, 3.50, 1.49, 6.89, 6.87, 0.21, 0.15,
#'   0.15, 0.39, 0.80, 0.80, 0.37, 3.63, 0.35, 0.30, 0.64, 0.59, 1.20, 1.22,
#'   1.42, 0.95, 2.70, 6.36, 6.36, 6.36, 6.60, 0.04, 0.71, 0.57, 0.24, 0.24,
#'   0.04, 0.04, 1.35, 0.82, 1.04, 0.62, 0.26, 5.98, 1.67, 1.67, 0.48, 0.15,
#'   6.67, 6.67, 1.20, 0.21, 3.99, 0.12, 0.19, 0.15, 6.96, 0.26, 0.08, 0.30,
#'   1.04, 1.04, 1.04, 0.62, 0.04, 0.04, 0.04, 0.82, 0.82, 1.29, 1.35, 0.46,
#'   0.46, 0.04, 0.04, 5.98, 5.98, 6.87, 0.37, 6.47, 6.47, 6.47, 6.67, 0.30,
#'   1.49, 3.21, 3.21, 0.75, 0.75, 0.46, 0.46, 0.46, 0.46, 3.63, 0.39, 3.65,
#'   4.09, 4.01, 3.36, 1.43, 3.28, 5.94, 6.35, 6.87, 5.60, 5.99, 0.12, 0.00,
#'   0.32, 0.39, 0.00, 1.63, 1.36, 5.67, 5.60, 5.79, 1.10, 2.99, 0.39, 0.18
#'   ),
#'  y = c(
#'   7.41, 8.01, 8.01, 5.44, 7.11, 7.13, 1.83, 1.83, 8.22, 8.08,
#'   8.08, 7.20, 7.83, 7.83, 8.29, 5.99, 8.32, 8.22, 7.38, 7.69, 8.22, 7.31,
#'   8.25, 8.39, 6.34, 0.16, 0.16, 0.16, 1.66, 7.55, 7.90, 8.18, 8.32, 8.32,
#'   7.97, 7.97, 8.15, 8.43, 7.83, 8.32, 8.29, 1.03, 7.27, 7.27, 8.08, 7.27,
#'   0.79, 0.79, 8.22, 7.73, 6.62, 7.62, 8.39, 8.36, 1.73, 8.29, 8.04, 8.22,
#'   7.83, 7.83, 7.83, 8.32, 8.11, 7.69, 7.55, 7.20, 7.20, 8.01, 8.15, 7.55,
#'   7.55, 7.97, 7.97, 1.03, 1.03, 1.24, 7.20, 0.47, 0.47, 0.47, 0.79, 8.22,
#'   7.13, 6.48, 6.48, 7.10, 7.10, 8.01, 8.01, 8.01, 8.01, 5.99, 8.04, 5.22,
#'   5.82, 5.14, 4.81, 7.62, 5.73, 0.55, 1.31, 0.05, 0.95, 1.59, 7.99, 7.48,
#'   8.38, 7.12, 2.01, 1.40, 0.00, 9.69, 9.47, 9.25, 2.63, 6.89, 0.56, 3.11
#'  )
#' )
#'
#' ## HDBSCAN splits one cluster
#' hdb <- hdbscan(X, minPts = 3)
#' plot(hdb, show_flat = TRUE)
#' hullplot(X, hdb, main = "HDBSCAN")
#'
#' ## DBSCAN* marks the least dense cluster as outliers
#' db <- dbscan(X, eps = 1, minPts = 3, borderPoints = FALSE)
#' hullplot(X, db, main = "DBSCAN*")
#'
#' ## HDBSCAN(e) mixes HDBSCAN AND DBSCAN* to find all clusters
#' hdbe <- hdbscan(X, minPts = 3, cluster_selection_epsilon = 1)
#' plot(hdbe, show_flat = TRUE)
#' hullplot(X, hdbe, main = "HDBSCAN(e)")
#' @export
hdbscan <- function(x,
                    minPts,
                    cluster_selection_epsilon = 0.0,
                    gen_hdbscan_tree = FALSE,
                    gen_simplified_tree = FALSE,
                    verbose = FALSE) {
  if (!inherits(x, "dist") && !.matrixlike(x)) {
    stop("hdbscan expects a numeric matrix or a dist object.")
  }

  ## 1. Calculate the mutual reachability between points
  if (verbose) {
    cat("Calculating core distances...\n")
  }
  coredist <- coredist(x, minPts)


  if (verbose) {
    cat("Calculating the mutual reachability matrix distances...\n")
  }
  mrd <- mrdist(x, minPts, coredist = coredist)
  n <- attr(mrd, "Size")

  ## 2. Construct a minimum spanning tree and convert to RSL representation
  if (verbose) {
    cat("Constructing the minimum spanning tree...\n")
  }
  mst <- mst(mrd, n)
  hc <- hclustMergeOrder(mst, order(mst[, 3]))
  hc$call <- match.call()

  ## 3. Prune the tree
  ## Process the hierarchy to retrieve all the necessary info needed by HDBSCAN
  if (verbose) {
    cat("Tree pruning...\n")
  }
  res <- computeStability(hc, minPts, compute_glosh = TRUE)
  res <- extractUnsupervised(res, cluster_selection_epsilon = cluster_selection_epsilon)
  cl <- attr(res, "cluster")

  ## 4. Extract the clusters
  if (verbose) {
    cat("Extract clusters...\n")
  }
  sl <- attr(res, "salient_clusters")

  ## Generate membership 'probabilities' using core distance as the measure of density
  prob <- rep(0, length(cl))
  for (cid in sl) {
    max_f <- max(coredist[which(cl == cid)])
    pr <- (max_f - coredist[which(cl == cid)]) / max_f
    prob[cl == cid] <- pr
  }

  ## Match cluster assignments to be incremental, with 0 representing noise
  if (any(cl == 0)) {
    cluster <- match(cl, c(0, sl)) - 1
  } else {
    cluster <- match(cl, sl)
  }
  cl_map <-
    structure(sl, names = unique(cluster[hc$order][cluster[hc$order] != 0]))

  ## Stability scores
  ## NOTE: These scores represent the stability scores -before- the hierarchy traversal
  cluster_scores <-
    vapply(sl, function(sl_cid) {
      res[[as.character(sl_cid)]]$stability
    }, numeric(1L))
  names(cluster_scores) <- names(cl_map)

  ## Return everything HDBSCAN does
  attr(res, "cl_map") <-
    cl_map # Mapping of hierarchical IDS to 'normalized' incremental ids
  out <- structure(
    list(
      cluster = cluster,
      minPts = minPts,
      coredist = coredist,
      cluster_scores = cluster_scores,
      # (Cluster-wide cumulative) Stability Scores
      membership_prob = prob,
      # Individual point membership probabilities
      outlier_scores = attr(res, "glosh"),
      # Outlier Scores
      hc = hc # Hclust object of MST (can be cut for quick assignments)
    ),
    class = "hdbscan",
    hdbscan = res
  ) # hdbscan attributes contains actual HDBSCAN hierarchy

  ## The trees don't need to be explicitly computed, but they may be useful if the user wants them
  if (gen_hdbscan_tree) {
    out$hdbscan_tree <- buildDendrogram(hc)
  }
  if (gen_simplified_tree) {
    out$simplified_tree <- simplifiedTree(res)
  }
  return(out)
}

#' @rdname hdbscan
#' @export
print.hdbscan <- function(x, ...) {
  writeLines(c(
    paste0("HDBSCAN clustering for ", nobs(x), " objects."),
    paste0("Parameters: minPts = ", x$minPts),
    paste0(
      "The clustering contains ",
      ncluster(x),
      " cluster(s) and ",
      nnoise(x),
      " noise points."
    )
  ))

  print(table(x$cluster))
  cat("\n")
  writeLines(strwrap(paste0("Available fields: ", toString(names(
    x
  ))), exdent = 18))
}

#' @rdname hdbscan
#' @param leaflab a string specifying how leaves are labeled (see [stats::plot.dendrogram()]).
#' @param ylab the label for the y axis.
#' @param main Title of the plot.
#' @export
plot.hdbscan <-
  function(x,
           scale = "suggest",
           gradient = c("yellow", "red"),
           show_flat = FALSE,
           main = "HDBSCAN*",
           ylab = "eps value",
           leaflab = "none",
           ...) {
    ## Logic checks
    if (!(scale == "suggest" ||
          scale > 0)) {
      stop("scale parameter must be greater than 0.")
    }

    ## Main information needed
    hd_info <- attr(x, "hdbscan")
    dend <- x$simplified_tree %||% simplifiedTree(hd_info)
    coords <-
      node_xy(hd_info, cl_hierarchy = attr(hd_info, "cl_hierarchy"))

    ## Variables to help setup the scaling of the plotting
    nclusters <- length(hd_info)
    npoints <- length(x$cluster)
    nleaves <-
      length(all_children(
        attr(hd_info, "cl_hierarchy"),
        key = 0,
        leaves_only = TRUE
      ))

    scale <- if (scale == "suggest") nclusters else nclusters / scale

    ## Color variables
    col_breaks <- seq(0, length(x$cluster) + nclusters, by = nclusters)
    gcolors <- grDevices::colorRampPalette(gradient)(length(col_breaks))

    ## Depth-first search to recursively plot rectangles
    eps_dfs <- function(dend, index, parent_height, scale) {
      coord <- coords[index, ]
      cl_key <- as.character(attr(dend, "label"))

      ## widths == number of points in the cluster at each eps it was alive
      widths <-
        vapply(sort(hd_info[[cl_key]]$eps, decreasing = TRUE), function(eps) {
          sum(hd_info[[cl_key]]$eps <= eps)
        }, numeric(1L))
      if (length(widths) > 0) {
        widths <- c(widths + hd_info[[cl_key]]$n_children,
                    rep(hd_info[[cl_key]]$n_children, hd_info[[cl_key]]$n_children))
      } else {
        widths <-
          rep(hd_info[[cl_key]]$n_children, hd_info[[cl_key]]$n_children)
      }

      ## Normalize and scale widths to length of x-axis
      normalize <- function(x) {
        (nleaves) * (x - 1) / (npoints - 1)
      }
      xleft <- coord[[1]] - normalize(widths) / scale
      xright <- coord[[1]] + normalize(widths) / scale

      ## Top is always parent height, bottom is when the points died
      ## Minor adjustment made if at the root equivalent to plot.dendrogram(edge.root=T)
      if (cl_key == "0") {
        ytop <-
          rep(hd_info[[cl_key]]$eps_birth + 0.0625 * hd_info[[cl_key]]$eps_birth,
              length(widths))
        ybottom <- rep(hd_info[[cl_key]]$eps_death, length(widths))
      } else {
        ytop <- rep(parent_height, length(widths))
        ybottom <-
          c(
            sort(hd_info[[cl_key]]$eps, decreasing = TRUE),
            rep(hd_info[[cl_key]]$eps_death, hd_info[[cl_key]]$n_children)
          )
      }

      ## Draw the rectangles
      rect_color <-
        gcolors[.bincode(length(widths), breaks = col_breaks)]
      graphics::rect(
        xleft = xleft,
        xright = xright,
        ybottom = ybottom,
        ytop = ytop,
        col = rect_color,
        border = NA,
        lwd = 0
      )

      ## Highlight the most 'stable' clusters returned by the default flat cluster extraction
      if (show_flat) {
        salient_cl <- attr(hd_info, "salient_clusters")
        if (as.integer(attr(dend, "label")) %in% salient_cl) {
          x_adjust <-
            (max(xright) - min(xleft)) * 0.10 # 10% left/right border
          y_adjust <-
            (max(ytop) - min(ybottom)) * 0.025 # 2.5% above/below border
          graphics::rect(
            xleft = min(xleft) - x_adjust,
            xright = max(xright) + x_adjust,
            ybottom = min(ybottom) - y_adjust,
            ytop = max(ytop) + y_adjust,
            border = "red",
            lwd = 1
          )
          n_label <-
            names(which(attr(hd_info, "cl_map") == attr(dend, "label")))
          text(
            x = coord[[1]],
            y = min(ybottom),
            pos = 1,
            labels = n_label
          )
        }
      }

      ## Recurse in depth-first manner
      if (is.leaf(dend)) {
        return(index)
      } else {
        left <-
          eps_dfs(
            dend[[1]],
            index = index + 1,
            parent_height = attr(dend, "height"),
            scale = scale
          )
        right <-
          eps_dfs(
            dend[[2]],
            index = left + 1,
            parent_height = attr(dend, "height"),
            scale = scale
          )
        return(right)
      }
    }

    ## Run the recursive plotting
    plot(
      dend,
      edge.root = TRUE,
      main = main,
      ylab = ylab,
      leaflab = leaflab,
      ...
    )
    eps_dfs(dend,
            index = 1,
            parent_height = 0,
            scale = scale)
    return(invisible(x))
  }

#' @rdname hdbscan
#' @export
coredist <- function(x, minPts)
  kNNdist(x, k = minPts - 1)

#' @rdname hdbscan
#' @export
mrdist <- function(x, minPts, coredist = NULL) {
  if (inherits(x, "dist")) {
    .check_dist(x)
    x_dist <- x
  } else {
    x_dist <- dist(x,
                   method = "euclidean",
                   diag = FALSE,
                   upper = FALSE)
  }

  if (is.null(coredist)) {
    coredist <- coredist(x, minPts)
  }

  # mr_dist <- as.vector(pmax(as.dist(outer(coredist, coredist, pmax)), x_dist))
  # much faster in C++
  mr_dist <- mrd(x_dist, coredist)
  class(mr_dist) <- "dist"
  attr(mr_dist, "Size") <- attr(x_dist, "Size")
  attr(mr_dist, "Diag") <- FALSE
  attr(mr_dist, "Upper") <- FALSE
  attr(mr_dist, "method") <- paste0("mutual reachability (", attr(x_dist, "method"), ")")
  mr_dist
}
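
## Example sketch (not run; uses iris purely for illustration): coredist()
## and mrdist() together give the mutual reachability distance
## max(coredist(a), coredist(b), d(a, b)); single-linkage clustering on it
## is the basis of HDBSCAN.
# x <- as.matrix(iris[, 1:4])
# cd <- coredist(x, minPts = 5)  # distance to each point's 4th nearest neighbor
# mr <- mrdist(x, minPts = 5, coredist = cd)
# hc <- hclust(mr, method = "single")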


================================================
FILE: R/hullplot.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Plot Clusters
#'
#' This function produces a two-dimensional scatter plot of data points
#' and colors the data points according to a supplied clustering. Noise points
#' are marked as `x`. `hullplot()` also adds convex hulls to clusters.
#'
#' @name hullplot
#' @aliases hullplot clplot
#'
#' @param x a data matrix. If more than 2 columns are provided, then the data
#' is plotted using the first two principal components.
#' @param cl a clustering. Either a numeric cluster assignment vector or a
#' clustering object (a list with an element named `cluster`).
#' @param col colors used for clusters. Defaults to the standard palette.  The
#' first color (default is black) is used for noise/unassigned points (cluster
#' id 0).
#' @param pch a vector of plotting characters. By default `o` is used for
#'   points and `x` for noise points.
#' @param cex expansion factor for symbols.
#' @param hull_lwd,hull_lty line width and line type used for the convex hull.
#' @param main main title.
#' @param solid,alpha draw filled polygons instead of just lines for the convex
#' hulls? alpha controls the level of alpha shading.
#' @param ...  additional arguments passed on to plot.
#' @author Michael Hahsler
#' @keywords plot clustering
#' @examples
#' set.seed(2)
#' n <- 400
#'
#' x <- cbind(
#'   x = runif(4, 0, 1) + rnorm(n, sd = 0.1),
#'   y = runif(4, 0, 1) + rnorm(n, sd = 0.1)
#'   )
#' cl <- rep(1:4, times = 100)
#'
#'
#' ### original data with true clustering
#' clplot(x, cl, main = "True clusters")
#' hullplot(x, cl, main = "True clusters")
#' ### use different symbols
#' hullplot(x, cl, main = "True clusters", pch = cl)
#' ### just the hulls
#' hullplot(x, cl, main = "True clusters", pch = NA)
#' ### a version suitable for b/w printing
#' hullplot(x, cl, main = "True clusters", solid = FALSE,
#'   col = c("grey", "black"), pch = cl)
#'
#'
#' ### run some clustering algorithms and plot the results
#' db <- dbscan(x, eps = .07, minPts = 10)
#' clplot(x, db, main = "DBSCAN")
#' hullplot(x, db, main = "DBSCAN")
#'
#' op <- optics(x, eps = 10, minPts = 10)
#' opDBSCAN <- extractDBSCAN(op, eps_cl = .07)
#' hullplot(x, opDBSCAN, main = "OPTICS")
#'
#' opXi <- extractXi(op, xi = 0.05)
#' hullplot(x, opXi, main = "OPTICSXi")
#'
#' # Extract minimal 'flat' clusters only
#' opXi <- extractXi(op, xi = 0.05, minimum = TRUE)
#' hullplot(x, opXi, main = "OPTICSXi")
#'
#' km <- kmeans(x, centers = 4)
#' hullplot(x, km, main = "k-means")
#'
#' hc <- cutree(hclust(dist(x)), k = 4)
#' hullplot(x, hc, main = "Hierarchical Clustering")
#' @export
hullplot <- function(x,
  cl,
  col = NULL,
  pch = NULL,
  cex = 0.5,
  hull_lwd = 1,
  hull_lty = 1,
  solid = TRUE,
  alpha = .2,
  main = "Convex Cluster Hulls",
  ...) {
  ### handle d>2 by using PCA
  if (ncol(x) > 2)
    x <- prcomp(x)$x

  ### extract clustering (keep hierarchical OPTICSXi structure)
  if (inherits(cl, "optics") || "clusters_xi" %in% names(cl)) {
    clusters_xi <- cl$clusters_xi
    cl_order <- cl$order
  } else
    clusters_xi <- NULL

  if (is.list(cl))
    cl <- cl$cluster
  if (!is.numeric(cl))
    stop("Could not get cluster assignment vector from cl.")

  #if(is.null(col)) col <- c("#000000FF", rainbow(n=max(cl)))
  if (is.null(col))
    col <- palette()

  # Note: We use the first color for noise points
  if (length(col) == 1L)
    col <- c(col, col)
  col_noise <- col[1]
  col <- col[-1]


  if (max(cl) > length(col)) {
    warning("Not enough colors. Some colors will be reused.")
    col <- rep(col, length.out = max(cl))
  }

  # mark noise points
  pch <- pch %||% ifelse(cl == 0L, 4L, 1L)

  plot(x[, 1:2],
    col = c(col_noise, col)[cl + 1L],
    pch = pch,
    cex = cex,
    main = main,
    ...)

  col_poly <- adjustcolor(col, alpha.f = alpha)
  border <- col

  ## no border?
  if (is.null(hull_lwd) || is.na(hull_lwd) || hull_lwd == 0) {
    hull_lwd <- 1
    border <- NA
  }

  if (!is.null(clusters_xi)) {
    ## This is necessary for larger datasets: Ensure largest is plotted first
    clusters_xi <-
      clusters_xi[order(-(clusters_xi$end - clusters_xi$start)), ] # Order by size (descending)
    ci_order <- clusters_xi$cluster_id
  } else {
    ci_order <- 1:max(cl)
  }

  for (i in seq_along(ci_order)) {
    ### use all the points for OPTICSXi's hierarchical structure
    if (is.null(clusters_xi)) {
      d <- x[cl == i, , drop = FALSE]
    } else {
      d <-
        x[cl_order[clusters_xi$start[i]:clusters_xi$end[i]], , drop = FALSE]
    }

    ch <- chull(d)
    ch <- c(ch, ch[1])
    if (!solid) {
      lines(d[ch, ],
            col = border[ci_order[i]],
            lwd = hull_lwd,
            lty = hull_lty)
    } else {
      polygon(
        d[ch, ],
        col = col_poly[ci_order[i]],
        lwd = hull_lwd,
        lty = hull_lty,
        border = border[ci_order[i]]
      )
    }
  }
}

#' @rdname hullplot
#' @export
clplot <- function(x,
                   cl,
                   col = NULL,
                   pch = NULL,
                   cex = 0.5,
                   main = "Cluster Plot",
                   ...)
  hullplot(x, cl = cl, col = col, pch = pch, cex = cex, main = main,
          solid = FALSE, hull_lwd = NA)


================================================
FILE: R/jpclust.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2017 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Jarvis-Patrick Clustering
#'
#' Fast C++ implementation of Jarvis-Patrick clustering, which first builds
#' a shared nearest neighbor graph (k nearest neighbor sparsification) and then
#' places two points in the same cluster if they are in each other's nearest
#' neighbor lists and share at least kt nearest neighbors.
#'
#' Following the original paper, the shared nearest neighbor list is
#' constructed as the k neighbors plus the point itself (as neighbor zero).
#' Therefore, the threshold `kt` needs to be in the range \eqn{[1, k]}.
#'
#' Fast nearest neighbors search with [kNN()] is only used if `x` is
#' a matrix. In this case Euclidean distance is used.
#'
#' @aliases jpclust print.general_clustering
#' @family clustering functions
#'
#' @param x a data matrix/data.frame (Euclidean distance is used), a
#' precomputed [dist] object or a kNN object created with [kNN()].
#' @param k Neighborhood size for nearest neighbor sparsification. If `x`
#' is a kNN object then `k` may be missing.
#' @param kt threshold on the number of shared nearest neighbors (including the
#' points themselves) to form clusters. Range: \eqn{[1, k]}
#' @param ...  additional arguments are passed on to the k nearest neighbor
#' search algorithm. See [kNN()] for details on how to control the
#' search strategy.
#'
#' @return A object of class `general_clustering` with the following
#' components:
#' \item{cluster }{An integer vector with cluster assignments. Zero
#' indicates noise points.}
#' \item{type }{name of the clustering algorithm used.}
#' \item{metric }{the distance metric used for clustering.}
#' \item{param }{list of the clustering parameters used.}
#'
#' @author Michael Hahsler
#' @references R. A. Jarvis and E. A. Patrick. 1973. Clustering Using a
#' Similarity Measure Based on Shared Near Neighbors. _IEEE Trans. Comput.
#' 22,_ 11 (November 1973), 1025-1034.
#' \doi{10.1109/T-C.1973.223640}
#' @keywords model clustering
#' @examples
#' data("DS3")
#'
#' # use a shared neighborhood of 20 points and require 12 shared neighbors
#' cl <- jpclust(DS3, k = 20, kt = 12)
#' cl
#'
#' clplot(DS3, cl)
#' # Note: JP clustering does not consider noise and thus,
#' # the sine wave points chain clusters together.
#'
#' # use a precomputed kNN object instead of the original data.
#' nn <- kNN(DS3, k = 30)
#' nn
#'
#' cl <- jpclust(nn, k = 20, kt = 12)
#' cl
#'
#' # cluster with noise removed (use low pointdensity to identify noise)
#' d <- pointdensity(DS3, eps = 25)
#' hist(d, breaks = 20)
#' DS3_noiseless <- DS3[d > 110,]
#'
#' cl <- jpclust(DS3_noiseless, k = 20, kt = 10)
#' cl
#'
#' clplot(DS3_noiseless, cl)
#' @export
jpclust <- function(x, k, kt, ...) {
  # Create NN graph
  if (missing(k) && inherits(x, "kNN"))
      k <- x$k
  if (length(kt) != 1 || kt < 1 || kt > k)
    stop("kt needs to be a threshold in range [1, k].")

  nn <- kNN(x, k, sort = FALSE, ...)

  # Perform clustering
  cl <- JP_int(nn$id, kt = as.integer(kt))

  structure(
    list(
      cluster = as.integer(factor(cl)),
      type = "Jarvis-Patrick clustering",
      metric = nn$metric,
      param = list(k = k, kt = kt)
    ),
    class = c("general_clustering")
  )
}
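
## Sketch (not run): the merge rule that JP_int() implements in C++, written
## as a direct, slow R equivalent for illustration only. Points i and j join
## when each is in the other's kNN list and their neighbor lists (with the
## point itself counted as neighbor zero) share at least kt members.
# jp_joins <- function(id, i, j, kt) {
#   j %in% id[i, ] && i %in% id[j, ] &&
#     length(intersect(c(i, id[i, ]), c(j, id[j, ]))) >= kt
# }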

#' @export
print.general_clustering <- function(x, ...) {
  cl <- unique(x$cluster)
  cl <- length(cl[cl != 0L])

  writeLines(c(
    paste0(x$type, " for ", length(x$cluster), " objects."),
    paste0("Parameters: ",
      paste(
        names(x$param),
        unlist(x$param, use.names = FALSE),
        sep = " = ",
        collapse = ", "
      )),
    paste0(
      "The clustering contains ",
      cl,
      " cluster(s) and ",
      sum(x$cluster == 0L),
      " noise points."
    )
  ))

  print(table(x$cluster))
  cat("\n")

  writeLines(strwrap(paste0(
    "Available fields: ",
    toString(names(x))
  ), exdent = 18))
}


================================================
FILE: R/kNN.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.


#' Find the k Nearest Neighbors
#'
#' This function uses a kd-tree to quickly find all k nearest neighbors
#' (including distances) in a data matrix.
#'
#' **Ties:** If the kth and the (k+1)th nearest neighbor are tied, then the
#' neighbor found first is returned and the other one is ignored.
#'
#' **Self-matches:** If no query is specified, then self-matches are
#' removed.
#'
#' Details on the search parameters:
#'
#' * `search` controls whether
#' a kd-tree or linear search is used (both are implemented in the ANN library;
#' see Mount and Arya, 2010). Note that these implementations cannot handle NAs.
#' `search = "dist"` precomputes Euclidean distances first using R. NAs are
#' handled, but the resulting distance matrix cannot contain NAs. To use other
#' distance measures, a precomputed distance matrix can be provided as `x`
#' (`search` is ignored).
#'
#' * `bucketSize` and `splitRule` influence how the kd-tree is
#' built. `approx` uses the approximate nearest neighbor search
#' implemented in ANN. All nearest neighbors up to a distance of
#' `eps / (1 + approx)` will be considered and all with a distance
#' greater than `eps` will not be considered. The other points might be
#' considered. Note that this results in some actual nearest neighbors being
#' omitted, leading to spurious clusters and noise points. However, the
#' algorithm will enjoy a significant speedup. For more details see Mount and
#' Arya (2010).
#'
#' @aliases kNN knn
#' @family NN functions
#'
#' @param x a data matrix, a [dist] object or a [kNN] object.
#' @param k number of neighbors to find.
#' @param query a data matrix with the points to query. If query is not
#' specified, the NN for all the points in `x` is returned. If query is
#' specified then `x` needs to be a data matrix.
#' @param search nearest neighbor search strategy (one of `"kdtree"`, `"linear"` or
#' `"dist"`).
#' @param sort sort the neighbors by distance? Note that some search methods
#' already sort the results. Sorting is expensive and `sort = FALSE` may
#' be much faster for some search methods. kNN objects can be sorted using
#' `sort()`.
#' @param bucketSize max size of the kd-tree leafs.
#' @param splitRule rule to split the kd-tree. One of `"STD"`, `"MIDPT"`, `"FAIR"`,
#' `"SL_MIDPT"`, `"SL_FAIR"` or `"SUGGEST"` (SL stands for sliding). `"SUGGEST"` uses
#' ANNs best guess.
#' @param approx use approximate nearest neighbors. All NN up to a distance of
#' a factor of `1 + approx` eps may be used. Some actual NN may be omitted,
#' leading to spurious clusters and noise points.  However, the algorithm will
#' enjoy a significant speedup.
#' @param decreasing sort in decreasing order?
#' @param ... further arguments
#'
#' @return An object of class `kNN` (subclass of [NN]) containing a
#' list with the following components:
#' \item{dist }{a matrix with distances. }
#' \item{id }{a matrix with `ids`. }
#' \item{k }{number `k` used. }
#' \item{metric }{ used distance metric. }
#'
#' @author Michael Hahsler
#' @references David M. Mount and Sunil Arya (2010). ANN: A Library for
#' Approximate Nearest Neighbor Searching,
#' \url{http://www.cs.umd.edu/~mount/ANN/}.
#' @keywords model
#' @examples
#' data(iris)
#' x <- iris[, -5]
#'
#' # Example 1: finding kNN for all points in a data matrix (using a kd-tree)
#' nn <- kNN(x, k = 5)
#' nn
#'
#' # explore neighborhood of point 10
#' i <- 10
#' nn$id[i,]
#' plot(x, col = ifelse(seq_len(nrow(iris)) %in% nn$id[i,], "red", "black"))
#'
#' # visualize the 5 nearest neighbors
#' plot(nn, x)
#'
#' # visualize a reduced 2-NN graph
#' plot(kNN(nn, k = 2), x)
#'
#' # Example 2: find kNN for query points
#' q <- x[c(1,100),]
#' nn <- kNN(x, k = 10, query = q)
#'
#' plot(nn, x, col = "grey")
#' points(q, pch = 3, lwd = 2)
#'
#' # Example 3: find kNN using distances
#' d <- dist(x, method = "manhattan")
#' nn <- kNN(d, k = 1)
#' plot(nn, x)
#' @export
kNN <-
  function(x,
    k,
    query = NULL,
    sort = TRUE,
    search = "kdtree",
    bucketSize = 10,
    splitRule = "suggest",
    approx = 0) {
    if (inherits(x, "kNN")) {
      if (x$k < k)
        stop("kNN in x does not have enough nearest neighbors.")
      if (!isTRUE(x$sort))
        x <- sort(x)
      x$id <- x$id[, 1:k]
      if (!is.null(x$dist))
        x$dist <- x$dist[, 1:k]
      if (!is.null(x$shared))
        x$shared <- x$shared[, 1:k]
      x$k <- k
      return(x)
    }

    search <- .parse_search(search)
    splitRule <- .parse_splitRule(splitRule)

    k <- as.integer(k)
    if (k < 1)
      stop("Illegal k: needs to be k>=1!")

    ### dist search
    if (search == 3 && !inherits(x, "dist")) {
      if (.matrixlike(x))
        x <- dist(x)
      else
        stop("x needs to be a matrix to calculate distances")
    }

    ### get kNN from a dist object
    if (inherits(x, "dist")) {
      if (!is.null(query))
        stop("query can only be used if x contains a data matrix.")

      if (anyNA(x))
        stop("distances cannot be NAs for kNN!")

      return(dist_to_kNN(x, k = k))
    }

    ## make sure x is numeric
    if (!.matrixlike(x))
      stop("x needs to be a matrix to calculate distances")
    x <- as.matrix(x)
    if (storage.mode(x) == "integer")
      storage.mode(x) <- "double"
    if (storage.mode(x) != "double")
      stop("x has to be a numeric matrix.")

    if (!is.null(query)) {
      query <- as.matrix(query)
      if (storage.mode(query) == "integer")
        storage.mode(query) <- "double"
      if (storage.mode(query) != "double")
        stop("query has to be NULL or a numeric matrix.")
      if (ncol(x) != ncol(query))
        stop("x and query need to have the same number of columns!")
    }

    if (k >= nrow(x))
      stop("Not enough neighbors in data set!")


    if (anyNA(x))
      stop("data/distances cannot contain NAs for kNN (with kd-tree)!")

    ## returns NO self matches
    if (!is.null(query)) {
      ret <- kNN_query_int(
        as.matrix(x),
        as.matrix(query),
        as.integer(k),
        as.integer(search),
        as.integer(bucketSize),
        as.integer(splitRule),
        as.double(approx)
      )
      dimnames(ret$dist) <- list(rownames(query), 1:k)
      dimnames(ret$id) <- list(rownames(query), 1:k)
    } else {
      ret <- kNN_int(
        as.matrix(x),
        as.integer(k),
        as.integer(search),
        as.integer(bucketSize),
        as.integer(splitRule),
        as.double(approx)
      )
      dimnames(ret$dist) <- list(rownames(x), 1:k)
      dimnames(ret$id) <- list(rownames(x), 1:k)
    }

    class(ret) <- c("kNN", "NN")

    ### ANN already returns them sorted (by dist but not by ID)
    if (sort)
      ret <- sort(ret)

    ret$metric <- "euclidean"

    ret
  }

# make sure we have a lower-triangle representation w/o diagonal
.check_dist <- function(x) {
  if (!inherits(x, "dist"))
    stop("x needs to be a dist object")

  # cluster::dissimilarity does not have Diag or Upper attributes, but is a lower triangle
  # representation
  if (inherits(x, "dissimilarity"))
    return(TRUE)

  # check that dist objects have diag = FALSE, upper = FALSE
  if (attr(x, "Diag") || attr(x, "Upper"))
    stop("x needs to be a dist object with attributes Diag and Upper set to FALSE. Use as.dist(x, diag = FALSE, upper = FALSE) first.")
}

dist_to_kNN <- function(x, k) {
  .check_dist(x)

  n <- attr(x, "Size")

  id <- structure(integer(n * k), dim = c(n, k))
  d <- matrix(NA_real_, nrow = n, ncol = k)

  for (i in seq_len(n)) {
    ### Inf -> no self-matches
    y <- dist_row(x, i, self_val = Inf)
    o <- order(y, decreasing = FALSE)
    o <- o[seq_len(k)]
    id[i, ] <- o
    d[i, ] <- y[o]
  }
  dimnames(id) <- list(labels(x), seq_len(k))
  dimnames(d) <- list(labels(x), seq_len(k))

  ret <-
    structure(list(
      dist = d,
      id = id,
      k = k,
      sort = TRUE,
      metric = attr(x, "method")
    ),
      class = c("kNN", "NN"))

  return(ret)
}

#' @rdname kNN
#' @export
sort.kNN <- function(x, decreasing = FALSE, ...) {
  if (isTRUE(x$sort))
    return(x)
  if (is.null(x$dist))
    stop("Unable to sort. Distances are missing.")
  if (ncol(x$id) < 2) {
    x$sort <- TRUE
    return(x)
  }

  ## sort first by dist and break ties using id
  o <- vapply(
    seq_len(nrow(x$dist)),
    function(i) order(x$dist[i, ], x$id[i, ], decreasing = decreasing),
    integer(ncol(x$id))
  )
  for (i in seq_len(ncol(o))) {
    x$dist[i, ] <- x$dist[i, ][o[, i]]
    x$id[i, ] <- x$id[i, ][o[, i]]
  }
  x$sort <- TRUE

  x
}

#' @rdname kNN
#' @export
adjacencylist.kNN <- function(x, ...)
  lapply(
    seq_len(nrow(x$id)),
    FUN = function(i) {
      ## filter NAs
      tmp <- x$id[i, ]
      tmp[!is.na(tmp)]
    }
  )

#' @rdname kNN
#' @export
print.kNN <- function(x, ...) {
  cat("k-nearest neighbors for ",
    nrow(x$id),
    " objects (k=",
    x$k,
    ").",
    "\n",
    sep = "")
  cat("Distance metric:", x$metric, "\n")
  cat("\nAvailable fields: ", toString(names(x)), "\n", sep = "")
}

# Convert names to integers for C++
.parse_search <- function(search) {
  search <- pmatch(toupper(search), c("KDTREE", "LINEAR", "DIST"))
  if (is.na(search))
    stop("Unknown NN search type!")
  search
}

.parse_splitRule <- function(splitRule) {
  splitRule <- pmatch(toupper(splitRule), .ANNsplitRule) - 1L
  if (is.na(splitRule))
    stop("Unknown splitRule!")
  splitRule
}


================================================
FILE: R/kNNdist.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Calculate and Plot k-Nearest Neighbor Distances
#'
#' Fast calculation of the k-nearest neighbor distances for a dataset
#' represented as a matrix of points. The kNN distance is defined as the
#' distance from a point to its k nearest neighbor. The kNN distance plot
#' displays the kNN distance of all points sorted from smallest to largest. The
#' plot can be used to help find suitable parameter values for [dbscan()].
#'
#' @family Outlier Detection Functions
#' @family NN functions
#'
#' @param x the data set as a matrix of points (Euclidean distance is used) or
#' a precalculated [dist] object.
#' @param k number of nearest neighbors used for the distance calculation. For
#' `kNNdistplot()`, a range of values for `k` or `minPts` can also be specified.
#' @param minPts to use a k-NN plot to determine a suitable `eps` value for [dbscan()],
#'    `minPts` used in dbscan can be specified and will set `k = minPts - 1`.
#' @param all should a matrix with the distances to all k nearest neighbors be
#' returned?
#' @param ... further arguments (e.g., kd-tree related parameters) are passed
#' on to [kNN()].
#'
#' @return `kNNdist()` returns a numeric vector with each point's distance to
#' its kth nearest neighbor. If `all = TRUE` then a matrix with k columns
#' containing the distances to all 1st, 2nd, ..., kth nearest neighbors is
#' returned instead.
#'
#' @author Michael Hahsler
#' @keywords model plot
#' @examples
#' data(iris)
#' iris <- as.matrix(iris[, 1:4])
#'
#' ## Find the 4-NN distance for each observation (see ?kNN
#' ## for different search strategies)
#' kNNdist(iris, k = 4)
#'
#' ## Get a matrix with distances to the 1st, 2nd, ..., 4th NN.
#' kNNdist(iris, k = 4, all = TRUE)
#'
#' ## Produce a k-NN distance plot to determine a suitable eps for
#' ## DBSCAN with minPts = 5. Use k = 4 (= minPts - 1).
#' ## The knee is visible around a distance of .7
#' kNNdistplot(iris, k = 4)
#'
#' ## Look at all k-NN distance plots for a k of 1 to 20
#' ## Note that k-NN distances are increasing in k
#' kNNdistplot(iris, k = 1:20)
#'
#' cl <- dbscan(iris, eps = .7, minPts = 5)
#' pairs(iris, col = cl$cluster + 1L)
#' ## Note: black points are noise points
#' @export
kNNdist <- function(x, k, all = FALSE, ...) {
  kNNd <- kNN(x, k, sort = TRUE, ...)$dist
  if (!all)
    kNNd <- kNNd[, k]
  kNNd
}

#' @rdname kNNdist
#' @export
kNNdistplot <- function(x, k, minPts, ...) {
  if (missing(k) && missing(minPts))
    stop("k or minPts need to be specified.")

  if (missing(k))
    k <- minPts - 1

  if (length(k) == 1) {
    kNNdist <- sort(kNNdist(x, k, ...))
    plot(
      kNNdist,
      type = "l",
      ylab = paste0(k, "-NN distance"),
      xlab = "Points sorted by distance"
    )
  } else {
    knnds <- vapply(k, function(i) sort(kNNdist(x, i, ...)), numeric(nrow(x)))

    matplot(knnds, type = "l", lty = 1,
            ylab = "k-NN distance",
            xlab = "Points sorted by distance")
  }
}


================================================
FILE: R/moons.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Moons Data
#'
#' Contains 100 2-d points: half of the points form two "blobs"
#' (25 points each) and the other half form two asymmetric facing crescent
#' ("moon") shapes, which are not linearly separable from each other.
#'
#' This data was generated with the following Python commands using the
#' SciKit-Learn library:
#'
#' `> import numpy as np`
#'
#' `> import sklearn.datasets as data`
#'
#' `> moons, _ = data.make_moons(n_samples=50, noise=0.05)`
#'
#' `> blobs, _ = data.make_blobs(n_samples=50, centers=[(-0.75,2.25), (1.0, 2.0)], cluster_std=0.25)`
#'
#' `> test_data = np.vstack([moons, blobs])`
#'
#' @name moons
#' @docType data
#' @format A data frame with 100 observations on the following 2 variables.
#' \describe{
#' \item{X}{a numeric vector}
#' \item{Y}{a numeric vector} }
#' @references Pedregosa, Fabian, Gael Varoquaux, Alexandre Gramfort,
#' Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel et al.
#' Scikit-learn: Machine learning in Python. _Journal of Machine Learning
#' Research_ 12, no. Oct (2011): 2825-2830.
#' @source See the "How HDBSCAN Works" notebook in the hdbscan documentation:
#' \url{http://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html}
#' @keywords datasets
#' @examples
#' data(moons)
#' plot(moons, pch=20)
NULL





================================================
FILE: R/ncluster.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler, Matt Piekenbrock

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Number of Clusters, Noise Points, and Observations
#'
#' Extract the number of clusters or the number of noise points for
#' a clustering. This function works with any clustering result that
#' contains a list element named `cluster` with a clustering vector. In
#' addition, `nobs` (see [stats::nobs()]) is also available to retrieve
#' the number of clustered points.
#'
#' @name ncluster
#' @aliases ncluster nnoise nobs
#' @family clustering functions
#'
#' @param object a clustering result object containing a `cluster` element.
#' @param ...  additional arguments are unused.
#'
#' @return returns the number of clusters or noise points.
#' @examples
#' data(iris)
#' iris <- as.matrix(iris[, 1:4])
#'
#' res <- dbscan(iris, eps = .7, minPts = 5)
#' res
#'
#' ncluster(res)
#' nnoise(res)
#' nobs(res)
#'
#' # the functions also work with kmeans and other clustering algorithms.
#' cl <- kmeans(iris, centers = 3)
#' ncluster(cl)
#' nnoise(cl)
#' nobs(res)
#' @export
ncluster <- function(object, ...) {
  UseMethod("ncluster")
}

#' @export
ncluster.default <- function(object, ...) {
  if (!is.list(object) || !is.numeric(object$cluster))
    stop("ncluster() requires a clustering object with a cluster component containing the cluster labels.")

  length(setdiff(unique(object$cluster), 0L))
}

#' @rdname ncluster
#' @export
nnoise <- function(object, ...) {
  UseMethod("nnoise")
}

#' @export
nnoise.default <- function(object, ...) {
  if (!is.list(object) || !is.numeric(object$cluster))
    stop("nnoise() requires a clustering object with a cluster component containing the cluster labels.")

  sum(object$cluster == 0L)
}
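
## As a quick sketch of the convention the default methods above implement
## (hand-built object for illustration, not package output; label 0 marks noise):
##
## ```r
## ## hypothetical clustering result: any list with a numeric `cluster` element
## cl <- list(cluster = c(0, 1, 1, 2, 0, 3))
##
## ## number of clusters: distinct labels, excluding the noise label 0
## length(setdiff(unique(cl$cluster), 0L))  # 3
##
## ## number of noise points: points labeled 0
## sum(cl$cluster == 0L)                    # 2
## ```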


================================================
FILE: R/nobs.R
================================================

#' @importFrom stats nobs
#' @export
nobs.dbscan <- function(object, ...) length(object$cluster)

#' @export
nobs.hdbscan <- function(object, ...) length(object$cluster)

#' @export
nobs.general_clustering <- function(object, ...) length(object$cluster)
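
## These one-line methods all report the length of the cluster label vector.
## A minimal sketch with a stand-in helper and a hand-built object (not the
## registered S3 methods above):
##
## ```r
## ## stand-in for the methods above: the number of clustered points is
## ## simply the length of the cluster label vector (noise points included)
## nobs_clustering <- function(object, ...) length(object$cluster)
##
## res <- list(cluster = c(1L, 1L, 2L, 0L))
## nobs_clustering(res)  # 4 (three clustered points plus one noise point)
## ```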



================================================
FILE: R/optics.R
================================================
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
#          and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

#' Ordering Points to Identify the Clustering Structure (OPTICS)
#'
#' Implementation of the OPTICS (Ordering points to identify the clustering
#' structure) point ordering algorithm using a kd-tree.
#'
#' **The algorithm**
#'
#' This implementation of OPTICS implements the original
#' algorithm as described by Ankerst et al (1999). OPTICS is an ordering
#' algorithm with methods to extract a clustering from the ordering.
#' While using similar concepts as DBSCAN, for OPTICS `eps`
#' is only an upper limit for the neighborhood size used to reduce
#' computational complexity. Note that `minPts` in OPTICS has a different
#' effect than in DBSCAN. It is used to define dense neighborhoods, but since
#' `eps` is typically set rather high, this does not affect the ordering
#' much. However, it is also used to calculate the reachability distance, and
#' larger values will make the reachability distance plot smoother.
#'
#' OPTICS linearly orders the data points such that spatially close points
#' become neighbors in the ordering. The closest analog to this
#' ordering is the dendrogram of single-link hierarchical clustering. The algorithm
#' also calculates the reachability distance for each point.
#' `plot()` (see [reachability_plot])
#' produces a reachability plot, which shows each point's reachability
#' distance, with the points sorted in OPTICS order.
#' Valleys represent clusters (the
#' deeper the valley, the denser the cluster) and high points indicate
#' points between clusters.
#'
#' **Specifying the data**
#'
#' If `x` is specified as a data matrix, then Euclidean distances and fast
#' nearest neighbor lookup using a kd-tree are used. See [kNN()] for
#' details on the parameters for the kd-tree.
#'
#' **Extracting a clustering**
#'
#' Several methods to extract a clustering from the order returned by OPTICS are
#' implemented:
#'
#' * `extractDBSCAN()` extracts a clustering from an OPTICS ordering that is
#'   similar to what DBSCAN would produce with an eps set to `eps_cl` (see
#'   Ankerst et al, 1999). The only difference to a DBSCAN clustering is that
#'   OPTICS is not able to assign some border points and reports them instead as
#'   noise.
#'
#' * `extractXi()` extracts clusters hierarchically, as specified in Ankerst et al
#'   (1999) based on the steepness of the reachability plot. One interpretation
#'   of the `xi` parameter is that it classifies clusters by change in
#'   relative cluster density. The algorithm used here was originally contributed
#'   by the ELKI framework and is explained in Schubert et al (2018), but
#'   contains a set of fixes.
#'
#' **Predict cluster memberships**
#'
#' `predict()` requires a DBSCAN clustering extracted with `extractDBSCAN()` and
#' then uses the predict method for `dbscan()` objects.
#'
#' @aliases optics OPTICS
#' @family clustering functions
#'
#' @param x a data matrix or a [dist] object.
#' @param eps upper limit of the size of the epsilon neighborhood. Limiting the
#' neighborhood size improves performance and has no or very little impact on
#' the ordering as long as it is not set too low. If not specified, the largest
#' minPts-distance in the data set is used which gives the same result as
#' infinity.
#' @param minPts the parameter is used to identify dense neighborhoods and the
#' reachability distance is calculated as the distance to the minPts nearest
#' neighbor. Controls the smoothness of the reachability distribution. Default
#' is 5 points.
#' @param eps_cl Threshold to identify clusters (`eps_cl <= eps`).
#' @param xi Steepness threshold to identify clusters hierarchically using the
#' Xi method.
#' @param object an object of class `optics`.
#' @param minimum logical, representing whether or not to extract the minimal
#' (non-overlapping) clusters in the Xi clustering algorithm.
#' @param correctPredecessors logical, correct a common artifact by pruning
#' the steep up area for points that have predecessors not in the
#' cluster--found by the ELKI framework, see details below.
#' @param ...  additional arguments are passed on to fixed-radius nearest
#' neighbor search algorithm. See [frNN()] for details on how to
#' control the search strategy.
#' @param cluster,predecessor plot clusters and predecessors.
#'
#' @return An object of class `optics` with components:
#' \item{eps }{ value of `eps` parameter. }
#' \item{minPts }{ value of `minPts` parameter. }
#' \item{order }{ optics order for the data points in `x`. }
#' \item{reachdist }{ [reachability] distance for each data point in `x`. }
#' \item{coredist }{ core distance for each data point in `x`. }
#'
#' For `extractDBSCAN()`, in addition the following
#' components are available:
#' \item{eps_cl }{ the value of the `eps_cl` parameter. }
#' \item{cluster }{ assigned cluster labels in the order of the data points in `x`. }
#'
#' For `extractXi()`, in addition the following components
#' are available:
#' \item{xi}{ value of the `xi` steepness threshold. }
#' \item{cluster }{ assigned cluster labels in the order of the data points in `x`.}
#' \item{clusters_xi }{ data.frame containing the start and end of each cluster
#' found in the OPTICS ordering. }
#'
#' @author Michael Hahsler and Matthew Piekenbrock
#' @seealso Density [reachability].
#'
#' @references Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg
#' Sander (1999). OPTICS: Ordering Points To Identify the Clustering Structure.
#' _ACM SIGMOD international conference on Management of data._ ACM Press.
#' \doi{10.1145/304181.304187}
#'
#' Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based
#' Clustering with R.  _Journal of Statistical Software_, 91(1), 1-30.
#' \doi{10.18637/jss.v091.i01}
#'
#' Erich Schubert, Michael Gertz (2018). Improving the Cluster Structure
#' Extracted from OPTICS Plots. In _Lernen, Wissen, Daten, Analysen (LWDA 2018),_
#' pp. 318-329.
#' @keywords model clustering
#' @examples
#' set.seed(2)
#' n <- 400
#'
#' x <- cbind(
#'   x = runif(4, 0, 1) + rnorm(n, sd = 0.1),
#'   y = runif(4, 0, 1) + rnorm(n, sd = 0.1)
#'   )
#'
#' plot(x, col=rep(1:4, times = 100))
#'
#' ### run OPTICS (Note: we use the default eps calculation)
#' res <- optics(x, minPts = 10)
#' res
#'
#' ### get order
#' res$order
#'
#' ### plot produces a reachability plot
#' plot(res)
#'
#' ### plot the order of points in the reachability plot
#' plot(x, col = "grey")
#' polygon(x[res$order, ])
#'
#' ### extract a DBSCAN clustering by cutting the reachability plot at eps_cl
#' res <- extractDBSCAN(res, eps_cl = .065)
#' res
#'
#' plot(res)  ## black is noise
#' hullplot(x, res)
#'
#' ### re-cut at a higher eps threshold
#' res <- extractDBSCAN(res, eps_cl = .07)
#' res
#' plot(res)
#' hullplot(x, res)
#'
#' ### extract hierarchical clustering of varying density using the Xi method
#' res <- extractXi(res, xi = 0.01)
#' res
#'
#' plot(res)
#' hullplot(x, res)
#'
#' # Xi cluster structure
#' res$clusters_xi
#'
#' ### use OPTICS on a precomputed distance matrix
#' d <- dist(x)
#' res <- optics(d, minPts = 10)
#' plot(res)
#' @export
optics <- function(x, eps = NULL, minPts = 5, ...) {
  ### find eps from minPts
  eps <- eps %||% max(kNNdist(x, k = minPts))

  ### extra contains settings for frNN
  ### search = "kdtree", bucketSize = 10, splitRule = "suggest", approx = 0
  extra <- list(...)
  args <- c("search", "bucketSize", "splitRule", "approx")
  m <- pmatch(names(extra), args)
  if (anyNA(m))
    stop("Unknown parameter: ",
      toString(names(extra)[is.na(m)]))
  names(extra) <- args[m]

  search <- .parse_search(extra$search %||% "kdtree")
  splitRule <- .parse_splitRule(extra$splitRule %||% "suggest")
  bucketSize <- as.integer(extra$bucketSize %||% 10L)
  approx <- as.integer(extra$approx %||% 0L)

  ### dist search
  if (search == 3L && !inherits(x, "dist")) {
    if (.matrixlike(x))
      x <- dist(x)
    else
      stop("x needs to be a matrix to calculate distances")
  }

  ## for dist we provide the R code with a frNN list and no x
  frNN <- list()
  if (inherits(x, "dist")) {
    frNN <- frNN(x, eps, ...)
    ## add self match and use C numbering
    frNN$id <- lapply(
      seq_along(frNN$id),
      FUN = function(i)
        c(i - 1L, frNN$id[[i]] - 1L)
    )
    frNN$dist <- lapply(
      seq_along(frNN$dist),
      FUN = function(i)
        c(0, frNN$dist[[i]]) ^ 2
    )

    x <- matrix()
    storage.mode(x) <- "double"

  } else{
    if (!.matrixlike(x))
      stop("x needs to be a matrix")
    ## make sure x is numeric
    x <- as.matrix(x)
    if (storage.mode(x) == "integer")
      storage.mode(x) <- "double"
    if (storage.mode(x) != "double")
      stop("x has to be a numeric matrix.")
  }

  if (length(frNN) == 0 &&
      anyNA(x))
    stop("data/distances cannot contain NAs for optics (with kd-tree)!")

  ret <-
    optics_int(
      as.matrix(x),
      as.double(eps),
      as.integer(minPts),
      as.integer(search),
      as.integer(bucketSize),
      as.integer(splitRule),
      as.double(approx),
      frNN
    )

  ret$minPts <- minPts
  ret$eps <- eps
  ret$eps_cl <- NA_real_
  ret$xi <- NA_real_
  class(ret) <- "optics"

  ret
}
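
## The default-eps rule at the top of optics() can be sketched in base R
## (illustration only; the package computes it with a kd-tree via kNNdist()):
##
## ```r
## set.seed(1)
## x <- matrix(rnorm(200), ncol = 2)
## minPts <- 5
##
## ## distance from each point to its minPts-th nearest neighbor; the sorted
## ## row starts with the self-distance 0, so that neighbor is at minPts + 1
## d <- as.matrix(dist(x))
## knn_dist <- apply(d, 1, function(row) sort(row)[minPts + 1])
##
## ## the default eps is the largest of these, i.e. max(kNNdist(x, k = minPts))
## eps_default <- max(knn_dist)
## ```
##
## With this value, optics(x, minPts = 5) behaves like
## optics(x, eps = eps_default, minPts = 5).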

#' @rdname optics
#' @export
print.optics <- function(x, ...) {
  writeLines(c(
    paste0(
      "OPTICS ordering/clustering for ",
      length(x$order),
      " objects."
    ),
    paste0(
      "Parameters: ",
      "m
gitextract_jkl9o70t/

├── .Rbuildignore
├── .github/
│   └── .gitignore
├── .gitignore
├── DESCRIPTION
├── LICENSE
├── NAMESPACE
├── NEWS.md
├── R/
│   ├── AAA_dbscan-package.R
│   ├── AAA_definitions.R
│   ├── DBCV_datasets.R
│   ├── DS3.R
│   ├── GLOSH.R
│   ├── LOF.R
│   ├── NN.R
│   ├── RcppExports.R
│   ├── broom-dbscan-tidiers.R
│   ├── comps.R
│   ├── dbcv.R
│   ├── dbscan.R
│   ├── dendrogram.R
│   ├── extractFOSC.R
│   ├── frNN.R
│   ├── hdbscan.R
│   ├── hullplot.R
│   ├── jpclust.R
│   ├── kNN.R
│   ├── kNNdist.R
│   ├── moons.R
│   ├── ncluster.R
│   ├── nobs.R
│   ├── optics.R
│   ├── pointdensity.R
│   ├── predict.R
│   ├── reachability.R
│   ├── sNN.R
│   ├── sNNclust.R
│   ├── utils.R
│   └── zzz.R
├── README.Rmd
├── README.md
├── data/
│   ├── DS3.rdata
│   ├── Dataset_1.rda
│   ├── Dataset_2.rda
│   ├── Dataset_3.rda
│   ├── Dataset_4.rda
│   └── moons.rdata
├── data_src/
│   ├── data_DBCV/
│   │   ├── dataset_1.txt
│   │   ├── dataset_2.txt
│   │   ├── dataset_3.txt
│   │   ├── dataset_4.txt
│   │   ├── read_data.R
│   │   └── test_DBCV.R
│   └── data_chameleon/
│       └── read.R
├── dbscan.Rproj
├── inst/
│   └── CITATION
├── man/
│   ├── DBCV_datasets.Rd
│   ├── DS3.Rd
│   ├── NN.Rd
│   ├── comps.Rd
│   ├── dbcv.Rd
│   ├── dbscan-package.Rd
│   ├── dbscan.Rd
│   ├── dbscan_tidiers.Rd
│   ├── dendrogram.Rd
│   ├── extractFOSC.Rd
│   ├── frNN.Rd
│   ├── glosh.Rd
│   ├── hdbscan.Rd
│   ├── hullplot.Rd
│   ├── jpclust.Rd
│   ├── kNN.Rd
│   ├── kNNdist.Rd
│   ├── lof.Rd
│   ├── moons.Rd
│   ├── ncluster.Rd
│   ├── optics.Rd
│   ├── pointdensity.Rd
│   ├── reachability.Rd
│   ├── sNN.Rd
│   └── sNNclust.Rd
├── src/
│   ├── ANN/
│   │   ├── ANN.cpp
│   │   ├── ANN.h
│   │   ├── ANNperf.h
│   │   ├── ANNx.h
│   │   ├── Copyright.txt
│   │   ├── License.txt
│   │   ├── ReadMe.txt
│   │   ├── bd_fix_rad_search.cpp
│   │   ├── bd_pr_search.cpp
│   │   ├── bd_search.cpp
│   │   ├── bd_tree.cpp
│   │   ├── bd_tree.h
│   │   ├── brute.cpp
│   │   ├── kd_dump.cpp
│   │   ├── kd_fix_rad_search.cpp
│   │   ├── kd_fix_rad_search.h
│   │   ├── kd_pr_search.cpp
│   │   ├── kd_pr_search.h
│   │   ├── kd_search.cpp
│   │   ├── kd_search.h
│   │   ├── kd_split.cpp
│   │   ├── kd_split.h
│   │   ├── kd_tree.cpp
│   │   ├── kd_tree.h
│   │   ├── kd_util.cpp
│   │   ├── kd_util.h
│   │   ├── perf.cpp
│   │   ├── pr_queue.h
│   │   └── pr_queue_k.h
│   ├── JP.cpp
│   ├── Makevars
│   ├── RcppExports.cpp
│   ├── UnionFind.cpp
│   ├── UnionFind.h
│   ├── cleanup.cpp
│   ├── connectedComps.cpp
│   ├── dbcv.cpp
│   ├── dbscan.cpp
│   ├── dendrogram.cpp
│   ├── density.cpp
│   ├── frNN.cpp
│   ├── hdbscan.cpp
│   ├── kNN.cpp
│   ├── kNN.h
│   ├── lof.cpp
│   ├── lt.h
│   ├── mrd.cpp
│   ├── mst.cpp
│   ├── mst.h
│   ├── optics.cpp
│   ├── regionQuery.cpp
│   ├── regionQuery.h
│   ├── utilities.cpp
│   └── utilities.h
├── tests/
│   ├── testthat/
│   │   ├── fixtures/
│   │   │   ├── elki_optics.rda
│   │   │   ├── elki_optics_xi.rda
│   │   │   └── test_data.rda
│   │   ├── test-dbcv.R
│   │   ├── test-dbscan.R
│   │   ├── test-fosc.R
│   │   ├── test-frNN.R
│   │   ├── test-hdbscan.R
│   │   ├── test-kNN.R
│   │   ├── test-kNNdist.R
│   │   ├── test-lof.R
│   │   ├── test-mst.R
│   │   ├── test-optics.R
│   │   ├── test-opticsXi.R
│   │   ├── test-predict.R
│   │   └── test-sNN.R
│   └── testthat.R
└── vignettes/
    ├── dbscan.Rnw
    ├── dbscan.bib
    └── hdbscan.Rmd
SYMBOL INDEX (185 symbols across 33 files)

FILE: src/ANN/ANN.cpp
  function ANNdist (line 47) | ANNdist annDist(						// interpoint squared distance
  function annPrintPt (line 71) | void annPrintPt(						// print a point
  function ANNpoint (line 111) | ANNpoint annAllocPt(int dim, ANNcoord c)		// allocate 1 point
  function ANNpointArray (line 118) | ANNpointArray annAllocPts(int n, int dim)		// allocate n pts in dim
  function annDeallocPt (line 128) | void annDeallocPt(ANNpoint &p)					// deallocate 1 point
  function annDeallocPts (line 134) | void annDeallocPts(ANNpointArray &pa)			// deallocate points
  function ANNpoint (line 141) | ANNpoint annCopyPt(int dim, ANNpoint source)	// copy point
  function annAssignRect (line 149) | void annAssignRect(int dim, ANNorthRect &dest, const ANNorthRect &source)
  function ANNbool (line 158) | ANNbool ANNorthRect::inside(int dim, ANNpoint p)
  function annError (line 170) | void annError(const char *msg, ANNerr level)
  function annMaxPtsVisit (line 200) | void annMaxPtsVisit(			// set limit on max. pts to visit in search

FILE: src/ANN/ANN.h
  type ANNbool (line 130) | enum ANNbool {ANNfalse = 0, ANNtrue = 1}
  type ANNcoord (line 156) | typedef double	ANNcoord;
  type ANNdist (line 157) | typedef double	ANNdist;
  type ANNidx (line 173) | typedef int		ANNidx;
  type ANNcoord (line 374) | typedef ANNcoord* ANNpoint;
  type ANNpoint (line 375) | typedef ANNpoint* ANNpointArray;
  type ANNdist (line 376) | typedef ANNdist*  ANNdistArray;
  type ANNidx (line 377) | typedef ANNidx*   ANNidxArray;
  function class (line 490) | class DLL_API ANNpointSet {
  function theDim (line 574) | int theDim()						// return dimension of space
  function nPoints (line 577) | int nPoints()						// return number of points
  function ANNpointArray (line 580) | ANNpointArray thePoints()			// return pointer to points
  type ANNsplitRule (line 605) | enum ANNsplitRule {
  type ANNshrinkRule (line 614) | enum ANNshrinkRule {
  type ANNkd_node (line 712) | typedef ANNkd_node*	ANNkd_ptr;
  function theDim (line 779) | int theDim()						// return dimension of space
  function nPoints (line 782) | int nPoints()						// return number of points
  function ANNpointArray (line 785) | ANNpointArray thePoints()			// return pointer to points

FILE: src/ANN/ANNperf.h
  function class (line 45) | class ANNkdStats {			// stats on kd-tree
  function class (line 85) | class DLL_API ANNsampStat {

FILE: src/ANN/ANNx.h
  type ANNerr (line 46) | enum ANNerr {ANNwarn = 0, ANNabort = 1}
  function class (line 89) | class ANNorthRect {
  function class (line 130) | class ANNorthHalfSpace {
  function ANNbool (line 145) | ANNbool in(ANNpoint q) const	// is q inside halfspace?
  function ANNbool (line 148) | ANNbool out(ANNpoint q) const	// is q outside halfspace?
  function ANNdist (line 151) | ANNdist dist(ANNpoint q) const	// (squared) distance from q
  function setLowerBound (line 154) | void setLowerBound(int d, ANNpoint p)// set to lower bound at p[i]
  function setUpperBound (line 157) | void setUpperBound(int d, ANNpoint p)// set to upper bound at p[i]
  function project (line 160) | void project(ANNpoint &q)		// project q (modified) onto halfspace
  type ANNorthHalfSpace (line 165) | typedef ANNorthHalfSpace *ANNorthHSArray;

FILE: src/ANN/bd_tree.cpp
  type ANNdecomp (line 163) | enum ANNdecomp {SPLIT, SHRINK}
  function ANNdecomp (line 182) | ANNdecomp trySimpleShrink(				// try a simple shrink
  function ANNdecomp (line 237) | ANNdecomp tryCentroidShrink(			// try a centroid shrink
  function ANNdecomp (line 280) | ANNdecomp selectDecomp(			// select decomposition method
  function ANNkd_ptr (line 336) | ANNkd_ptr rbd_tree(				// recursive construction of bd-tree

FILE: src/ANN/bd_tree.h
  function class (line 61) | class ANNbd_shrink : public ANNkd_node	// splitting node of a kd-tree

FILE: src/ANN/kd_dump.cpp
  type ANNtreeType (line 53) | enum ANNtreeType {KD_TREE, BD_TREE}
  function ANNkd_ptr (line 260) | static ANNkd_ptr annReadDump(
  function ANNkd_ptr (line 375) | static ANNkd_ptr annReadTree(

FILE: src/ANN/kd_split.cpp
  function kd_split (line 44) | void kd_split(
  function midpt_split (line 76) | void midpt_split(
  function sl_midpt_split (line 146) | void sl_midpt_split(
  function fair_split (line 243) | void fair_split(
  function sl_fair_split (line 346) | void sl_fair_split(

FILE: src/ANN/kd_tree.cpp
  function annClose (line 221) | void annClose()				// close use of ANN
  function ANNkd_ptr (line 314) | ANNkd_ptr rkd_tree(				// recursive construction of kd-tree

FILE: src/ANN/kd_tree.h
  function class (line 46) | class ANNkd_node{						// generic kd-tree node (empty shell)
  function class (line 91) | class ANNkd_leaf: public ANNkd_node		// leaf node for kd-tree
  function class (line 142) | class ANNkd_split : public ANNkd_node	// splitting node of a kd-tree

FILE: src/ANN/kd_util.cpp
  function annAspectRatio (line 52) | double annAspectRatio(
  function annEnclRect (line 73) | void annEnclRect(
  function annEnclCube (line 92) | void annEnclCube(						// compute smallest enclosing cube
  function ANNdist (line 124) | ANNdist annBoxDistance(			// compute distance from point to box
  function ANNcoord (line 154) | ANNcoord annSpread(				// compute point spread along dimension
  function annMinMax (line 170) | void annMinMax(					// compute min and max coordinates along dim
  function annMaxSpread (line 187) | int annMaxSpread(						// compute dimension of max spread
  function annMedianSplit (line 230) | void annMedianSplit(
  function annPlaneSplit (line 291) | void annPlaneSplit(				// split points by a plane
  function annBoxSplit (line 332) | void annBoxSplit(				// split points by a box
  function annSplitBalance (line 360) | int annSplitBalance(			// determine balance factor of a split
  function annBox2Bnds (line 384) | void annBox2Bnds(						// convert inner box to bounds
  function annBnds2Box (line 426) | void annBnds2Box(

FILE: src/ANN/perf.cpp
  function DLL_API (line 69) | DLL_API void annResetStats(int data_size) // reset stats for a set of qu...
  function DLL_API (line 83) | DLL_API void annResetCounts()				// reset counts for one query
  function DLL_API (line 93) | DLL_API void annUpdateStats()				// update stats with current counts
  function print_one_stat (line 105) | void print_one_stat(const char *title, ANNsampStat s, double div)
  function DLL_API (line 114) | DLL_API void annPrintStats(				// print statistics for a run

FILE: src/ANN/pr_queue.h
  type ANNdist (line 36) | typedef ANNdist			PQkey;
  function class (line 54) | class ANNpr_queue {
  function ANNbool (line 75) | ANNbool empty()						// is queue empty?
  function ANNbool (line 78) | ANNbool non_empty()					// is queue nonempty?
  function reset (line 81) | void reset()						// make existing queue empty
  function insert (line 84) | inline void insert(					// insert item (inlined for speed)
  function extr_min (line 102) | inline void extr_min(				// extract minimum (inlined for speed)

FILE: src/ANN/pr_queue_k.h
  type ANNdist (line 34) | typedef ANNdist			PQKkey;
  type PQKinfo (line 35) | typedef int				PQKinfo;
  function class (line 66) | class ANNmin_k {
  function PQKkey (line 87) | PQKkey ANNmin_key()					// return minimum key
  function PQKkey (line 90) | PQKkey max_key()					// return maximum key
  function PQKkey (line 93) | PQKkey ith_smallest_key(int i)		// ith smallest key (i in [0..n-1])
  function PQKinfo (line 96) | PQKinfo ith_smallest_info(int i)	// info for ith smallest (i in [0..n-1])
  function insert (line 99) | inline void insert(					// insert item (inlined for speed)

FILE: src/JP.cpp
  function IntegerVector (line 16) | IntegerVector JP_int(IntegerMatrix nn, unsigned int kt) {
  function IntegerMatrix (line 88) | IntegerMatrix SNN_sim_int(IntegerMatrix nn, LogicalVector jp) {

FILE: src/RcppExports.cpp
  function RcppExport (line 15) | RcppExport SEXP _dbscan_JP_int(SEXP nnSEXP, SEXP ktSEXP) {
  function RcppExport (line 27) | RcppExport SEXP _dbscan_SNN_sim_int(SEXP nnSEXP, SEXP jpSEXP) {
  function RcppExport (line 39) | RcppExport SEXP _dbscan_ANN_cleanup() {
  function RcppExport (line 48) | RcppExport SEXP _dbscan_comps_kNN(SEXP nnSEXP, SEXP mutualSEXP) {
  function RcppExport (line 60) | RcppExport SEXP _dbscan_comps_frNN(SEXP nnSEXP, SEXP mutualSEXP) {
  function RcppExport (line 72) | RcppExport SEXP _dbscan_intToStr(SEXP ivSEXP) {
  function RcppExport (line 83) | RcppExport SEXP _dbscan_dist_subset(SEXP distSEXP, SEXP idxSEXP) {
  function RcppExport (line 95) | RcppExport SEXP _dbscan_XOR(SEXP lhsSEXP, SEXP rhsSEXP) {
  function RcppExport (line 107) | RcppExport SEXP _dbscan_dspc(SEXP cl_idxSEXP, SEXP internal_nodesSEXP, S...
  function RcppExport (line 121) | RcppExport SEXP _dbscan_dbscan_int(SEXP dataSEXP, SEXP epsSEXP, SEXP min...
  function RcppExport (line 141) | RcppExport SEXP _dbscan_reach_to_dendrogram(SEXP reachabilitySEXP, SEXP ...
  function RcppExport (line 153) | RcppExport SEXP _dbscan_dendrogram_to_reach(SEXP xSEXP) {
  function RcppExport (line 164) | RcppExport SEXP _dbscan_mst_to_dendrogram(SEXP mstSEXP) {
  function RcppExport (line 175) | RcppExport SEXP _dbscan_dbscan_density_int(SEXP dataSEXP, SEXP epsSEXP, ...
  function RcppExport (line 191) | RcppExport SEXP _dbscan_frNN_int(SEXP dataSEXP, SEXP epsSEXP, SEXP typeS...
  function RcppExport (line 207) | RcppExport SEXP _dbscan_frNN_query_int(SEXP dataSEXP, SEXP querySEXP, SE...
  function RcppExport (line 224) | RcppExport SEXP _dbscan_distToAdjacency(SEXP constraintsSEXP, SEXP NSEXP) {
  function RcppExport (line 236) | RcppExport SEXP _dbscan_buildDendrogram(SEXP hclSEXP) {
  function RcppExport (line 247) | RcppExport SEXP _dbscan_all_children(SEXP hierSEXP, SEXP keySEXP, SEXP l...
  function RcppExport (line 260) | RcppExport SEXP _dbscan_node_xy(SEXP cl_treeSEXP, SEXP cl_hierarchySEXP,...
  function RcppExport (line 273) | RcppExport SEXP _dbscan_simplifiedTree(SEXP cl_treeSEXP) {
  function RcppExport (line 284) | RcppExport SEXP _dbscan_computeStability(SEXP hclSEXP, SEXP minPtsSEXP, ...
  function RcppExport (line 297) | RcppExport SEXP _dbscan_validateConstraintList(SEXP constraintsSEXP, SEX...
  function RcppExport (line 309) | RcppExport SEXP _dbscan_computeVirtualNode(SEXP noiseSEXP, SEXP constrai...
  function RcppExport (line 321) | RcppExport SEXP _dbscan_fosc(SEXP cl_treeSEXP, SEXP cidSEXP, SEXP scSEXP...
  function RcppExport (line 341) | RcppExport SEXP _dbscan_extractUnsupervised(SEXP cl_treeSEXP, SEXP prune...
  function RcppExport (line 354) | RcppExport SEXP _dbscan_extractSemiSupervised(SEXP cl_treeSEXP, SEXP con...
  function RcppExport (line 369) | RcppExport SEXP _dbscan_kNN_query_int(SEXP dataSEXP, SEXP querySEXP, SEX...
  function RcppExport (line 386) | RcppExport SEXP _dbscan_kNN_int(SEXP dataSEXP, SEXP kSEXP, SEXP typeSEXP...
  function RcppExport (line 402) | RcppExport SEXP _dbscan_lof_kNN(SEXP dataSEXP, SEXP minPtsSEXP, SEXP typ...
  function RcppExport (line 418) | RcppExport SEXP _dbscan_mrd(SEXP dmSEXP, SEXP cdSEXP) {
  function RcppExport (line 430) | RcppExport SEXP _dbscan_mst(SEXP x_distSEXP, SEXP nSEXP) {
  function RcppExport (line 442) | RcppExport SEXP _dbscan_hclustMergeOrder(SEXP mstSEXP, SEXP oSEXP) {
  function RcppExport (line 454) | RcppExport SEXP _dbscan_optics_int(SEXP dataSEXP, SEXP epsSEXP, SEXP min...
  function RcppExport (line 472) | RcppExport SEXP _dbscan_lowerTri(SEXP mSEXP) {
  function RcppExport (line 521) | RcppExport void R_init_dbscan(DllInfo *dll) {

FILE: src/UnionFind.h
  function class (line 21) | class UnionFind

FILE: src/cleanup.cpp
  function ANN_cleanup (line 16) | void ANN_cleanup() {

FILE: src/connectedComps.cpp
  function IntegerVector (line 17) | IntegerVector comps_kNN(IntegerMatrix nn, bool mutual) {
  function IntegerVector (line 73) | IntegerVector comps_frNN(List nn, bool mutual) {

FILE: src/dbcv.cpp
  function StringVector (line 25) | StringVector intToStr(IntegerVector iv){
  function toMap (line 34) | std::unordered_map<std::string, double> toMap(List map){
  function NumericVector (line 44) | NumericVector retrieve(StringVector keys, std::unordered_map<std::string...
  function NumericVector (line 52) | NumericVector dist_subset_arma(const NumericVector& dist, IntegerVector ...
  function NumericVector (line 65) | NumericVector dist_subset(const NumericVector& dist, IntegerVector idx){
  function remove_zero (line 83) | bool remove_zero(ANNdist cdist){
  function ANNdist (line 87) | ANNdist inv_density(ANNdist cdist){
  function XOR (line 208) | Rcpp::LogicalVector XOR(Rcpp::LogicalVector lhs, Rcpp::LogicalVector rhs) {
  function NumericMatrix (line 216) | NumericMatrix dspc(const List& cl_idx, const List& internal_nodes, const...

FILE: src/dbscan.cpp
  function IntegerVector (line 24) | IntegerVector dbscan_int(

FILE: src/dendrogram.cpp
  function fast_atoi (line 18) | int fast_atoi( const char * str )
  function which_int (line 27) | int which_int(IntegerVector x, int target) {
  function List (line 37) | List reach_to_dendrogram(const Rcpp::List reachability, const NumericVec...
  function DFS (line 95) | int DFS(List d, List& rp, int pnode, NumericVector stack) {
  function List (line 131) | List dendrogram_to_reach(const Rcpp::List x) {
  function List (line 142) | List mst_to_dendrogram(const NumericMatrix mst) {

FILE: src/density.cpp
  function IntegerVector (line 20) | IntegerVector dbscan_density_int(

FILE: src/frNN.cpp
  function List (line 19) | List frNN_int(NumericMatrix data, double eps, int type,
  function List (line 93) | List frNN_query_int(NumericMatrix data, NumericMatrix query, double eps,...

FILE: src/hdbscan.cpp
  function List (line 31) | List distToAdjacency(IntegerVector constraints, const int N){
  function List (line 49) | List buildDendrogram(List hcl) {
  function IntegerVector (line 138) | IntegerVector all_children(List hier, int key, bool leaves_only = false){
  function IntegerVector (line 173) | IntegerVector getSalientAssignments(List cl_tree, List cl_hierarchy, std...
  function NumericMatrix (line 199) | NumericMatrix node_xy(List cl_tree, List cl_hierarchy, int cid = 0){
  function List (line 249) | List simplifiedTree(List cl_tree) {
  function List (line 342) | List computeStability(const List hcl, const int minPts, bool compute_glo...
  function List (line 489) | List validateConstraintList(List& constraints, int n){
  function computeVirtualNode (line 601) | double computeVirtualNode(IntegerVector noise, List constraints){
  function NumericVector (line 637) | NumericVector fosc(List cl_tree, std::string cid, std::list<int>& sc, Li...
  function List (line 762) | List extractUnsupervised(List cl_tree, bool prune_unstable = false, doub...
  function List (line 776) | List extractSemiSupervised(List cl_tree, List constraints, float alpha =...

FILE: src/kNN.cpp
  function List (line 16) | List kNN_int(NumericMatrix data, int k,
  function List (line 84) | List kNN_query_int(NumericMatrix data, NumericMatrix query, int k,

FILE: src/lof.cpp
  function List (line 21) | List lof_kNN(NumericMatrix data, int minPts,

FILE: src/mrd.cpp
  function NumericVector (line 29) | NumericVector mrd(NumericVector dm, NumericVector cd) {

FILE: src/mst.cpp
  function mst (line 37) | Rcpp::NumericMatrix mst(const NumericVector x_dist, const R_xlen_t n) {
  function visit (line 99) | void visit(const IntegerMatrix& merge, IntegerVector& order, int i, int ...
  function IntegerVector (line 110) | IntegerVector extractOrder(IntegerMatrix merge){
  function List (line 119) | List hclustMergeOrder(NumericMatrix mst, IntegerVector o){

FILE: src/optics.cpp
  function update (line 18) | void update(
  function List (line 60) | List optics_int(NumericMatrix data, double eps, int minPts,

FILE: src/regionQuery.cpp
  function nn (line 20) | nn regionQueryDist(int id, ANNpointArray dataPts, ANNpointSet* kdTree,
  function regionQuery (line 32) | std::vector<int> regionQuery(int id, ANNpointArray dataPts, ANNpointSet*...
  function nn (line 46) | nn regionQueryDist_point(ANNpoint queryPt, ANNpointArray dataPts,
  function regionQuery_point (line 57) | std::vector<int> regionQuery_point(ANNpoint queryPt, ANNpointArray dataPts,

FILE: src/regionQuery.h
  type std (line 20) | typedef std::pair< std::vector<int>, std::vector<double> > nn ;

FILE: src/utilities.cpp
  function IntegerVector (line 15) | IntegerVector lowerTri(IntegerMatrix m) {
  function NumericVector (line 26) | NumericVector combine(const NumericVector& t1, const NumericVector& t2) {
  function IntegerVector (line 34) | IntegerVector combine(const IntegerVector& t1, const IntegerVector& t2) {
  function IntegerVector (line 44) | IntegerVector concat_int(List const& container) {
Condensed preview — 154 files, each showing path, character count, and a content snippet (1,042K chars of structured content in full).
[
  {
    "path": ".Rbuildignore",
    "chars": 134,
    "preview": "proj$\n^\\.Rproj\\.user$\n^cran-comments\\.md$\n^appveyor\\.yml$\n^revdep$\n^.*\\.o$\n^.*\\.Rproj$\n^LICENSE\nREADME.Rmd\ndata_src\nigno"
  },
  {
    "path": ".github/.gitignore",
    "chars": 7,
    "preview": "*.html\n"
  },
  {
    "path": ".gitignore",
    "chars": 295,
    "preview": "# Generated files \n*.o\n*.so\n\n# History files\n.Rhistory\n.Rapp.history\n.RData\n*.Rcheck\n\n\n# Example code in package build p"
  },
  {
    "path": "DESCRIPTION",
    "chars": 1964,
    "preview": "Package: dbscan\nTitle: Density-Based Spatial Clustering of Applications with Noise\n    (DBSCAN) and Related Algorithms\nV"
  },
  {
    "path": "LICENSE",
    "chars": 35142,
    "preview": "                    GNU GENERAL PUBLIC LICENSE\n                       Version 3, 29 June 2007\n\n Copyright (C) 2007 Free "
  },
  {
    "path": "NAMESPACE",
    "chars": 2497,
    "preview": "# Generated by roxygen2: do not edit by hand\n\nS3method(adjacencylist,NN)\nS3method(adjacencylist,frNN)\nS3method(adjacency"
  },
  {
    "path": "NEWS.md",
    "chars": 8718,
    "preview": "# dbscan 1.2.4 (2025-12-18)\n\n## Bugfixes\n* dbscan now checks for matrices with 0 rows or 0 columns\n  (reported by maldri"
  },
  {
    "path": "R/AAA_dbscan-package.R",
    "chars": 725,
    "preview": "#' @keywords internal\n#'\n#' @section Key functions:\n#' - Clustering: [dbscan()], [hdbscan()], [optics()], [jpclust()], ["
  },
  {
    "path": "R/AAA_definitions.R",
    "chars": 1234,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/DBCV_datasets.R",
    "chars": 2007,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/DS3.R",
    "chars": 1843,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/GLOSH.R",
    "chars": 4785,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/LOF.R",
    "chars": 6682,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/NN.R",
    "chars": 3858,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/RcppExports.R",
    "chars": 4501,
    "preview": "# Generated by using Rcpp::compileAttributes() -> do not edit by hand\n# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD"
  },
  {
    "path": "R/broom-dbscan-tidiers.R",
    "chars": 5727,
    "preview": "#' Turn an dbscan clustering object into a tidy tibble\n#'\n#' Provides [tidy()][generics::tidy()], [augment()][generics::"
  },
  {
    "path": "R/comps.R",
    "chars": 3217,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/dbcv.R",
    "chars": 9936,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/dbscan.R",
    "chars": 15243,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/dendrogram.R",
    "chars": 4689,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/extractFOSC.R",
    "chars": 15162,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/frNN.R",
    "chars": 9059,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/hdbscan.R",
    "chars": 18803,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/hullplot.R",
    "chars": 6113,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/jpclust.R",
    "chars": 4713,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/kNN.R",
    "chars": 10237,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/kNNdist.R",
    "chars": 3824,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/moons.R",
    "chars": 2143,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/ncluster.R",
    "chars": 2559,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/nobs.R",
    "chars": 256,
    "preview": "\n#' @importFrom stats nobs\n#' @export\nnobs.dbscan <- function(object, ...) length(object$cluster)\n\n#' @export\nnobs.hdbsc"
  },
  {
    "path": "R/optics.R",
    "chars": 23118,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/pointdensity.R",
    "chars": 5516,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/predict.R",
    "chars": 3743,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/reachability.R",
    "chars": 7535,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/sNN.R",
    "chars": 6622,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/sNNclust.R",
    "chars": 4319,
    "preview": "#######################################################################\n# dbscan - Density Based Clustering of Applicati"
  },
  {
    "path": "R/utils.R",
    "chars": 56,
    "preview": "`%||%` <- function(x, y) {\n  if (is.null(x)) y else x\n}\n"
  },
  {
    "path": "R/zzz.R",
    "chars": 154,
    "preview": "# ANN uses a global KD_TRIVIAL structure which needs to be removed.\n.onUnload <- function(libpath) {\n  ANN_cleanup()\n  #"
  },
  {
    "path": "README.Rmd",
    "chars": 5813,
    "preview": "---\noutput: github_document\nbibliography: vignettes/dbscan.bib\nlink-citations: yes\n---\n\n```{r echo=FALSE, results = 'asi"
  },
  {
    "path": "README.md",
    "chars": 15448,
    "preview": "\n# <img src=\"man/figures/logo.svg\" align=\"right\" height=\"139\" /> R package dbscan - Density-Based Spatial Clustering of "
  },
  {
    "path": "data_src/data_DBCV/dataset_1.txt",
    "chars": 16873,
    "preview": "-0.0014755 0.99852 1\n-0.005943 0.98904 1\n0.028184 1.0181 1\n0.019204 1.0041 1\n0.033017 1.0128 1\n0.011014 0.9857 1\n0.03377"
  },
  {
    "path": "data_src/data_DBCV/dataset_2.txt",
    "chars": 29524,
    "preview": "191.67 388.02 1\n186.28 383.39 1\n182.22 397.99 1\n194.54 394.76 1\n183.43 393.87 1\n184.23 388.09 1\n192.33 389.85 1\n190.66 3"
  },
  {
    "path": "data_src/data_DBCV/dataset_3.txt",
    "chars": 26077,
    "preview": "-6.1698 2.2449 1\n-2.6453 6.9494 1\n-4.9691 4.9966 1\n-3.0064 6.868 1\n-4.3216 -3.2774 1\n-0.45173 -4.763 1\n2.2952 1.3859 1\n-"
  },
  {
    "path": "data_src/data_DBCV/dataset_4.txt",
    "chars": 32603,
    "preview": "340.080593000166 401.306241000071 1\r\n333.985499000177 395.070042999927 1\r\n335.612031000201 392.773647000082 1\r\n345.09286"
  },
  {
    "path": "data_src/data_DBCV/read_data.R",
    "chars": 940,
    "preview": "library(dbscan)\n\n\nx <- read.table(\"Work/data_DBCV/dataset_1.txt\")\ncolnames(x) <- c(\"x\", \"y\", \"class\")\n\ncl <- x[, 3]\ncl[c"
  },
  {
    "path": "data_src/data_DBCV/test_DBCV.R",
    "chars": 1435,
    "preview": "# From: https://github.com/FelSiq/DBCV\n#\n# Dataset\tPython (Scipy's Kruskal's)\tPython (Translated MST algorithm)\tMATLAB\n#"
  },
  {
    "path": "data_src/data_chameleon/read.R",
    "chars": 526,
    "preview": "# Source: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download\n\nchameleon_ds4 <- read.table(\"t4.8k.dat\")\nchameleon_ds5 "
  },
  {
    "path": "dbscan.Rproj",
    "chars": 545,
    "preview": "Version: 1.0\nProjectId: 6c2ba941-cfaa-4faa-ba72-88eeef0391b8\n\nRestoreWorkspace: Default\nSaveWorkspace: Default\nAlwaysSav"
  },
  {
    "path": "inst/CITATION",
    "chars": 837,
    "preview": "citation(auto = meta)\r\n\r\nbibentry(bibtype = \"Article\",\r\n  title        = \"{dbscan}: Fast Density-Based Clustering with {"
  },
  {
    "path": "man/DBCV_datasets.Rd",
    "chars": 1193,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/DBCV_datasets.R\n\\docType{data}\n\\name{DBCV_"
  },
  {
    "path": "man/DS3.Rd",
    "chars": 1002,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/DS3.R\n\\docType{data}\n\\name{DS3}\n\\alias{DS3"
  },
  {
    "path": "man/NN.Rd",
    "chars": 1895,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/NN.R\n\\name{NN}\n\\alias{NN}\n\\alias{adjacency"
  },
  {
    "path": "man/comps.Rd",
    "chars": 2190,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/comps.R\n\\name{comps}\n\\alias{comps}\n\\alias{"
  },
  {
    "path": "man/dbcv.Rd",
    "chars": 3750,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/dbcv.R\n\\name{dbcv}\n\\alias{dbcv}\n\\alias{DBC"
  },
  {
    "path": "man/dbscan-package.Rd",
    "chars": 2202,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/AAA_dbscan-package.R\n\\docType{package}\n\\na"
  },
  {
    "path": "man/dbscan.Rd",
    "chars": 10880,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/dbscan.R, R/predict.R\n\\name{dbscan}\n\\alias"
  },
  {
    "path": "man/dbscan_tidiers.Rd",
    "chars": 2831,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/broom-dbscan-tidiers.R\n\\name{dbscan_tidier"
  },
  {
    "path": "man/dendrogram.Rd",
    "chars": 1301,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/dendrogram.R\n\\name{dendrogram}\n\\alias{dend"
  },
  {
    "path": "man/extractFOSC.Rd",
    "chars": 9622,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/extractFOSC.R\n\\name{extractFOSC}\n\\alias{ex"
  },
  {
    "path": "man/frNN.Rd",
    "chars": 3802,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/frNN.R\n\\name{frNN}\n\\alias{frNN}\n\\alias{frn"
  },
  {
    "path": "man/glosh.Rd",
    "chars": 3029,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/GLOSH.R\n\\name{glosh}\n\\alias{glosh}\n\\alias{"
  },
  {
    "path": "man/hdbscan.Rd",
    "chars": 9439,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/hdbscan.R, R/predict.R\n\\name{hdbscan}\n\\ali"
  },
  {
    "path": "man/hullplot.Rd",
    "chars": 2801,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/hullplot.R\n\\name{hullplot}\n\\alias{hullplot"
  },
  {
    "path": "man/jpclust.Rd",
    "chars": 2946,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/jpclust.R\n\\name{jpclust}\n\\alias{jpclust}\n\\"
  },
  {
    "path": "man/kNN.Rd",
    "chars": 4406,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/kNN.R\n\\name{kNN}\n\\alias{kNN}\n\\alias{knn}\n\\"
  },
  {
    "path": "man/kNNdist.Rd",
    "chars": 2657,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/kNNdist.R\n\\name{kNNdist}\n\\alias{kNNdist}\n\\"
  },
  {
    "path": "man/lof.Rd",
    "chars": 3023,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/LOF.R\n\\name{lof}\n\\alias{lof}\n\\alias{LOF}\n\\"
  },
  {
    "path": "man/moons.Rd",
    "chars": 1320,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/moons.R\n\\docType{data}\n\\name{moons}\n\\alias"
  },
  {
    "path": "man/ncluster.Rd",
    "chars": 1311,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/ncluster.R\n\\name{ncluster}\n\\alias{ncluster"
  },
  {
    "path": "man/optics.Rd",
    "chars": 7879,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/optics.R, R/predict.R\n\\name{optics}\n\\alias"
  },
  {
    "path": "man/pointdensity.Rd",
    "chars": 3374,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/pointdensity.R\n\\name{pointdensity}\n\\alias{"
  },
  {
    "path": "man/reachability.Rd",
    "chars": 5390,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/reachability.R\n\\name{reachability}\n\\alias{"
  },
  {
    "path": "man/sNN.Rd",
    "chars": 4078,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/sNN.R\n\\name{sNN}\n\\alias{sNN}\n\\alias{snn}\n\\"
  },
  {
    "path": "man/sNNclust.Rd",
    "chars": 3160,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/sNNclust.R\n\\name{sNNclust}\n\\alias{sNNclust"
  },
  {
    "path": "src/ANN/ANN.cpp",
    "chars": 6552,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tANN.cpp\n// Programmer:\t\tSunil Arya a"
  },
  {
    "path": "src/ANN/ANN.h",
    "chars": 35460,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tANN.h\n// Programmer:\t\tSunil Arya and"
  },
  {
    "path": "src/ANN/ANNperf.h",
    "chars": 8202,
    "preview": "//----------------------------------------------------------------------\n//\tFile:\t\t\tANNperf.h\n//\tProgrammer:\t\tSunil Arya"
  },
  {
    "path": "src/ANN/ANNx.h",
    "chars": 6234,
    "preview": "//----------------------------------------------------------------------\n//\tFile:\t\t\tANNx.h\n//\tProgrammer: \tSunil Arya an"
  },
  {
    "path": "src/ANN/Copyright.txt",
    "chars": 1580,
    "preview": "ANN: Approximate Nearest Neighbors\nVersion: 1.1\nRelease Date: May 3, 2005\n----------------------------------------------"
  },
  {
    "path": "src/ANN/License.txt",
    "chars": 24564,
    "preview": "----------------------------------------------------------------------\nThe ANN Library (all versions) is provided under "
  },
  {
    "path": "src/ANN/ReadMe.txt",
    "chars": 2354,
    "preview": "ANN: Approximate Nearest Neighbors\nVersion: 1.1\nRelease date: May 3, 2005\n----------------------------------------------"
  },
  {
    "path": "src/ANN/bd_fix_rad_search.cpp",
    "chars": 2705,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tbd_fix_rad_search.cpp\n// Programmer:"
  },
  {
    "path": "src/ANN/bd_pr_search.cpp",
    "chars": 2731,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tbd_pr_search.cpp\n// Programmer:\t\tDav"
  },
  {
    "path": "src/ANN/bd_search.cpp",
    "chars": 2668,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tbd_search.cpp\n// Programmer:\t\tDavid "
  },
  {
    "path": "src/ANN/bd_tree.cpp",
    "chars": 16213,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tbd_tree.cpp\n// Programmer:\t\tDavid Mo"
  },
  {
    "path": "src/ANN/bd_tree.h",
    "chars": 3957,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tbd_tree.h\n// Programmer:\t\tDavid Moun"
  },
  {
    "path": "src/ANN/brute.cpp",
    "chars": 4869,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tbrute.cpp\n// Programmer:\t\tSunil Arya"
  },
  {
    "path": "src/ANN/kd_dump.cpp",
    "chars": 16918,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_dump.cc\n// Programmer:\t\tDavid Mou"
  },
  {
    "path": "src/ANN/kd_fix_rad_search.cpp",
    "chars": 8590,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_fix_rad_search.cpp\n// Programmer:"
  },
  {
    "path": "src/ANN/kd_fix_rad_search.h",
    "chars": 1792,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_fix_rad_search.h\n// Programmer:\t\t"
  },
  {
    "path": "src/ANN/kd_pr_search.cpp",
    "chars": 8855,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_pr_search.cpp\n// Programmer:\t\tSun"
  },
  {
    "path": "src/ANN/kd_pr_search.h",
    "chars": 2021,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_pr_search.h\n// Programmer:\t\tSunil"
  },
  {
    "path": "src/ANN/kd_search.cpp",
    "chars": 8608,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_search.cpp\n// Programmer:\t\tSunil "
  },
  {
    "path": "src/ANN/kd_search.h",
    "chars": 2043,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_search.h\n// Programmer:\t\tSunil Ar"
  },
  {
    "path": "src/ANN/kd_split.cpp",
    "chars": 16876,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_split.cpp\n// Programmer:\t\tSunil A"
  },
  {
    "path": "src/ANN/kd_split.h",
    "chars": 3713,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_split.h\n// Programmer:\t\tSunil Ary"
  },
  {
    "path": "src/ANN/kd_tree.cpp",
    "chars": 15275,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_tree.cpp\n// Programmer:\t\tSunil Ar"
  },
  {
    "path": "src/ANN/kd_tree.h",
    "chars": 7980,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_tree.h\n// Programmer:\t\tSunil Arya"
  },
  {
    "path": "src/ANN/kd_util.cpp",
    "chars": 15022,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_util.cpp\n// Programmer:\t\tSunil Ar"
  },
  {
    "path": "src/ANN/kd_util.h",
    "chars": 4751,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tkd_util.h\n// Programmer:\t\tSunil Arya"
  },
  {
    "path": "src/ANN/perf.cpp",
    "chars": 5476,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tperf.cpp\n// Programmer:\t\tSunil Arya "
  },
  {
    "path": "src/ANN/pr_queue.h",
    "chars": 4440,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tpr_queue.h\n// Programmer:\t\tSunil Ary"
  },
  {
    "path": "src/ANN/pr_queue_k.h",
    "chars": 4367,
    "preview": "//----------------------------------------------------------------------\n// File:\t\t\tpr_queue_k.h\n// Programmer:\t\tSunil A"
  },
  {
    "path": "src/JP.cpp",
    "chars": 3715,
    "preview": "//----------------------------------------------------------------------\n//                  Jarvis-Patrick Clustering\n/"
  },
  {
    "path": "src/Makevars",
    "chars": 569,
    "preview": "# CXX_STD = CXX11\n\nSOURCES = \\\n\tANN/perf.cpp ANN/bd_fix_rad_search.cpp ANN/bd_search.cpp \\\n\tANN/kd_split.cpp ANN/kd_pr_s"
  },
  {
    "path": "src/RcppExports.cpp",
    "chars": 25331,
    "preview": "// Generated by using Rcpp::compileAttributes() -> do not edit by hand\n// Generator token: 10BE3573-1514-4C36-9D1C-5A225"
  },
  {
    "path": "src/UnionFind.cpp",
    "chars": 1389,
    "preview": "//----------------------------------------------------------------------\n//                        Disjoint-set data str"
  },
  {
    "path": "src/UnionFind.h",
    "chars": 923,
    "preview": "//----------------------------------------------------------------------\n//                        Disjoint-set data str"
  },
  {
    "path": "src/cleanup.cpp",
    "chars": 547,
    "preview": "//----------------------------------------------------------------------\n//              R interface to dbscan using the"
  },
  {
    "path": "src/connectedComps.cpp",
    "chars": 3555,
    "preview": "//----------------------------------------------------------------------\n//              R interface to dbscan using the"
  },
  {
    "path": "src/dbcv.cpp",
    "chars": 12077,
    "preview": "//----------------------------------------------------------------------\n//                                DBSCAN\n// Fil"
  },
  {
    "path": "src/dbscan.cpp",
    "chars": 4609,
    "preview": "//----------------------------------------------------------------------\n//                                DBSCAN\n// Fil"
  },
  {
    "path": "src/dendrogram.cpp",
    "chars": 7032,
    "preview": "//----------------------------------------------------------------------\n//              R interface to dbscan using the"
  },
  {
    "path": "src/density.cpp",
    "chars": 1874,
    "preview": "//----------------------------------------------------------------------\n//                                DBSCAN densit"
  },
  {
    "path": "src/frNN.cpp",
    "chars": 4994,
    "preview": "//----------------------------------------------------------------------\n//                   Fixed Radius Nearest Neigh"
  },
  {
    "path": "src/hdbscan.cpp",
    "chars": 35975,
    "preview": "//----------------------------------------------------------------------\n//              R interface to dbscan using the"
  },
  {
    "path": "src/kNN.cpp",
    "chars": 4119,
    "preview": "//----------------------------------------------------------------------\n//                  Find the k Nearest Neighbor"
  },
  {
    "path": "src/kNN.h",
    "chars": 252,
    "preview": "#ifndef KNN_H\n#define KNN_H\n\n#include <Rcpp.h>\n#include \"ANN/ANN.h\"\n\nusing namespace Rcpp;\n\n// returns knn + dist\n// [[R"
  },
  {
    "path": "src/lof.cpp",
    "chars": 3232,
    "preview": "//----------------------------------------------------------------------\n//                  Find the Neighbourhood for "
  },
  {
    "path": "src/lt.h",
    "chars": 877,
    "preview": "#ifndef LT\n#define LT\n\n/* LT_POS to access a lower triangle matrix by C. Buchta\n * modified by M. Hahsler\n * n ... numbe"
  },
  {
    "path": "src/mrd.cpp",
    "chars": 1625,
    "preview": "//----------------------------------------------------------------------\n//              R interface to dbscan using the"
  },
  {
    "path": "src/mst.cpp",
    "chars": 4806,
    "preview": "//----------------------------------------------------------------------\n//              R interface to dbscan using the"
  },
  {
    "path": "src/mst.h",
    "chars": 298,
    "preview": "#ifndef MST_H\n#define MST_H\n\n#include <Rcpp.h>\n#include \"lt.h\"\n\nusing namespace Rcpp;\n\n// Functions to compute MST and b"
  },
  {
    "path": "src/optics.cpp",
    "chars": 5891,
    "preview": "//----------------------------------------------------------------------\n//                                OPTICS\n// Fil"
  },
  {
    "path": "src/regionQuery.cpp",
    "chars": 2097,
    "preview": "//----------------------------------------------------------------------\n//                              Region Query\n//"
  },
  {
    "path": "src/regionQuery.h",
    "chars": 1314,
    "preview": "//----------------------------------------------------------------------\n//                              Region Query\n//"
  },
  {
    "path": "src/utilities.cpp",
    "chars": 2004,
    "preview": "//----------------------------------------------------------------------\n//              R interface to dbscan using the"
  },
  {
    "path": "src/utilities.h",
    "chars": 1225,
    "preview": "//----------------------------------------------------------------------\n//              R interface to dbscan using the"
  },
  {
    "path": "tests/testthat/test-dbcv.R",
    "chars": 1417,
    "preview": "test_that(\"dbcv\", {\n  # From: https://github.com/FelSiq/DBCV\n  #\n  # Dataset\t      MATLAB\n  # dataset_1.txt\t0.8576\n  # d"
  },
  {
    "path": "tests/testthat/test-dbscan.R",
    "chars": 2355,
    "preview": "test_that(\"dbscan works\", {\n  data(\"iris\")\n  ## Species is a factor\n  expect_error(dbscan(iris))\n\n  iris <- as.matrix(ir"
  },
  {
    "path": "tests/testthat/test-fosc.R",
    "chars": 3362,
    "preview": "test_that(\"FOSC\", {\n  data(\"iris\")\n\n  ## FOSC expects an hclust object\n  expect_error(extractFOSC(iris))\n\n  x <- iris[, "
  },
  {
    "path": "tests/testthat/test-frNN.R",
    "chars": 2758,
    "preview": "test_that(\"frNN\", {\n  set.seed(665544)\n  n <- 1000\n  x <- cbind(\n    x = runif(10, 0, 10) + rnorm(n, sd = 0.2),\n    y = "
  },
  {
    "path": "tests/testthat/test-hdbscan.R",
    "chars": 3996,
    "preview": "test_that(\"HDBSCAN\", {\n  data(\"iris\")\n\n  ## minPts not given\n  expect_error(hdbscan(iris))\n\n  ## Expects numerical data;"
  },
  {
    "path": "tests/testthat/test-kNN.R",
    "chars": 3780,
    "preview": "test_that(\"kNN\", {\n  set.seed(665544)\n  n <- 1000\n  x <- cbind(\n    x = runif(10, 0, 10) + rnorm(n, sd = 0.2),\n    y = r"
  },
  {
    "path": "tests/testthat/test-kNNdist.R",
    "chars": 379,
    "preview": "test_that(\"kNNdist\", {\n  set.seed(665544)\n  n <- 1000\n  x <- cbind(\n    x = runif(10, 0, 10) + rnorm(n, sd = 0.2),\n    y"
  },
  {
    "path": "tests/testthat/test-lof.R",
    "chars": 8343,
    "preview": "test_that(\"LOF\", {\n  set.seed(665544)\n  n <- 600\n  x <- cbind(\n    x=runif(10, 0, 5) + rnorm(n, sd=0.4),\n    y=runif(10,"
  },
  {
    "path": "tests/testthat/test-mst.R",
    "chars": 1704,
    "preview": "test_that(\"mst\", {\n  draw_mst <- function(x, m) {\n    plot(x)\n    text(x, labels = 1:nrow(x), pos = 1)\n    for (i in seq"
  },
  {
    "path": "tests/testthat/test-optics.R",
    "chars": 3495,
    "preview": "test_that(\"OPTICS\", {\n  load(test_path(\"fixtures\", \"test_data.rda\"))\n  load(test_path(\"fixtures\", \"elki_optics.rda\"))\n\n "
  },
  {
    "path": "tests/testthat/test-opticsXi.R",
    "chars": 530,
    "preview": "test_that(\"OPTICS-XI\", {\n  load(test_path(\"fixtures\", \"test_data.rda\"))\n  load(test_path(\"fixtures\", \"elki_optics.rda\"))"
  },
  {
    "path": "tests/testthat/test-predict.R",
    "chars": 1666,
    "preview": "test_that(\"predict\", {\n  set.seed(3)\n  n <- 100\n  x_data <- cbind(\n    x = runif(5, 0, 10) + rnorm(n, sd = 0.2),\n    y ="
  },
  {
    "path": "tests/testthat/test-sNN.R",
    "chars": 2152,
    "preview": "test_that(\"sNN\", {\n  set.seed(665544)\n  n <- 1000\n  x <- cbind(\n    x = runif(10, 0, 10) + rnorm(n, sd = 0.2),\n    y = r"
  },
  {
    "path": "tests/testthat.R",
    "chars": 56,
    "preview": "library(testthat)\nlibrary(dbscan)\n\ntest_check(\"dbscan\")\n"
  },
  {
    "path": "vignettes/dbscan.Rnw",
    "chars": 67796,
    "preview": "% !Rnw weave = Sweave\n\\documentclass[nojss]{jss}\n\n% Package includes\n\\usepackage[utf8]{inputenc}\n\\usepackage[english]{ba"
  },
  {
    "path": "vignettes/dbscan.bib",
    "chars": 32853,
    "preview": "@Article{hahsler2019dbscan,\n    title = {{dbscan}: Fast Density-Based Clustering with {R}},\n    author = {Michael Hahsle"
  },
  {
    "path": "vignettes/hdbscan.Rmd",
    "chars": 10155,
    "preview": "---\ntitle: \"HDBSCAN with the dbscan package\"\nauthor: \"Matt Piekenbrock, Michael Hahsler\"\nvignette: >\n  %\\VignetteIndexEn"
  }
]

// ... and 9 more files (not shown in this preview)

About this extraction

This page contains the full source code of the mhahsler/dbscan GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction covers 154 files (962.1 KB, approximately 309.7k tokens) and includes a symbol index of 185 extracted functions, classes, methods, constants, and types. The output can be used with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input.

Extracted by GitExtract, a free GitHub-repo-to-text converter for AI, built by Nikandr Surkov.
