Repository: ropensci/textreuse
Branch: master
Commit: 6f8cbe380295
Files: 102
Total size: 3.0 MB

Directory structure:
gitextract_vbmxaw27/

├── .Rbuildignore
├── .gitignore
├── .travis.yml
├── CONDUCT.md
├── DESCRIPTION
├── LICENSE
├── Makefile
├── NAMESPACE
├── NEWS.md
├── R/
│   ├── RcppExports.R
│   ├── TextReuseCorpus.R
│   ├── TextReuseTextDocument.R
│   ├── align_local.R
│   ├── conversion-functions.R
│   ├── filenames.R
│   ├── lsh.R
│   ├── lsh_candidates.R
│   ├── lsh_compare.R
│   ├── lsh_probability.R
│   ├── lsh_query.R
│   ├── lsh_subset.R
│   ├── minhash.R
│   ├── pairwise_candidates.R
│   ├── pairwise_compare.R
│   ├── parallel.R
│   ├── rehash.R
│   ├── similarity.R
│   ├── textreuse-package.r
│   ├── token_index.R
│   ├── tokenize.R
│   ├── tokenizers.R
│   ├── utils.R
│   └── wordcount.R
├── README.Rmd
├── README.md
├── _pkgdown.yml
├── appveyor.yml
├── cran-comments.md
├── inst/
│   └── extdata/
│       ├── ats/
│       │   ├── calltounconv00baxt.txt
│       │   ├── gospeltruth00whit.txt
│       │   ├── lifeofrevrichard00baxt.txt
│       │   ├── memoirjamesbrai00ricegoog.txt
│       │   ├── practicalthought00nev.txt
│       │   ├── remember00palm.txt
│       │   ├── remembermeorholy00palm.txt
│       │   └── thoughtsonpopery00nevi.txt
│       └── legal/
│           ├── ca1851-match.txt
│           ├── ca1851-nomatch.txt
│           └── ny1850-match.txt
├── man/
│   ├── TextReuseCorpus.Rd
│   ├── TextReuseTextDocument-accessors.Rd
│   ├── TextReuseTextDocument.Rd
│   ├── align_local.Rd
│   ├── as.matrix.textreuse_candidates.Rd
│   ├── filenames.Rd
│   ├── hash_string.Rd
│   ├── lsh.Rd
│   ├── lsh_add.Rd
│   ├── lsh_candidates.Rd
│   ├── lsh_compare.Rd
│   ├── lsh_probability.Rd
│   ├── lsh_query.Rd
│   ├── lsh_subset.Rd
│   ├── minhash_generator.Rd
│   ├── pairwise_candidates.Rd
│   ├── pairwise_compare.Rd
│   ├── reexports.Rd
│   ├── rehash.Rd
│   ├── similarity-functions.Rd
│   ├── textreuse-package.Rd
│   ├── token_index.Rd
│   ├── token_index_candidates.Rd
│   ├── tokenize.Rd
│   ├── tokenizers.Rd
│   └── wordcount.Rd
├── pkgdown/
│   └── extra.css
├── src/
│   ├── RcppExports.cpp
│   ├── hash_string.cpp
│   ├── shingle_ngrams.cpp
│   ├── skip_ngrams.cpp
│   └── sw_matrix.cpp
├── tests/
│   ├── testthat/
│   │   ├── newman.txt
│   │   ├── test-TextReuseCorpus.R
│   │   ├── test-TextReuseTextDocument.R
│   │   ├── test-alignment.R
│   │   ├── test-candidate_pairs.R
│   │   ├── test-filenames.R
│   │   ├── test-hashing.R
│   │   ├── test-jaccard.R
│   │   ├── test-lsh.R
│   │   ├── test-minhash.R
│   │   ├── test-pairwise_cf.R
│   │   ├── test-ratio_of_matches.R
│   │   ├── test-token_index.R
│   │   ├── test-tokenizers.R
│   │   ├── test-utils.R
│   │   └── test-wordcount.R
│   └── testthat.R
└── vignettes/
    ├── textreuse-alignment.Rmd
    ├── textreuse-introduction.Rmd
    ├── textreuse-minhash.Rmd
    └── textreuse-pairwise.Rmd

================================================
FILE CONTENTS
================================================

================================================
FILE: .Rbuildignore
================================================
^.*\.Rproj$
^\.Rproj\.user$
^\.git$
^\.r-lib$
^README\.Rmd$
^README-*\.png$
^data-raw$
^\.travis\.yml$
wordnet
^appveyor\.yml$
^CONDUCT\.md$
^cran-comments\.md$
^Makefile$
^_pkgdown\.yml$
^pkgdown$
docs/


================================================
FILE: .gitignore
================================================
.Rproj
*.Rproj
.Rproj.user
.Rhistory
.RData
.Ruserdata
src/*.o
src/*.so
src/*.dll


================================================
FILE: .travis.yml
================================================
language: r
r:
  - oldrel
  - release
  - devel
sudo: false
cache: packages

after_success:
  - Rscript -e 'covr::codecov()'

notifications:
  email:
    on_success: change
    on_failure: change
  slack:
    secure: gxP5b9VO52sKP72YB1iFwt5U73s6O1nq9o1vH6ddrvEIRgpzSQO7lIH8/KYfjj+eFRXCIWtFnrkar2kw2sfGJVERnJ9R13XtVDc23tApkZjacTxHUov39WbS4zI03Tb9pX86ywUNcs0rhVKok3CD9V80fybd3nFy8Vy/ugSBp7s=



================================================
FILE: CONDUCT.md
================================================
# Contributor Code of Conduct

As contributors and maintainers of this project, we pledge to respect all people who 
contribute through reporting issues, posting feature requests, updating documentation,
submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free experience for
everyone, regardless of level of experience, gender, gender identity and expression,
sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.

Examples of unacceptable behavior by participants include the use of sexual language or
imagery, derogatory comments or personal attacks, trolling, public or private harassment,
insults, or other unprofessional conduct.

Project maintainers have the right and responsibility to remove, edit, or reject comments,
commits, code, wiki edits, issues, and other contributions that are not aligned to this 
Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed 
from the project team.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by 
opening an issue or contacting one or more of the project maintainers.

This Code of Conduct is adapted from the Contributor Covenant 
(http://contributor-covenant.org), version 1.0.0, available at
http://contributor-covenant.org/version/1/0/0/


================================================
FILE: DESCRIPTION
================================================
Package: textreuse
Type: Package
Title: Detect Text Reuse and Document Similarity
Version: 1.0.1
Date: 2026-05-06
Authors@R: c(
    person("Lincoln", "Mullen", role = "aut",
        comment = c(ORCID = "0000-0001-5103-6917")
    ),
    person("Yaoxiang", "Li", role = c("aut", "cre"),
        email = "liyaoxiang@outlook.com",
        comment = c(ORCID = "0000-0001-9200-1016")))
Description: Tools for measuring similarity among documents and detecting
    passages which have been reused. Implements shingled n-gram, skip n-gram,
    and other tokenizers; similarity/dissimilarity functions; pairwise
    comparisons; minhash and locality sensitive hashing algorithms; and a
    version of the Smith-Waterman local alignment algorithm suitable for
    natural language.
License: MIT + file LICENSE
LazyData: TRUE
URL: https://docs.ropensci.org/textreuse/,
    https://github.com/ropensci/textreuse
BugReports: https://github.com/ropensci/textreuse/issues
VignetteBuilder: knitr
Depends:
    R (>= 3.1.1)
Imports:
    assertthat (>= 0.1),
    digest (>= 0.6.8),
    dplyr (>= 0.8.0),
    NLP (>= 0.1.8),
    Matrix,
    Rcpp (>= 0.12.0),
    RcppProgress (>= 0.1),
    stringr (>= 1.0.0),
    tibble (>= 3.0.1),
    tidyr (>= 1.0.0)
Suggests:
    testthat (>= 0.11.0),
    knitr (>= 1.11),
    rmarkdown (>= 0.8),
    covr
LinkingTo: BH, Rcpp, RcppProgress
RoxygenNote: 7.3.2
Encoding: UTF-8


================================================
FILE: LICENSE
================================================
YEAR: 2026
COPYRIGHT HOLDER: Yaoxiang Li and Lincoln Mullen


================================================
FILE: Makefile
================================================
.PHONY : docs deploy-docs

docs :
	Rscript -e "pkgdown::clean_site(); pkgdown::build_site(run_dont_run = TRUE)"

deploy-docs :
	@echo "Documentation is published by rOpenSci at https://docs.ropensci.org/textreuse/"



================================================
FILE: NAMESPACE
================================================
# Generated by roxygen2: do not edit by hand

S3method("[",TextReuseCorpus)
S3method("[[",TextReuseCorpus)
S3method("content<-",TextReuseTextDocument)
S3method("hashes<-",TextReuseTextDocument)
S3method("meta<-",TextReuseCorpus)
S3method("meta<-",TextReuseTextDocument)
S3method("minhashes<-",TextReuseTextDocument)
S3method("names<-",TextReuseCorpus)
S3method("tokens<-",TextReuseTextDocument)
S3method(align_local,TextReuseTextDocument)
S3method(align_local,default)
S3method(as.character,TextReuseTextDocument)
S3method(as.matrix,textreuse_candidates)
S3method(content,TextReuseTextDocument)
S3method(count_matches,TextReuseTextDocument)
S3method(count_matches,default)
S3method(hashes,TextReuseCorpus)
S3method(hashes,TextReuseTextDocument)
S3method(jaccard_bag_similarity,TextReuseTextDocument)
S3method(jaccard_bag_similarity,default)
S3method(jaccard_dissimilarity,default)
S3method(jaccard_similarity,TextReuseTextDocument)
S3method(jaccard_similarity,default)
S3method(length,TextReuseCorpus)
S3method(lsh,TextReuseCorpus)
S3method(lsh,TextReuseTextDocument)
S3method(matching_tokens,TextReuseTextDocument)
S3method(matching_tokens,default)
S3method(meta,TextReuseCorpus)
S3method(meta,TextReuseTextDocument)
S3method(minhashes,TextReuseCorpus)
S3method(minhashes,TextReuseTextDocument)
S3method(names,TextReuseCorpus)
S3method(print,TextReuseCorpus)
S3method(print,TextReuseTextDocument)
S3method(print,textreuse_alignment)
S3method(ratio_of_matches,TextReuseTextDocument)
S3method(ratio_of_matches,default)
S3method(rehash,TextReuseCorpus)
S3method(rehash,TextReuseTextDocument)
S3method(tokenize,TextReuseCorpus)
S3method(tokenize,TextReuseTextDocument)
S3method(tokens,TextReuseCorpus)
S3method(tokens,TextReuseTextDocument)
S3method(wordcount,TextDocument)
S3method(wordcount,TextReuseCorpus)
S3method(wordcount,default)
export("content<-")
export("hashes<-")
export("meta<-")
export("minhashes<-")
export("tokens<-")
export(TextReuseCorpus)
export(TextReuseTextDocument)
export(align_local)
export(as_sparse_matrix)
export(content)
export(count_matches)
export(filenames)
export(has_content)
export(has_hashes)
export(has_minhashes)
export(has_tokens)
export(hash_string)
export(hashes)
export(is.TextReuseCorpus)
export(is.TextReuseTextDocument)
export(jaccard_bag_similarity)
export(jaccard_dissimilarity)
export(jaccard_similarity)
export(lsh)
export(lsh_add)
export(lsh_candidates)
export(lsh_compare)
export(lsh_probability)
export(lsh_query)
export(lsh_subset)
export(lsh_threshold)
export(matching_tokens)
export(meta)
export(minhash_generator)
export(minhashes)
export(pairwise_candidates)
export(pairwise_compare)
export(ratio_of_matches)
export(rehash)
export(skipped)
export(token_index)
export(token_index_candidates)
export(tokenize)
export(tokenize_ngrams)
export(tokenize_sentences)
export(tokenize_skip_ngrams)
export(tokenize_words)
export(tokens)
export(wordcount)
import(RcppProgress)
import(assertthat)
import(stringr)
importFrom(NLP,"content<-")
importFrom(NLP,"meta<-")
importFrom(NLP,content)
importFrom(NLP,meta)
importFrom(Rcpp,sourceCpp)
importFrom(utils,getTxtProgressBar)
importFrom(utils,setTxtProgressBar)
importFrom(utils,txtProgressBar)
useDynLib(textreuse, .registration = TRUE)


================================================
FILE: NEWS.md
================================================
# textreuse 1.0.1

This release brings together several years of maintenance and feature work to
make textreuse easier to use on current R installations and more practical for
larger document collections.

This is a CRAN resubmission that fixes a moved README URL reported by CRAN
incoming checks.

## Text input and corpus construction

- `TextReuseTextDocument()` and `TextReuseCorpus()` now accept an `encoding`
  argument, making it easier to read source files whose text encoding is known
  or differs from the platform default.
- `TextReuseCorpus()` now keeps skipped-document bookkeeping deterministic.
  Skipped documents are reported consistently, and skip metadata is available
  even when `skip_short = FALSE`.
- Very short documents are handled more predictably when skip n-grams are used,
  avoiding assertion failures and making corpus construction easier to diagnose.
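
The short-document rule can be illustrated with a small base-R sketch. It follows the arithmetic in `short_document_word_minimum()` from `R/TextReuseTextDocument.R`; the helper names `min_words_sketch()`, `count_words()`, and `is_too_short()` are hypothetical and not part of the package, and the whitespace-based word count is only a stand-in for the package's own `wordcount()`.

```r
# Hypothetical helper: minimum number of words a document needs before it is
# kept. Mirrors the package's internal rule; not an exported function.
min_words_sketch <- function(n = 3, k = NULL) {
  if (!is.null(k)) {
    # Skip n-grams: minimum of n + n * k - k words.
    return(n + n * k - k)
  }
  # Shingled n-grams: two n-grams require at least n + 1 words.
  n + 1
}

# Stand-in word counter that splits on whitespace (an assumption; the
# package's wordcount() may count words differently).
count_words <- function(text) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  length(words[nzchar(words)])
}

# TRUE when a document would be skipped under the sketched rule.
is_too_short <- function(text, n = 3, k = NULL) {
  count_words(text) < min_words_sketch(n, k)
}

min_words_sketch(n = 5)             # 6
min_words_sketch(n = 3, k = 1)      # 5
is_too_short("one two three")       # TRUE
is_too_short("one two three four")  # FALSE
```

With the default `n = 3`, a three-word document is skipped and a four-word document is kept.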

## Alignment and match inspection

- `align_local()` now returns an empty local alignment instead of throwing an
  error when two texts have no matching words. This makes batch alignment
  workflows easier to run because no-match pairs can be represented directly.
- `align_local()` gains `preserve_punctuation`, allowing displayed alignments to
  keep punctuation from the original texts when that context is useful.
- New `count_matches()` and `matching_tokens()` helpers expose absolute match
  counts and the matched tokens themselves, so users can inspect what drove a
  similarity score rather than relying only on a ratio.

## Candidate generation and comparison

- New token-index helpers find candidate document pairs from shared n-grams,
  giving users another way to identify likely reuse pairs before running more
  expensive comparisons.
- `pairwise_candidates()` and matrix conversion now preserve all document IDs,
  including documents without returned candidate pairs.
- `as_sparse_matrix()` provides a sparse matrix representation of candidate
  results, which is more convenient for downstream modeling, graph analysis, and
  workflows with many documents.
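
As a rough illustration of why preserving every document ID matters, the sketch below builds a sparse score matrix from a candidate table with the Matrix package. It is not the implementation of `as_sparse_matrix()`; the column names `a`, `b`, and `score` and the document IDs are assumptions made for the example.

```r
library(Matrix)

# Hypothetical candidate table; the columns `a`, `b`, and `score` are assumed.
candidates <- data.frame(
  a = "doc-1",
  b = "doc-2",
  score = 0.4,
  stringsAsFactors = FALSE
)

# Every document ID, including "doc-3", which has no candidate pair.
ids <- c("doc-1", "doc-2", "doc-3")

# Rows and columns are indexed by the full ID set, so documents without
# candidates still appear as all-zero rows and columns.
m <- sparseMatrix(
  i = match(candidates$a, ids),
  j = match(candidates$b, ids),
  x = candidates$score,
  dims = c(length(ids), length(ids)),
  dimnames = list(ids, ids)
)

dim(m)
m["doc-1", "doc-2"]
```

Because the dimensions come from the full ID vector rather than from the pairs alone, the matrix keeps a stable shape across runs, which is convenient for graph or modeling code downstream.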

## Locality-sensitive hashing

- `lsh_add()` can add new documents to an existing LSH bucket cache, so users can
  extend an index without rebuilding it from scratch.
- `lsh_compare()` can run comparisons in parallel on non-Windows platforms when
  `options(mc.cores)` is set.
- Long-running C++ hashing and n-gram loops now check for user interrupts, so
  expensive jobs can be stopped more cleanly from R.

## Compatibility and documentation

- Compatibility with current dplyr and tidyr releases has been refreshed.
- README, vignette, reference, and pkgdown examples were regenerated against
  current package output.
- Stale external links and documentation badges were updated so package checks
  and the public documentation site are cleaner.

# textreuse 0.1.4

- Preventative maintenance release to avoid failing tests when a new version of
  BH is released.

# textreuse 0.1.3

- Preventative maintenance release to avoid failing tests when new versions of 
  the dplyr and testthat packages are released.

# textreuse 0.1.2

- Fix memory error in `shingle_ngrams()`
- Fix tests for retokenizing on Windows
- More informative error message if using `lsh()` on corpora without minhashes

# textreuse 0.1.1

- Fix progress bars in vignettes

# textreuse 0.1.0

- Initial release


================================================
FILE: R/RcppExports.R
================================================
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

#' Hash a string to an integer
#' @param x A character vector to be hashed.
#' @return A vector of integer hashes.
#' @examples
#' s <- c("How", "many", "roads", "must", "a", "man", "walk", "down")
#' hash_string(s)
#' @export
hash_string <- function(x) {
    .Call(`_textreuse_hash_string`, x)
}

shingle_ngrams <- function(words, n) {
    .Call(`_textreuse_shingle_ngrams`, words, n)
}

skip_ngrams <- function(words, n, k) {
    .Call(`_textreuse_skip_ngrams`, words, n, k)
}

sw_matrix <- function(m, a, b, match, mismatch, gap, progress) {
    .Call(`_textreuse_sw_matrix`, m, a, b, match, mismatch, gap, progress)
}



================================================
FILE: R/TextReuseCorpus.R
================================================
#' TextReuseCorpus
#'
#' This is the constructor function for a \code{TextReuseCorpus}, modeled on the
#' virtual S3 class \code{Corpus} from the \code{tm} package. The
#' object is a \code{TextReuseCorpus}, which is basically a list containing
#' objects of class \code{\link{TextReuseTextDocument}}. Arguments are passed
#' along to that constructor function. To create the corpus, you can pass either
#' a character vector of paths to text files using the \code{paths =} parameter,
#' a directory containing text files (with any extension) using the \code{dir =}
#' parameter, or a character vector of documents using the \code{text = }
#' parameter, where each element in the character vector is a document. If the
#' character vector passed to \code{text = } has names, then those names will be
#' used as the document IDs. Otherwise, IDs will be assigned to the documents.
#' Only one of the \code{paths}, \code{dir}, or \code{text} parameters should be
#' specified.
#'
#' @details If \code{skip_short = TRUE}, this function will skip very short or
#'   empty documents. A very short document is one where there are too few words
#'   to create at least two n-grams. For example, if five-grams are desired,
#'   then a document must be at least six words long. If no value of \code{n} is
#'   provided, then the function assumes a value of \code{n = 3}. A warning will
#'   be printed with the document ID of each skipped document. Use
#'   \code{skipped()} to get the IDs of skipped documents.
#'
#'   This function will use multiple cores on non-Windows machines if the
#'   \code{"mc.cores"} option is set. For example, to use four cores:
#'   \code{options("mc.cores" = 4L)}.
#'
#' @param paths A character vector of paths to files to be opened.
#' @param dir The path to a directory of text files.
#' @param text A character vector (possibly named) of documents.
#' @param meta A list with named elements for the metadata associated with this
#'   corpus.
#' @param progress Display a progress bar while loading files.
#' @param tokenizer A function to split the text into tokens. See
#'   \code{\link{tokenizers}}. If value is \code{NULL}, then tokenizing and
#'   hashing will be skipped.
#' @param ... Arguments passed on to the \code{tokenizer}.
#' @param hash_func A function to hash the tokens. See
#'   \code{\link{hash_string}}.
#' @param minhash_func A function to create minhash signatures of the document.
#'   See \code{\link{minhash_generator}}.
#' @param keep_tokens Should the tokens be saved in the documents that are
#'   returned or discarded?
#' @param keep_text Should the text be saved in the documents that are returned
#'   or discarded?
#' @param skip_short Should short documents be skipped? (See details.)
#' @param encoding Encoding to be used when reading files.
#'
#' @seealso \link[=TextReuseTextDocument-accessors]{Accessors for TextReuse
#'   objects}.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir, meta = list("description" = "Field Codes"))
#' # Subset by position or file name
#' corpus[[1]]
#' names(corpus)
#' corpus[["ca1851-match"]]
#'
#' @export
TextReuseCorpus <- function(paths, dir = NULL, text = NULL, meta = list(),
                            progress = interactive(),
                            tokenizer = tokenize_ngrams, ...,
                            hash_func = hash_string,
                            minhash_func = NULL,
                            keep_tokens = FALSE,
                            keep_text = TRUE,
                            skip_short = TRUE,
                            encoding = "unknown") {

  if (!is.null(tokenizer)) {
    assert_that(is.function(tokenizer),
                is.function(hash_func))
    tokenizer_name <- as.character(substitute(tokenizer))
    hash_func_name <- as.character(substitute(hash_func))
    if (!is.null(minhash_func)) {
      minhash_func_name <- as.character(substitute(minhash_func))
    } else {
      minhash_func_name <- NULL
    }
    loading_msg <- "Loading, tokenizing, and hashing "
  } else {
    tokenizer_name <- NULL
    hash_func_name <- NULL
    minhash_func_name <- NULL
    loading_msg <- "Loading "
  }

  apply_func <- get_apply_function()

  # If we get a character vector of documents, use that; otherwise load
  # the files from disk.
  if (!missing(text)) {

    assert_that(missing(paths),
                is.null(dir),
                is.character(text))

    if (progress) {
      len <- length(text)
      message(loading_msg, prettyNum(len, big.mark = ","), " documents.")
      if (using_parallel())
        progress <- FALSE
      else
        pb <- txtProgressBar(min = 0, max = len, style = 3)
    }

    if (is.null(names(text)))
      names(text) <- str_c("doc-", seq_along(text))

    docs <- apply_func(seq_along(text), function(i) {
      d <- TextReuseTextDocument(text = text[i],
                                 tokenizer = tokenizer, ...,
                                 hash_func = hash_func,
                                 minhash_func = minhash_func,
                                 keep_tokens = keep_tokens,
                                 keep_text = keep_text,
                                 skip_short = skip_short,
                                 encoding = encoding,
                                 meta = list(id = names(text)[i],
                                             tokenizer = tokenizer_name,
                                             hash_func = hash_func_name,
                                             minhash_func = minhash_func_name))
      if (progress) setTxtProgressBar(pb, i)
      d
    })

    if (progress) close(pb)

    names(docs) <- names(text)

  } else {

    if (missing(paths) && !is.null(dir)) {
      assert_that(is.dir(dir))
      paths <- Sys.glob(str_c(dir, "/*"))
    }

    # Fail early if any path is unreadable instead of discarding the result
    assert_that(all(vapply(paths, is.readable, logical(1), USE.NAMES = FALSE)))

    if (progress) {
      len <- length(paths)
      message(loading_msg, prettyNum(len, big.mark = ","), " documents.")
      if (using_parallel())
        progress <- FALSE
      else
        pb <- txtProgressBar(min = 0, max = len, style = 3)
    }
    docs <- apply_func(seq_along(paths), function(i) {
      d <- TextReuseTextDocument(file = paths[i], tokenizer = tokenizer, ...,
                                 hash_func = hash_func,
                                 minhash_func = minhash_func,
                                 keep_tokens = keep_tokens,
                                 keep_text = keep_text,
                                 skip_short = skip_short,
                                 encoding = encoding,
                                 meta = list(tokenizer = tokenizer_name,
                                             hash_func = hash_func_name,
                                             minhash_func = minhash_func_name))
      if (progress) setTxtProgressBar(pb, i)
      d
    })

    if (progress) close(pb)

    names(docs) <- filenames(paths)
  }

  skipped <- character()

  # Filter documents that were skipped because they were too short
  if (skip_short) {
    skipped_docs <- vapply(docs, is.null, logical(1))
    skipped <- names(docs)[skipped_docs]
    docs <- docs[!skipped_docs]
    if (length(skipped) > 0)
      warning("Skipped ", length(skipped), " documents that were too short. ",
              "Use `skipped()` to get their IDs.")
  }

  assert_that(is.list(meta))
  meta$tokenizer <- tokenizer_name
  meta$hash_func <- hash_func_name
  meta$minhash_func <- minhash_func_name

  if (!is.null(names(meta))) meta <- sort_meta(meta)

  corpus <- list(documents = docs, meta = meta)
  class(corpus) <- c("TextReuseCorpus", "Corpus")
  attr(corpus, "skipped") <- skipped

  corpus

}

#' @export
meta.TextReuseCorpus <- function(x, tag = NULL, ...) {
  if (is.null(tag))
    x$meta
  else
    x$meta[[tag]]
}

#' @export
`meta<-.TextReuseCorpus` <- function(x, tag = NULL, ..., value) {
  if (is.null(tag)) {
    assert_that(is.list(value))
    x$meta <- value
  } else {
    x$meta[[tag]] <- value
  }
  x
}

#' @export
print.TextReuseCorpus <- function(x, ...) {
  cat("TextReuseCorpus\n")
  cat("Number of documents:", length(x), "\n")
  pretty_print_metadata(x)
}

#' @export
length.TextReuseCorpus <- function(x) {
  length(x$documents)
}

#' @export
`[.TextReuseCorpus` <- function(x, i) {
  x$documents <- x$documents[i]
  x
}

#' @export
`[[.TextReuseCorpus` <- function(x, i) {
  x$documents[[i]]
}

#' @export
names.TextReuseCorpus <- function(x) {
  names(x$documents)
}

#' @export
`names<-.TextReuseCorpus` <- function(x, value) {
  names(x$documents) <- value
  x
}

#' @param x An R object to check.
#' @export
#' @rdname TextReuseCorpus
is.TextReuseCorpus <- function(x) {
  inherits(x, "TextReuseCorpus")
}

#' @export
#' @rdname TextReuseCorpus
skipped <- function(x) {
  assert_that(is.TextReuseCorpus(x))
  attr(x, "skipped", exact = TRUE)
}


================================================
FILE: R/TextReuseTextDocument.R
================================================
#' TextReuseTextDocument
#'
#' This is the constructor function for \code{TextReuseTextDocument} objects.
#' This class is used for comparing documents.
#'
#' @param text A character vector containing the text of the document. This
#'   argument can be skipped if supplying \code{file}.
#' @param file The path to a text file, if \code{text} is not provided.
#' @param meta A list with named elements for the metadata associated with this
#'   document. If a document is created using the \code{text} parameter, then
#'   you must provide an \code{id} field, e.g., \code{meta = list(id =
#'   "my_id")}. If the document is created using \code{file}, then the ID will
#'   be created from the file name.
#' @param tokenizer A function to split the text into tokens. See
#'   \code{\link{tokenizers}}. If value is \code{NULL}, then tokenizing and
#'   hashing will be skipped.
#' @param ... Arguments passed on to the \code{tokenizer}.
#' @param hash_func A function to hash the tokens. See
#'   \code{\link{hash_string}}.
#' @param minhash_func A function to create minhash signatures of the document.
#'   See \code{\link{minhash_generator}}.
#' @param keep_tokens Should the tokens be saved in the document that is
#'   returned or discarded?
#' @param keep_text Should the text be saved in the document that is returned or
#'   discarded?
#' @param skip_short Should short documents be skipped? (See details.)
#' @param encoding Encoding to be used when reading files.
#'
#' @details This constructor function follows a three-step process. It reads in
#'   the text, either from a file or from memory. It then tokenizes that text.
#'   Then it hashes the tokens. Most of the comparison functions in this package
#'   rely only on the hashes to make the comparison. By passing \code{FALSE} to
#'   \code{keep_tokens} and \code{keep_text}, you can avoid saving those
#'   objects, which can result in significant memory savings for large corpora.
#'
#'   If \code{skip_short = TRUE}, this function will return \code{NULL} for very
#'   short or empty documents. A very short document is one where there are too
#'   few words to create at least two n-grams. For example, if five-grams are
#'   desired, then a document must be at least six words long. If no value of
#'   \code{n} is provided, then the function assumes a value of \code{n = 3}. A
#'   warning will be printed with the document ID of a skipped document.
#'
#' @return An object of class \code{TextReuseTextDocument}. This object inherits
#'   from the virtual S3 class \code{\link[NLP]{TextDocument}} in the NLP
#'   package. It contains the following elements: \describe{ \item{content}{The
#'   text of the document.} \item{tokens}{The tokens created from the text.}
#'   \item{hashes}{Hashes created from the tokens.} \item{minhashes}{The minhash
#'   signature of the document.} \item{metadata}{The document metadata,
#'   including the filename (if any) in \code{file}.} }
#'
#' @seealso \link[=TextReuseTextDocument-accessors]{Accessors for TextReuse
#'   objects}.
#'
#' @examples
#' file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
#' doc  <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"))
#' print(doc)
#' meta(doc)
#' head(tokens(doc))
#' head(hashes(doc))
#' \dontrun{
#' content(doc)
#' }
#' @export
TextReuseTextDocument <- function(text, file = NULL, meta = list(),
                                  tokenizer = tokenize_ngrams, ...,
                                  hash_func = hash_string,
                                  minhash_func = NULL,
                                  keep_tokens = FALSE,
                                  keep_text = TRUE,
                                  skip_short = TRUE,
                                  encoding = "unknown") {

  if (!missing(text)) assert_that(has_id(meta))

  if (!is.null(file)) {
    assert_that(missing(text),
                is.readable(file))
    text <- as_string(readLines(file, encoding = encoding))
  }

  assert_that(is.character(text))
  text <- as_string(text)

  # Define document ID early
  document_id <- ifelse(is.null(meta$id), filenames(file), meta$id)

  # Check length of document
  if (skip_short) {
    min_words <- short_document_word_minimum(tokenizer, list(...))
    if (wordcount(text) < min_words) {
      warning("Skipping document with ID '", document_id,
              "' because it has too few words ",
              "to create tokens with the requested tokenizer.",
              call. = FALSE, noBreaks. = TRUE)
      return(NULL)
    }
  }

  # Tokenize and hash
  if (!is.null(tokenizer)) {

    assert_that(is.function(tokenizer))
    tokenizer_name <- as.character(substitute(tokenizer))
    tokens <- tokenizer(text, ...)

    assert_that(is.function(hash_func))
    hash_func_name <- as.character(substitute(hash_func))
    hashes <- hash_func(tokens)

    # Also minhash if requested
    if (!is.null(minhash_func)) {
      assert_that(is.function(minhash_func))
      minhash_func_name <- as.character(substitute(minhash_func))
      minhashes <- minhash_func(tokens)
    } else {
      minhashes <- NULL
      minhash_func_name <- NULL
    }

  } else {
    tokens <- NULL
    hashes <- NULL
    minhashes <- NULL
    tokenizer_name <- NULL
    hash_func_name <- NULL
    minhash_func_name <- NULL

  }

  if (!keep_tokens) tokens <- NULL
  if (!keep_text) text <- NULL

  if (missing(meta)) {
    meta <- list(file = file,
                 id = document_id,
                 tokenizer = tokenizer_name,
                 hash_func = hash_func_name,
                 minhash_func = minhash_func_name)
  }
  assert_that(is.list(meta))
  if (!is.null(file)) {
    meta$file <- file
    meta$id <- document_id
  }
  # Don't overwrite these when called from TextReuseCorpus
  if (is.null(meta$tokenizer) && is.null(meta$hash_func) &&
      is.null(meta$minhash_func)) {
    meta$tokenizer <- tokenizer_name
    meta$hash_func <- hash_func_name
    meta$minhash_func <- minhash_func_name
  }

  meta <- sort_meta(meta)

  doc <- list(
    content   = text,
    tokens    = tokens,
    hashes    = hashes,
    minhashes = minhashes,
    meta      = meta
    )

  class(doc) <- c("TextReuseTextDocument", "TextDocument")

  doc

}

short_document_word_minimum <- function(tokenizer, args) {
  n <- args$n
  if (is.null(n)) n <- 3

  if (!is.null(tokenizer) && identical(tokenizer, tokenize_skip_ngrams)) {
    k <- args$k
    if (is.null(k)) k <- 1
    return(n + n * k - k)
  }

  n + 1
}

#' @importFrom NLP meta
#' @export
NLP::meta

#' @importFrom NLP meta<-
#' @export
NLP::`meta<-`

#' @importFrom NLP content
#' @export
NLP::content

#' @importFrom NLP content<-
#' @export
NLP::`content<-`

#' @export
print.TextReuseTextDocument <- function(x, ...) {
  cat("TextReuseTextDocument\n")
  pretty_print_metadata(x)
  cat("content", ":", str_sub(x$content, end = 200))
  invisible(x)
}

#' @export
as.character.TextReuseTextDocument <- function(x, ...) {
  as.character(x$content)
}

#' @export
#' @method content TextReuseTextDocument
content.TextReuseTextDocument <- function(x) {
  x$content
}

#' @export
#' @method content<- TextReuseTextDocument
`content<-.TextReuseTextDocument` <- function(x, value) {
  assert_that(is.character(value))
  x$content <- value
  x
}

#' @export
#' @method meta TextReuseTextDocument
meta.TextReuseTextDocument <- function(x, tag = NULL, ...) {
  if (is.null(tag))
    x$meta
  else
    x$meta[[tag]]
}

#' @export
#' @method meta<- TextReuseTextDocument
`meta<-.TextReuseTextDocument` <- function(x, tag = NULL, ..., value) {
  if (is.null(tag)) {
    assert_that(is.list(value))
    x$meta <- value
  } else {
    x$meta[[tag]] <- value
  }
  x
}

#' Accessors for TextReuse objects
#'
#' Accessor functions to read and write components of
#' \code{\link{TextReuseTextDocument}} and \code{\link{TextReuseCorpus}}
#' objects.
#' @name TextReuseTextDocument-accessors
#' @param x The object to access.
#' @param value The value to assign.
#' @return Either a vector or a named list of vectors.
NULL

#' @export
#' @rdname TextReuseTextDocument-accessors
tokens <- function(x) UseMethod("tokens", x)

#' @export
tokens.TextReuseTextDocument <- function(x) x$tokens

#' @export
tokens.TextReuseCorpus <- function(x) {
  corpus_names <- names(x)
  l <- lapply(x$documents, function(i) tokens(i))
  names(l) <- corpus_names
  l
}

#' @export
#' @rdname TextReuseTextDocument-accessors
`tokens<-` <- function(x, value) UseMethod("tokens<-", x)

#' @export
`tokens<-.TextReuseTextDocument` <- function(x, value) {
  x$tokens <- value
  x
}

#' @export
#' @rdname TextReuseTextDocument-accessors
hashes <- function(x) UseMethod("hashes", x)

#' @export
hashes.TextReuseTextDocument <- function(x) x$hashes

#' @export
hashes.TextReuseCorpus <- function(x) {
  corpus_names <- names(x)
  l <- lapply(x$documents, function(i) hashes(i))
  names(l) <- corpus_names
  l
}

#' @export
#' @rdname TextReuseTextDocument-accessors
`hashes<-` <- function(x, value) UseMethod("hashes<-", x)

#' @export
`hashes<-.TextReuseTextDocument` <- function(x, value) {
  x$hashes <- value
  x
}

#' @export
#' @rdname TextReuseTextDocument-accessors
minhashes <- function(x) UseMethod("minhashes", x)

#' @export
minhashes.TextReuseTextDocument <- function(x) x$minhashes

#' @export
minhashes.TextReuseCorpus <- function(x) {
  corpus_names <- names(x)
  l <- lapply(x$documents, function(i) minhashes(i))
  names(l) <- corpus_names
  l
}

#' @export
#' @rdname TextReuseTextDocument-accessors
`minhashes<-` <- function(x, value) UseMethod("minhashes<-", x)

#' @export
`minhashes<-.TextReuseTextDocument` <- function(x, value) {
  x$minhashes <- value
  x
}

#' @param x An R object to check.
#' @export
#' @rdname TextReuseTextDocument
is.TextReuseTextDocument <- function(x) {
  inherits(x, "TextReuseTextDocument")
}

#' @export
#' @rdname TextReuseTextDocument
has_content <- function(x) {
  assert_that(is.TextReuseTextDocument(x))
  !is.null(x$content)
}

assertthat::on_failure(has_content) <- function(call, env) {
  "Document does not have text in its content field."
}

#' @export
#' @rdname TextReuseTextDocument
has_tokens <- function(x) {
  assert_that(is.TextReuseTextDocument(x))
  !is.null(x$tokens)
}

assertthat::on_failure(has_tokens) <- function(call, env) {
  "Document does not have tokens."
}

#' @export
#' @rdname TextReuseTextDocument
has_hashes <- function(x) {
  assert_that(is.TextReuseTextDocument(x))
  !is.null(x$hashes)
}

assertthat::on_failure(has_hashes) <- function(call, env) {
  "Document does not have hashes."
}

#' @export
#' @rdname TextReuseTextDocument
has_minhashes <- function(x) {
  assert_that(is.TextReuseTextDocument(x))
  !is.null(x$minhashes)
}

assertthat::on_failure(has_minhashes) <- function(call, env) {
  "Document does not have a minhash signature."
}

has_minhashes_corpus <- function(x) {
  assert_that(is.TextReuseCorpus(x))
  all(vapply(minhashes(x), Negate(is.null), logical(1)))
}

assertthat::on_failure(has_minhashes_corpus) <- function(call, env) {
  "Some documents in the corpus do not have a minhash signature."
}



================================================
FILE: R/align_local.R
================================================
#' Local alignment of natural language texts
#'
#' This function takes two texts, either as strings or as
#' \code{TextReuseTextDocument} objects, and finds the optimal local alignment
#' of those texts. A local alignment finds the best matching subset of the two
#' documents. This function adapts the
#' \href{https://en.wikipedia.org/wiki/Smith-Waterman_algorithm}{Smith-Waterman
#' algorithm}, used for genetic sequencing, for use with natural language. It
#' compares the texts word by word (the comparison is case-insensitive) and
#' scores them according to a set of parameters. These parameters define the
#' score for a \code{match}, and the penalties for a \code{mismatch} and for
#' opening a \code{gap} (i.e., the first mismatch in a potential sequence). The
#' function then reports the optimal local alignment. Only the subset of the
#' documents that is a match is included. Insertions or deletions in the text
#' are reported with the \code{edit_mark} character.
#'
#' @param a A character vector of length one, or a
#'   \code{\link{TextReuseTextDocument}}.
#' @param b A character vector of length one, or a
#'   \code{\link{TextReuseTextDocument}}.
#' @param match The score to assign a matching word. Should be a positive
#'   integer.
#' @param mismatch The score to assign a mismatching word. Should be a negative
#'   integer or zero.
#' @param gap The penalty for opening a gap in the sequence. Should be a
#'   negative integer or zero.
#' @param edit_mark A single character used for displaying
#'   insertions/deletions in the documents.
#' @param preserve_punctuation Preserve punctuation in the displayed alignment.
#'   The alignment still compares tokens after stripping punctuation.
#' @param progress Display a progress bar and messages while computing the
#'   alignment.
#'
#' @return A list with the class \code{textreuse_alignment}. This list contains
#'   several elements: \itemize{ \item \code{a_edits} and \code{b_edits}:
#'   Character vectors of the sequences with edits marked. \item \code{score}:
#'   The score of the optimal alignment. }
#'
#' @details
#'
#' The compute time of this function is proportional to the product of the
#' lengths of the two documents. Thus, longer documents will take considerably
#' more time to compute. This function has been tested with pairs of documents
#' containing about 25 thousand words each.
#'
#' If the function reports that there were multiple optimal alignments, then it
#' is likely that there is no strong match in the document.
#'
#' The score reported for the local alignment is dependent on both the size of
#' the documents and on the strength of the match, as well as on the parameters
#' for match, mismatch, and gap penalties, so the scores are not directly
#' comparable.
#'
#' @references For a useful description of the algorithm, see
#'   \href{http://etherealbits.com/2013/04/string-alignment-dynamic-programming-dna/}{this
#'   post}. For the application of the Smith-Waterman algorithm to natural
#'   language, see David A. Smith, Ryan Cordell, and Elizabeth Maddock Dillon,
#'   "Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers,"
#'   IEEE International Conference on Big Data, 2013.
#'
#' @examples
#' align_local("The answer is blowin' in the wind.",
#'             "As the Bob Dylan song says, the answer is blowing in the wind.")
#'
#' # Example of matching documents from a corpus
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir, progress = FALSE)
#' alignment <- align_local(corpus[["ca1851-match"]], corpus[["ny1850-match"]])
#' str(alignment)
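#'
#' # A sketch of how the scoring parameters might be varied. The values are
#' # illustrative only, not recommendations: a larger match score, harsher
#' # mismatch and gap penalties, and "-" instead of "#" as the edit mark.
#' strict <- align_local("The answer is blowin' in the wind.",
#'                       "As the Bob Dylan song says, the answer is blowing in the wind.",
#'                       match = 3L, mismatch = -2L, gap = -2L,
#'                       edit_mark = "-")
#' strict$score
#' strict$a_edits
#' strict$b_edits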
#'
#' @export
align_local <- function(a, b, match = 2L, mismatch = -1L, gap = -1L,
                        edit_mark = "#", preserve_punctuation = FALSE,
                        progress = interactive()) {
 assert_that(identical(class(a), class(b)))
 UseMethod("align_local", a)
}

#' @export
align_local.TextReuseTextDocument <- function(a, b, match = 2L, mismatch = -1L,
                                              gap = -1L, edit_mark = "#",
                                              preserve_punctuation = FALSE,
                                              progress = interactive()) {
  align_local(content(a), content(b), match = match, mismatch = mismatch,
              gap = gap, edit_mark = edit_mark,
              preserve_punctuation = preserve_punctuation,
              progress = progress)
}

#' @export
align_local.default <- function(a, b, match = 2L, mismatch = -1L, gap = -1L,
                                edit_mark = "#", preserve_punctuation = FALSE,
                                progress = interactive()) {

  assert_that(is.string(a),
              is.string(b),
              is_integer_like(match),
              is_integer_like(mismatch),
              is_integer_like(gap),
              is.string(edit_mark),
              is.flag(preserve_punctuation))

  if (match <= 0 || mismatch > 0 || gap > 0 || !(str_length(edit_mark) == 1)) {
    stop("The scoring parameters should have the following characteristics:\n",
         "    - `match` should be a positive integer\n",
         "    - `mismatch` should be a negative integer or zero\n",
         "    - `gap` should be a negative integer or zero\n",
         "    - `edit_mark` should be a single character\n")
  }

  # Keep everything as integers because IntegerMatrix saves memory
  match    <- as.integer(match)
  mismatch <- as.integer(mismatch)
  gap      <- as.integer(gap)

  # Prepare the character vectors. Tokenize to words to compare word by word.
  # Use all lower case for the comparison, but use original capitalization in
  # the output.
  a_orig <- align_tokens(a, preserve_punctuation = preserve_punctuation)
  b_orig <- align_tokens(b, preserve_punctuation = preserve_punctuation)
  a <- normalize_alignment_tokens(a_orig)
  b <- normalize_alignment_tokens(b_orig)

  # Only show a progress bar for long computations
  n_rows <- length(b) + 1
  n_cols <- length(a) + 1
  if (n_rows * n_cols < 1e7) progress <- FALSE

  # Create the integer matrix
  if (progress) {
    message("Preparing a matrix with ",
            prettyNum(n_rows * n_cols, big.mark = ","),
            " elements.")
  }
  m <- matrix(0L, n_rows, n_cols)

  # Calculate the matrix of possible paths
  if (progress) message("Computing the optimal local alignment.")
  m <- sw_matrix(m, a, b, match, mismatch, gap, progress)

  # Find the starting place in the matrix
  alignment_score <- max(m)
  if (alignment_score == 0) {
    alignment <- list(a_edits = "", b_edits = "", score = alignment_score)
    class(alignment) <- c("textreuse_alignment", "list")
    return(alignment)
  }

  max_match <- which(m == alignment_score, arr.ind = TRUE, useNames = FALSE)

  if (nrow(max_match) > 1) {
    warning("Multiple optimal local alignments found; selecting only one of them.",
            call. = FALSE)
  }

  if (progress) message("Extracting the local alignment.")

  # Create output vectors which are as long as conceivably necessary
  a_out <- vector(mode = "character", length = max(max_match))
  b_out <- vector(mode = "character", length = max(max_match))
  a_out[] <- NA_character_
  b_out[] <- NA_character_

  # Initialize counters for the matrix and the output vector
  row_i <- max_match[1, 1]
  col_i <- max_match[1, 2]
  out_i <- 1L

  # Place our first known values in the output vectors
  b_out[out_i] <- b_orig[row_i - 1]
  a_out[out_i] <- a_orig[col_i - 1]
  out_i <- out_i + 1L # Advance the out vector position

  # Begin moving up, left, or diagonally within the matrix till we hit a zero
  while (m[row_i - 1, col_i - 1] != 0) {

    # Values of the current cell, the cells up, left, diagonal, and the max
    up       <- m[row_i - 1, col_i]
    left     <- m[row_i, col_i - 1]
    diagn    <- m[row_i - 1, col_i - 1]
    max_cell <- max(up, left, diagn)

    # Move in the direction of the maximum cell. If there are ties, choose up
    # first, then left, then diagonal. Privilege up and left because they
    # preserve edits.
    #
    # In each case add the current words to the out vectors. For moves up and
    # left there will be an insertion/deletion, so add a symbol like ####
    # that is the same number of characters as the word in the other vector.
    #
    # Note that the index of the matrix is offset by one from character vectors
    # a and b, so we use the row and column indices - 1. The column corresponds
    # to `a` and the rows correspond to `b`.
    if (up == max_cell) {
      row_i <- row_i - 1
      bword <- b_orig[row_i - 1]
      b_out[out_i] <- bword
      a_out[out_i] <- mark_chars(bword, edit_mark)
    } else if (left == max_cell) {
      col_i <- col_i - 1
      aword <- a_orig[col_i - 1]
      b_out[out_i] <- mark_chars(aword, edit_mark)
      a_out[out_i] <- aword
    } else if (diagn == max_cell) {
      row_i <- row_i - 1
      col_i <- col_i - 1
      bword <- b_orig[row_i - 1]
      aword <- a_orig[col_i - 1]
      # Diagonals are a special case, because instead of an insertion or a
      # deletion we might have a substitution of words. If that is the case,
      # then treat it like a double insertion and deletion.
      if (a[col_i - 1] == b[row_i - 1]) {
        b_out[out_i] <- bword
        a_out[out_i] <- aword
      } else {
        b_out[out_i] <- bword
        a_out[out_i] <- mark_chars(bword, edit_mark)
        out_i <- out_i + 1
        b_out[out_i] <- mark_chars(aword, edit_mark)
        a_out[out_i] <- aword
      }
    }

    # Move forward one position in the out vectors, no matter which direction
    # we moved
    out_i <- out_i + 1

  }

  # Clean up the outputs
  b_out <- str_c(rev(b_out[!is.na(b_out)]), collapse = " ")
  a_out <- str_c(rev(a_out[!is.na(a_out)]), collapse = " ")

  # Create the alignment object
  alignment <- list(a_edits = a_out, b_edits = b_out, score = alignment_score)
  class(alignment) <- c("textreuse_alignment", "list")

  alignment

}

align_tokens <- function(x, preserve_punctuation) {
  if (!preserve_punctuation) return(tokenize_words(x, lowercase = FALSE))
  tokens <- str_split(str_squish(x), "\\s+")[[1]]
  tokens[tokens != ""]
}

normalize_alignment_tokens <- function(x) {
  str_to_lower(str_replace_all(x, "[[:punct:]]", ""))
}

#' @export
print.textreuse_alignment <- function(x, ...) {
  cat("TextReuse alignment\n")
  cat("Alignment score:", x$score, "\n")
  cat("Document A:\n")
  cat(str_wrap(x$a_edits, width = 72))
  cat("\n\nDocument B:\n")
  cat(str_wrap(x$b_edits, width = 72))
  cat("\n\n")
  invisible(x)
}


================================================
FILE: R/conversion-functions.R
================================================
#' Convert candidates data frames to other formats
#'
#' These functions convert a \code{textreuse_candidates} object to dense or
#' sparse matrices.
#'
#' @param x An object of class \code{\link[=lsh_compare]{textreuse_candidates}}.
#' @param ... Additional arguments.
#'
#' @return A similarity matrix with row and column names containing document IDs.
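#'
#' @examples
#' # A sketch of a typical workflow (the seed and band count are arbitrary):
#' # score the LSH candidates, then convert the result to a matrix. Note that
#' # as_sparse_matrix() relies on the Matrix package being installed.
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 234)
#' corpus <- TextReuseCorpus(dir = dir,
#'                           tokenizer = tokenize_ngrams, n = 5,
#'                           minhash_func = minhash, progress = FALSE)
#' buckets <- lsh(corpus, bands = 50, progress = FALSE)
#' candidates <- lsh_candidates(buckets)
#' scored <- lsh_compare(candidates, corpus, jaccard_similarity,
#'                       progress = FALSE)
#' as.matrix(scored)
#' as_sparse_matrix(scored)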
#'
#' @export
#' @method as.matrix textreuse_candidates
as.matrix.textreuse_candidates <- function(x, ...) {

  docs <- candidate_doc_ids(x)
  n <- length(docs)
  m <- matrix(0, n, n)
  rownames(m) <- docs
  colnames(m) <- docs
  diag(m) <- 1.0

  for (r in seq_len(nrow(x))) {
    a <- x$a[r]
    b <- x$b[r]
    score <- x$score[r]
    m[a, b] <- score
    m[b, a] <- score
  }

  m

}

#' @rdname as.matrix.textreuse_candidates
#' @export
as_sparse_matrix <- function(x) {
  assert_that(is_candidates_df(x))

  docs <- candidate_doc_ids(x)
  n <- length(docs)
  doc_ids <- stats::setNames(seq_along(docs), docs)

  rows <- seq_len(n)
  cols <- seq_len(n)
  values <- rep(1.0, n)

  if (nrow(x) > 0) {
    a <- unname(doc_ids[x$a])
    b <- unname(doc_ids[x$b])
    rows <- c(rows, a, b)
    cols <- c(cols, b, a)
    values <- c(values, x$score, x$score)
  }

  Matrix::sparseMatrix(i = rows, j = cols, x = values, dims = c(n, n),
                       use.last.ij = TRUE,
                       dimnames = list(docs, docs))
}

candidate_doc_ids <- function(x) {
  all_doc_ids <- attr(x, "all-doc-ids")
  if (is.null(all_doc_ids)) {
    all_doc_ids <- c(x$a, x$b)
  }
  sort(unique(all_doc_ids))
}


================================================
FILE: R/filenames.R
================================================
#' Filenames from paths
#'
#' This function takes a character vector of paths and returns just the file
#' name, by default without the extension. A \code{\link{TextReuseCorpus}} uses
#' the paths to the files in the corpus as the names of the list. This function
#' is intended to turn those paths into more manageable identifiers.
#'
#' @param paths A character vector of paths.
#' @param extension Should the file extension be preserved?
#' @seealso \code{\link{basename}}
#' @examples
#' paths <- c("corpus/one.txt", "corpus/two.md", "corpus/three.text")
#' filenames(paths)
#' filenames(paths, extension = TRUE)
#' @export
filenames <- function(paths, extension = FALSE) {
  assert_that(is.character(paths))
  f <- basename(paths)
  if (extension)
    return(f)
  else
    str_replace(f, "\\.[[:alpha:]]{1,}$", "")
}


================================================
FILE: R/lsh.R
================================================
#'Locality sensitive hashing for minhash
#'
#'Locality sensitive hashing (LSH) discovers potential matches among a corpus of
#'documents quickly, so that only likely pairs can be compared.
#'
#'@details Locality sensitive hashing is a technique for detecting document
#'  similarity that does not require pairwise comparisons. When comparing pairs
#'  of documents, the number of pairs grows rapidly, so that only the smallest
#'  corpora can be compared pairwise in a reasonable amount of computation time.
#'  Locality sensitive hashing, on the other hand, takes a document which has
#'  been tokenized and hashed using a minhash algorithm. (See
#'  \code{\link{minhash_generator}}.) Each set of minhash signatures is then
#'  broken into bands comprised of a certain number of rows. (For example, 200
#'  minhash signatures might be broken down into 20 bands each containing 10
#'  rows.) Each band is then hashed to a bucket. Documents with identical rows
#'  in a band will be hashed to the same bucket. The likelihood that a pair of
#'  documents will be marked as potential duplicates increases with the number
#'  of bands and decreases with the number of rows in each band.
#'
#'  This function returns a data frame with the additional class
#'  \code{lsh_buckets}. The LSH technique only requires that the signatures for
#'  each document be calculated once. So it is possible, as long as one uses the
#'  same minhash function and the same number of bands, to combine the outputs
#'  from this function at different times. The output can thus be treated as a
#'  kind of cache of LSH signatures.
#'
#'  To extract pairs of documents from the output of this function, see
#'  \code{\link{lsh_candidates}}.
#'
#'@param x A \code{\link{TextReuseCorpus}} or
#'  \code{\link{TextReuseTextDocument}}.
#'@param bands The number of bands to use for locality sensitive hashing. The
#'  number of hashes in the documents in the corpus must be evenly divisible by
#'  the number of bands. See \code{\link{lsh_threshold}} and
#'  \code{\link{lsh_probability}} for guidance in selecting the number of bands
#'  and hashes.
#'@param progress Display a progress bar while comparing documents.
#'
#'@return A data frame (with the additional class \code{lsh_buckets}),
#'  containing a column with the document IDs and a column with their LSH
#'  signatures, or buckets.
#'
#'@references Jure Leskovec, Anand Rajaraman, and Jeff Ullman,
#'  \emph{Mining of Massive Datasets} (Cambridge University Press, 2011), ch.
#'  3. See also Matthew Casperson,
#'  "\href{http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html}{Minhash
#'   for Dummies}" (November 14, 2013).
#'
#'@seealso \code{\link{minhash_generator}}, \code{\link{lsh_add}},
#'  \code{\link{lsh_candidates}}, \code{\link{lsh_query}},
#'  \code{\link{lsh_probability}},
#'  \code{\link{lsh_threshold}}
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 235)
#' corpus <- TextReuseCorpus(dir = dir,
#'                           tokenizer = tokenize_ngrams, n = 5,
#'                           minhash_func = minhash)
#' buckets <- lsh(corpus, bands = 50)
#' buckets
#'@export
lsh <- function(x, bands, progress = interactive()) {
  UseMethod("lsh", x)
}

#' Add documents to an LSH cache
#'
#' This function adds buckets for one or more new documents to an existing
#' \code{lsh_buckets} object. Use the same \code{bands} value and minhash
#' function that were used to create the original buckets.
#'
#' @param buckets An \code{lsh_buckets} object created by \code{\link{lsh}}.
#' @param x A \code{\link{TextReuseCorpus}} or
#'   \code{\link{TextReuseTextDocument}} with minhashes.
#' @inheritParams lsh
#' @return An updated \code{lsh_buckets} object.
#' @seealso \code{\link{lsh}}, \code{\link{lsh_query}},
#'   \code{\link{lsh_candidates}}
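#' @examples
#' # A sketch of incremental use: bucket two documents first, then add a
#' # third using the same minhash function and number of bands. The seed and
#' # band count here are arbitrary.
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 235)
#' corpus <- TextReuseCorpus(dir = dir,
#'                           tokenizer = tokenize_ngrams, n = 5,
#'                           minhash_func = minhash, progress = FALSE)
#' buckets <- lsh(corpus[c("ca1851-match", "ca1851-nomatch")], bands = 50,
#'                progress = FALSE)
#' buckets <- lsh_add(buckets, corpus[["ny1850-match"]], bands = 50,
#'                    progress = FALSE)
#' lsh_candidates(buckets)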
#' @export
lsh_add <- function(buckets, x, bands, progress = interactive()) {
  assert_that(is_lsh_buckets(buckets))

  new_buckets <- lsh(x, bands = bands, progress = progress)
  new_doc_ids <- unique(new_buckets$doc)

  buckets <- buckets %>%
    dplyr::filter(!.data$doc %in% new_doc_ids) %>%
    dplyr::bind_rows(new_buckets) %>%
    dplyr::arrange(.data$doc)

  class(buckets) <- c("lsh_buckets", setdiff(class(buckets), "lsh_buckets"))

  buckets
}

#' @export
lsh.TextReuseCorpus <- function(x, bands, progress = interactive()) {

  assert_that(is.count(bands),
              has_minhashes_corpus(x))

  h <- length(minhashes(x[[1]])) # number of hashes
  d <- length(x) # number of documents
  r <- h / bands # number of rows

  assert_that(check_banding(h, bands))

  # To assign rows in data frame to bands
  b_assign <-  tibble::tibble(band =
      rep(vapply(1:bands, function(i) rep(i, r), integer(r)), d)
    )

  all_minhashes <- minhashes(x)
  col_names <- names(all_minhashes)

  buckets <- all_minhashes %>%
    tibble::as_tibble() %>%
    tidyr::gather("doc", "hash", col_names) %>%
    dplyr::mutate(doc = as.character(.data$doc)) %>%
    dplyr::bind_cols(b_assign) %>%
    dplyr::group_by(.data$doc, .data$band)

  rm(b_assign)

  if (progress) {
    message("Calculating LSH buckets")
    pb <- txtProgressBar(min = 0, max = d * bands, style = 3)
  }

  # include the band in the signature hash to avoid false matches
  buckets <- buckets %>%
    dplyr::summarize(buckets = digest_progress(list(hash, unique(band)),
                                                 pb, progress))

  if (progress) close(pb)

  buckets <- buckets %>%
    dplyr::select(-.data$band) %>%
    dplyr::ungroup()

  class(buckets) <- c("lsh_buckets", class(buckets))

  buckets

}

# A wrapper around digest to be able to use the progress bar
digest_progress <- function(x, pb, progress) {
  bucket <- digest::digest(x)
  if (progress) setTxtProgressBar(pb, getTxtProgressBar(pb) + 1)
  bucket
}

#' @export
lsh.TextReuseTextDocument <- function(x, bands, progress = interactive()) {

  assert_that(is.count(bands),
              has_minhashes(x))

  all_minhashes <- minhashes(x)
  h <- length(all_minhashes) # number of hashes
  r <- h / bands # number of rows

  assert_that(check_banding(h, bands))

  # To assign rows in data frame to bands
  b_assign <-  tibble::tibble(band =
      rep(vapply(1:bands, function(i) rep(i, r), integer(r)), 1)
    )


  buckets <- tibble::tibble(doc = x$meta$id, hash = all_minhashes) %>%
    dplyr::bind_cols(b_assign) %>%
    dplyr::group_by(.data$doc, .data$band) %>%
    dplyr::summarize(buckets = digest::digest(list(hash, unique(band)))) %>%
    dplyr::select(-.data$band) %>%
    dplyr::ungroup()

  class(buckets) <- c("lsh_buckets", class(buckets))

  buckets

}


================================================
FILE: R/lsh_candidates.R
================================================
#' Candidate pairs from LSH comparisons
#'
#' Given a data frame of LSH buckets returned from \code{\link{lsh}}, this
#' function returns the potential candidates.
#'
#' @param buckets A data frame returned from \code{\link{lsh}}.
#'
#' @return A data frame of candidate pairs.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 234)
#' corpus <- TextReuseCorpus(dir = dir,
#'                           tokenizer = tokenize_ngrams, n = 5,
#'                           minhash_func = minhash)
#' buckets <- lsh(corpus, bands = 50)
#' lsh_candidates(buckets)
#'
#' @export
lsh_candidates <- function(buckets) {
  assert_that(is_lsh_buckets(buckets))

  candidates <- buckets %>%
    dplyr::left_join(buckets, by = "buckets") %>%
    dplyr::filter(.data$doc.x != .data$doc.y) %>%
    dplyr::distinct(doc.x, doc.y) %>%
    dplyr::arrange(.data$doc.x, .data$doc.y) %>%
    dplyr::mutate(dn = pmin(.data$doc.x, .data$doc.y),
                  up = pmax(.data$doc.x, .data$doc.y)) %>%
    dplyr::distinct(.data$up, .data$dn) %>%
    dplyr::select(a = dn, b = up) %>%
    dplyr::arrange(.data$a, .data$b) %>%
    dplyr::mutate(score = NA_real_)

  class(candidates) <- c("textreuse_candidates", class(candidates))

  candidates

}


================================================
FILE: R/lsh_compare.R
================================================
#' Compare candidates identified by LSH
#'
#' The \code{\link{lsh_candidates}} function only identifies potential matches;
#' it cannot estimate the actual similarity of the documents. This function
#' takes a data frame returned by \code{\link{lsh_candidates}} and applies a
#' comparison function to each candidate pair of documents in a corpus, thereby
#' calculating the document similarity score. Note that since your corpus will
#' have minhash signatures rather than hashes for the tokens themselves, you
#' will probably wish to use \code{\link{tokenize}} to calculate new hashes.
#' This can be done for just the potentially similar documents. See the package
#' vignettes for details.
#'
#' @param candidates A data frame returned by \code{\link{lsh_candidates}}.
#' @param corpus The same \code{\link{TextReuseCorpus}} corpus which was used
#'   to generate the candidates.
#' @param f A comparison function such as \code{\link{jaccard_similarity}}.
#' @param progress Display a progress bar while comparing documents. Progress
#'   bars are disabled when using parallel processing.
#' @return A data frame with values calculated for \code{score}.
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 234)
#' corpus <- TextReuseCorpus(dir = dir,
#'                           tokenizer = tokenize_ngrams, n = 5,
#'                           minhash_func = minhash)
#' buckets <- lsh(corpus, bands = 50)
#' candidates <- lsh_candidates(buckets)
#' lsh_compare(candidates, corpus, jaccard_similarity)
#' @export
lsh_compare <- function(candidates, corpus, f, progress = interactive()) {
  assert_that(is_candidates_df(candidates),
              is.function(f),
              is.TextReuseCorpus(corpus))

  rows_to_score <- which(is.na(candidates$score))
  num_rows <- length(rows_to_score)
  use_parallel <- using_parallel()

  if (num_rows == 0) {
    attr(candidates, "all-doc-ids") <- names(corpus)
    return(candidates)
  }

  if (progress) {
    message("Making ", prettyNum(num_rows, big.mark = ","),
            " comparisons.")
    if (!use_parallel) {
      pb <- txtProgressBar(min = 0, max = num_rows, style = 3)
    }
  }

  apply_fun <- get_apply_function()
  scores <- apply_fun(seq_along(rows_to_score), function(j) {
    i <- rows_to_score[j]
    a <- candidates$a[i]
    b <- candidates$b[i]
    score <- f(corpus[[a]], corpus[[b]])
    if (progress && !use_parallel) setTxtProgressBar(pb, j)
    score
  })

  candidates$score[rows_to_score] <- unlist(scores, use.names = FALSE)

  if (progress && !use_parallel) close(pb)

  attr(candidates, "all-doc-ids") <- names(corpus)

  candidates
}


================================================
FILE: R/lsh_probability.R
================================================
#' Probability that a candidate pair will be detected with LSH
#'
#' Functions to help choose the correct parameters for the \code{\link{lsh}} and
#' \code{\link{minhash_generator}} functions. Use \code{lsh_threshold} to
#' determine the minimum Jaccard similarity for two documents for them to likely
#' be considered a match. Use \code{lsh_probability} to determine the
#' probability that a pair of documents with a known Jaccard similarity will be
#' detected.
#'
#' @param h The number of minhash signatures.
#' @param b The number of LSH bands.
#' @param s The Jaccard similarity.
#' @details Locality sensitive hashing returns a list of possible matches for
#' similar documents. How likely is it that a pair of documents will be detected
#' as a possible match? If \code{h} is the number of minhash signatures,
#' \code{b} is the number of bands in the LSH function (implying then that the
#' number of rows \code{r = h / b}), and \code{s} is the actual Jaccard
#' similarity of the two documents, then the probability \code{p} that the two
#' documents will be marked as a candidate pair is given by this equation.
#'
#' \deqn{p = 1 - (1 - s^{r})^{b}}
#'
#' According to \href{http://infolab.stanford.edu/~ullman/mmds/book.pdf}{MMDS},
#' that equation approximates an S-curve. This implies that there is a threshold
#' (\code{t}) for \code{s} approximated by this equation.
#'
#' \deqn{t = \left(\frac{1}{b}\right)^{\frac{1}{r}}}
#'
#' @references Jure Leskovec, Anand Rajaraman, and Jeff Ullman,
#'  \emph{Mining of Massive Datasets} (Cambridge University Press, 2011), ch. 3.
#' @examples
#' # Threshold for 200 minhashes split into 40 bands
#' lsh_threshold(h = 200, b = 40)
#'
#' # Probability for varying values of s
#' lsh_probability(h = 200, b = 40, s = .25)
#' lsh_probability(h = 200, b = 40, s = .50)
#' lsh_probability(h = 200, b = 40, s = .75)
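#'
#' # A worked check of the formulas in the details: with h = 200 and b = 40
#' # there are r = 200 / 40 = 5 rows per band, so both results can be
#' # reproduced with plain arithmetic.
#' r <- 200 / 40
#' all.equal(lsh_threshold(h = 200, b = 40), (1 / 40)^(1 / r))
#' all.equal(lsh_probability(h = 200, b = 40, s = .50),
#'           1 - (1 - .50^r)^40)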
#' @export
lsh_probability <- function(h, b, s) {
  assert_that(is.count(h),
              is.count(b),
              check_banding(h, b),
              is.number(s))
  1 - (1 - s ^ (h / b)) ^ b
}

#' @rdname lsh_probability
#' @export
lsh_threshold <- function(h, b) {
  assert_that(is.count(h),
              is.count(b),
              check_banding(h, b))
  (1 / b) ^ (1 / (h / b))
}


================================================
FILE: R/lsh_query.R
================================================
#' Query an LSH cache for matches to a single document
#'
#' This function retrieves the matches for a single document from an
#' \code{lsh_buckets} object created by \code{\link{lsh}}. See
#' \code{\link{lsh_candidates}} to retrieve all pairs of matches.
#'
#' @param buckets An \code{lsh_buckets} object created by \code{\link{lsh}}.
#' @param id The document ID to find matches for.
#'
#' @return A \code{textreuse_candidates} data frame with matches to the
#'   document specified.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 235)
#' corpus <- TextReuseCorpus(dir = dir,
#'                           tokenizer = tokenize_ngrams, n = 5,
#'                           minhash_func = minhash)
#' buckets <- lsh(corpus, bands = 50)
#' lsh_query(buckets, "ny1850-match")
#'
#' @seealso \code{\link{lsh}}, \code{\link{lsh_candidates}}
#' @export
lsh_query <- function(buckets, id) {
  assert_that(is_lsh_buckets(buckets),
              is.string(id))

  signatures <- buckets %>%
    dplyr::filter(.data$doc == id) %>%
    `$`("buckets")

  docs <- buckets %>%
    dplyr::filter(.data$buckets %in% signatures) %>%
    `$`("doc")

  # Add the score column last: distinct() keeps only the columns it is given
  res <- tibble::tibble(a = id, b = docs) %>%
    dplyr::filter(.data$a != .data$b) %>%
    dplyr::distinct(.data$a, .data$b) %>%
    dplyr::mutate(score = NA_real_)

  class(res) <- c("textreuse_candidates", class(res))

  res
}


================================================
FILE: R/lsh_subset.R
================================================
#' List of all candidates in a corpus
#'
#' @param candidates A data frame of candidate pairs from
#'   \code{\link{lsh_candidates}}.
#' @return A character vector of document IDs from the candidate pairs, to be
#'   used to subset the \code{\link{TextReuseCorpus}}.
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 234)
#' corpus <- TextReuseCorpus(dir = dir,
#'                           tokenizer = tokenize_ngrams, n = 5,
#'                           minhash_func = minhash)
#' buckets <- lsh(corpus, bands = 50)
#' candidates <- lsh_candidates(buckets)
#' lsh_subset(candidates)
#' corpus[lsh_subset(candidates)]
#' @export
lsh_subset <- function(candidates) {
  assert_that(is_candidates_df(candidates))
  sort(unique(c(candidates$a, candidates$b)))
}


================================================
FILE: R/minhash.R
================================================
#' Generate a minhash function
#'
#' A minhash value is calculated by hashing the strings in a character vector to
#' integers and then selecting the minimum value. Repeated minhash values are
#' generated by using different hash functions: these different hash functions
#' are created by performing a bitwise \code{XOR} operation
#' (\code{\link{bitwXor}}) with a vector of random integers. Since it is vital
#' that the same random integers be used for each document, this function
#' generates another function which will always use the same integers. The
#' returned function is intended to be passed to the \code{hash_func} parameter
#' of \code{\link{TextReuseTextDocument}}.
#'
#' @param n The number of minhashes that the returned function should generate.
#' @param seed An optional parameter to set the seed used in generating the random
#'   numbers to ensure that the same minhash function is used on repeated
#'   applications.
#' @return A function which will take a character vector and return \code{n}
#'   minhashes.
#' @references Jure Leskovec, Anand Rajaraman, and Jeff Ullman,
#'   \emph{Mining of Massive Datasets} (Cambridge University Press, 2011), ch.
#'   3. See also Matthew Casperson,
#'   "\href{http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html}{Minhash
#'    for Dummies}" (November 14, 2013).
#' @seealso \code{\link{lsh}}
#' @examples
#' set.seed(253)
#' minhash <- minhash_generator(10)
#'
#' # Example with a TextReuseTextDocument
#' file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
#' doc <- TextReuseTextDocument(file = file, hash_func = minhash,
#'                              keep_tokens = TRUE)
#' hashes(doc)
#'
#' # Example with a character vector
#' is.character(tokens(doc))
#' minhash(tokens(doc))
#' @export
minhash_generator <- function(n = 200, seed = NULL) {
  assert_that(is.count(n))
  if (!is.null(seed)) set.seed(seed)
  r <- random_ints(n)
  f <- function(x) {
    assert_that(is.character(x))
    h <- hash_string(x)
    vapply(r, function(i) { min(bitwXor(h, i)) },
           integer(1), USE.NAMES = FALSE)
  }
  f
}

# Generate random integers for minhashing
#
# It is crucial that you use the same random integers for every document in the
# corpus. The random integers generated by this function are intended to be
# passed to \code{\link{minhash}}.
# @param n The number of random integers to generate.
# @return A vector of integers
# @seealso \code{\link{minhash}}
# @examples
# random_ints(3)
random_ints <- function(n) {
  as.integer(stats::runif(n, -2147483648, 2147483647))
}


================================================
FILE: R/pairwise_candidates.R
================================================
#' Candidate pairs from pairwise comparisons
#'
#' Converts a comparison matrix generated by \code{\link{pairwise_compare}} into a
#' data frame of candidates for matches.
#'
#' @param m A matrix from \code{\link{pairwise_compare}}.
#' @param directional Should be set to the same value as in
#'   \code{\link{pairwise_compare}}.
#' @return A data frame containing all the non-\code{NA} values from \code{m}.
#'   Columns \code{a} and \code{b} are the IDs from the original corpus as
#'   passed to the comparison function. Column \code{score} is the score
#'   returned by the comparison function.
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir)
#'
#' m1 <- pairwise_compare(corpus, ratio_of_matches, directional = TRUE)
#' pairwise_candidates(m1, directional = TRUE)
#'
#' m2 <- pairwise_compare(corpus, jaccard_similarity)
#' pairwise_candidates(m2)
#' @export
pairwise_candidates <- function(m, directional = FALSE) {
  assert_that(is.matrix(m))
  matches <- which(!is.na(m))
  indexes <- arrayInd(matches, dim(m))
  score <- m[matches]
  a <- rownames(m)[indexes[ , 1]]
  b <- colnames(m)[indexes[ , 2]]
  df <- data.frame(a = a, b = b, score = score, stringsAsFactors = FALSE)
  if (!directional) df <- sort_df_by_rows(df)
  df <- sort_df_by_columns(df)
  class(df) <- c("textreuse_candidates", "tbl_df", "tbl", "data.frame")
  df
}


================================================
FILE: R/pairwise_compare.R
================================================
#' Pairwise comparisons among documents in a corpus
#'
#' Given a \code{\link{TextReuseCorpus}} containing documents of class
#' \code{\link{TextReuseTextDocument}}, this function applies a comparison
#' function to every pairing of documents, and returns a matrix with the
#' comparison scores.
#'
#' @param corpus A \code{\link{TextReuseCorpus}}.
#' @param f The comparison function to apply to each pair of documents.
#' @param ... Additional arguments passed to \code{f}.
#' @param directional Some comparison functions are commutative, so that
#'   \code{f(a, b) == f(b, a)} (e.g., \code{\link{jaccard_similarity}}). Other
#'   functions are directional, so that \code{f(a, b)} measures \code{a}'s
#'   borrowing from \code{b}, which may not be the same as \code{f(b, a)} (e.g.,
#'   \code{\link{ratio_of_matches}}). If \code{directional} is \code{FALSE},
#'   then only the minimum number of comparisons will be made, i.e., the upper
#'   triangle of the matrix. If \code{directional} is \code{TRUE}, then both
#'   directional comparisons will be measured. In no case, however, will
#'   documents be compared to themselves, i.e., the diagonal of the matrix.
#' @param progress Display a progress bar while comparing documents.
#'
#' @return A square matrix with dimensions equal to the length of the corpus,
#'   and row and column names set by the names of the documents in the corpus. A
#'   value of \code{NA} in the matrix indicates that a comparison was not made.
#'   For directional comparisons, the comparison reported is
#'   \code{f(row, column)}.
#'
#' @seealso See these document comparison functions,
#'   \code{\link{jaccard_similarity}}, \code{\link{ratio_of_matches}}.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir)
#' names(corpus) <- filenames(names(corpus))
#'
#' # A non-directional comparison
#' pairwise_compare(corpus, jaccard_similarity)
#'
#' # A directional comparison
#' pairwise_compare(corpus, ratio_of_matches, directional = TRUE)
#' @export
pairwise_compare <- function(corpus, f, ..., directional = FALSE,
                        progress = interactive()) {
  assert_that(is.TextReuseCorpus(corpus),
              is.function(f))

  len <- length(corpus)
  ids <- names(corpus)

  m <- matrix(0, len, len, dimnames = list(ids, ids))

  if (!directional)
    m[lower.tri(m, diag = TRUE)] <- NA
  else
    diag(m) <- NA


  if (progress) {
    num_pairs <- sum(!is.na(m))
    message("Making ", prettyNum(num_pairs, big.mark = ","), " comparisons.")
    pb <- txtProgressBar(min = 0, max = num_pairs, style = 3)
  }

  for (i in seq_along(m)) {
    if (is.na(m[i])) next
    indexes <- arrayInd(i, dim(m))
    m[indexes] <- f(corpus[[indexes[1]]], corpus[[indexes[2]]], ...)
    if (progress) setTxtProgressBar(pb, getTxtProgressBar(pb) + 1)
  }

  if (progress) close(pb)

  m

}


================================================
FILE: R/parallel.R
================================================
# Check if the option `mc.cores` has been set. If it has, return `mclapply`
# instead of `lapply`. But in no circumstances use `mclapply` on Windows.
using_parallel <- function() {
  cores_set <- !is.null(getOption("mc.cores"))
  windows <- .Platform$OS.type == "windows"
  cores_set && !windows
}

get_apply_function <- function() {
 if (using_parallel())
   return(parallel::mclapply)
 else
   return(lapply)
}


================================================
FILE: R/rehash.R
================================================
#' Recompute the hashes for a document or corpus
#'
#' Given a \code{\link{TextReuseTextDocument}} or a
#' \code{\link{TextReuseCorpus}}, this function recomputes either the hashes or
#' the minhashes with the function specified. This implies that you have
#' retained the tokens with the \code{keep_tokens = TRUE} parameter.
#'
#' @param x A \code{\link{TextReuseTextDocument}} or
#'   \code{\link{TextReuseCorpus}}.
#' @param func A function to either hash the tokens or to generate the minhash
#'   signature. See \code{\link{hash_string}}, \code{\link{minhash_generator}}.
#' @param type Recompute the \code{hashes} or \code{minhashes}?
#'
#' @return The modified \code{\link{TextReuseTextDocument}} or
#'   \code{\link{TextReuseCorpus}}.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash1 <- minhash_generator(seed = 1)
#' corpus <- TextReuseCorpus(dir = dir, minhash_func = minhash1, keep_tokens = TRUE)
#' head(minhashes(corpus[[1]]))
#' minhash2 <- minhash_generator(seed = 2)
#' corpus <- rehash(corpus, minhash2, type = "minhashes")
#' head(minhashes(corpus[[1]]))
#'
#' @export
rehash <- function(x, func, type = c("hashes", "minhashes")) {
  UseMethod("rehash", x)
}

#' @export
rehash.TextReuseTextDocument <- function(x, func,
                                         type = c("hashes", "minhashes")) {
  assert_that(has_tokens(x),
              is.function(func))
  type <- match.arg(type)

  if (type == "hashes") {
    x$hashes <- func(x$tokens)
    x$meta$hash_func <- as.character(substitute(func))
  } else if (type == "minhashes") {
    x$minhashes <- func(x$tokens)
    x$meta$minhash_func <- as.character(substitute(func))
  }
  x
}

#' @export
rehash.TextReuseCorpus <- function(x, func,  type = c("hashes", "minhashes")) {
  assert_that(is.function(func))
  type <- match.arg(type)
  apply_func <- get_apply_function()
  x$documents <- apply_func(x$documents, rehash, func, type)
  if (type == "hashes")
    x$meta$hash_func <- as.character(substitute(func))
  else if (type == "minhashes")
    x$meta$minhash_func <- as.character(substitute(func))
  x
}


================================================
FILE: R/similarity.R
================================================
#' Measure similarity/dissimilarity in documents
#'
#' A set of functions which take two sets or bags of words and measure their
#' similarity or dissimilarity.
#'
#' @details The functions \code{jaccard_similarity} and
#'   \code{jaccard_dissimilarity} provide the Jaccard measures of similarity or
#'   dissimilarity for two sets. The coefficients will be numbers between
#'   \code{0} and \code{1}. For the similarity coefficient, the higher the
#'   number the more similar the two sets are. When applied to two documents of
#'   class \code{\link{TextReuseTextDocument}}, the hashes in those documents
#'   are compared. But this function can be passed objects of any class accepted
#'   by the set functions in base R. So it is possible, for instance, to pass
#'   this function two character vectors comprised of word, line, sentence, or
#'   paragraph tokens, or those character vectors hashed as integers.
#'
#'   The Jaccard similarity coefficient is defined as follows:
#'
#'   \deqn{J(A, B) = \frac{ | A \cap B | }{ | A \cup B | }}{ length(intersect(a,
#'   b)) / length(union(a, b))}
#'
#'   The Jaccard dissimilarity is simply
#'
#'   \deqn{1 - J(A, B)}
#'
#'   The function \code{jaccard_bag_similarity} treats \code{a} and \code{b} as
#'   bags rather than sets, so that the result is a fraction where the numerator
#'   is the sum of each matching element counted the minimum number of times it
#'   appears in each bag, and the denominator is the sum of the lengths of both
#'   bags. The maximum value for the Jaccard bag similarity is \code{0.5}.
#'
#'   The function \code{ratio_of_matches} finds the ratio between the number of
#'   items in \code{b} that are also in \code{a} and the total number of items
#'   in \code{b}. Note that this similarity measure is directional: it measures
#'   how much \code{b} borrows from \code{a}, but says nothing about how much
#'   \code{a} borrows from \code{b}.
#'
#'   The function \code{count_matches} returns the numerator used by
#'   \code{ratio_of_matches}: the number of items in \code{b} also found in
#'   \code{a}. The function \code{matching_tokens} returns those matching items
#'   from \code{b}, preserving their order and duplicates.
#'
#' @param a The first set (or bag) to be compared. The origin bag for
#'   directional comparisons.
#' @param b The second set (or bag) to be compared. The destination bag for
#'   directional comparisons.
#'
#' @examples
#' jaccard_similarity(1:6, 3:10)
#' jaccard_dissimilarity(1:6, 3:10)
#'
#' a <- c("a", "a", "a", "b")
#' b <- c("a", "a", "b", "b", "c")
#' jaccard_similarity(a, b)
#' jaccard_bag_similarity(a, b)
#' ratio_of_matches(a, b)
#' ratio_of_matches(b, a)
#' count_matches(a, b)
#' matching_tokens(a, b)
#'
#' ny         <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
#' ca_match   <- system.file("extdata/legal/ca1851-match.txt", package = "textreuse")
#' ca_nomatch <- system.file("extdata/legal/ca1851-nomatch.txt", package = "textreuse")
#'
#' ny         <- TextReuseTextDocument(file = ny,
#'                                     meta = list(id = "ny"))
#' ca_match   <- TextReuseTextDocument(file = ca_match,
#'                                     meta = list(id = "ca_match"))
#' ca_nomatch <- TextReuseTextDocument(file = ca_nomatch,
#'                                     meta = list(id = "ca_nomatch"))
#'
#' # These two should have higher similarity scores
#' jaccard_similarity(ny, ca_match)
#' ratio_of_matches(ny, ca_match)
#'
#' # These two should have lower similarity scores
#' jaccard_similarity(ny, ca_nomatch)
#' ratio_of_matches(ny, ca_nomatch)
#'
#' @references Jure Leskovec, Anand Rajaraman, and Jeff Ullman,
#'   \emph{Mining of Massive Datasets} (Cambridge University Press, 2011).
#' @name similarity-functions
NULL

#' @rdname similarity-functions
#' @export
jaccard_similarity <- function(a, b) UseMethod("jaccard_similarity")

#' @export
jaccard_similarity.default <- function(a, b) {
  assert_that(all(class(a) == class(b)))
  length(intersect(a, b)) / length(union(a, b))
}

#' @export
jaccard_similarity.TextReuseTextDocument <- function(a, b) {
  assert_that(all(class(a) == class(b)))
  jaccard_similarity(a$hashes, b$hashes)
}

#' @rdname similarity-functions
#' @export
jaccard_dissimilarity <- function(a, b) UseMethod("jaccard_dissimilarity")

#' @export
jaccard_dissimilarity.default <- function(a, b) {
  1 - jaccard_similarity(a, b)
}

#' @rdname similarity-functions
#' @export
jaccard_bag_similarity <- function(a, b) UseMethod("jaccard_bag_similarity")

#' @export
jaccard_bag_similarity.default <- function(a, b) {
  matches <- intersect(a, b)
  counts <- vapply(matches, function(x) min(sum(x == a), sum(x == b)),
                   integer(1), USE.NAMES = FALSE)
  denominator <- length(a) + length(b)
  sum(counts) / denominator
}

#' @export
jaccard_bag_similarity.TextReuseTextDocument <- function(a, b) {
  assert_that(all(class(a) == class(b)))
  jaccard_bag_similarity(a$hashes, b$hashes)
}

#' @export
#' @rdname similarity-functions
ratio_of_matches <- function(a, b) UseMethod("ratio_of_matches")

#' @export
ratio_of_matches.default <- function(a, b) {
  assert_that(all(class(a) == class(b)))
  sum(b %in% a) / length(b)
}

#' @export
ratio_of_matches.TextReuseTextDocument <- function(a, b) {
  assert_that(all(class(a) == class(b)))
  ratio_of_matches(a$hashes, b$hashes)
}

#' @export
#' @rdname similarity-functions
count_matches <- function(a, b) UseMethod("count_matches")

#' @export
count_matches.default <- function(a, b) {
  length(matching_tokens(a, b))
}

#' @export
count_matches.TextReuseTextDocument <- function(a, b) {
  assert_that(all(class(a) == class(b)))
  count_matches(a$hashes, b$hashes)
}

#' @export
#' @rdname similarity-functions
matching_tokens <- function(a, b) UseMethod("matching_tokens")

#' @export
matching_tokens.default <- function(a, b) {
  assert_that(all(class(a) == class(b)))
  b[b %in% a]
}

#' @export
matching_tokens.TextReuseTextDocument <- function(a, b) {
  assert_that(all(class(a) == class(b)),
              has_tokens(a),
              has_tokens(b))
  matching_tokens(a$tokens, b$tokens)
}


================================================
FILE: R/textreuse-package.r
================================================
#' @details
#' The best place to begin with this package is the introductory vignette.
#'
#' \code{vignette("textreuse-introduction", package = "textreuse")}
#'
#' After reading that vignette, the "pairwise," "minhash," and "alignment"
#' vignettes introduce specific paths for working with the package.
#'
#' \code{vignette("textreuse-pairwise", package = "textreuse")}
#'
#' \code{vignette("textreuse-minhash", package = "textreuse")}
#'
#' \code{vignette("textreuse-alignment", package = "textreuse")}
#'
#' Another good place to begin with the package is the documentation for loading
#' documents (\code{\link{TextReuseTextDocument}} and
#' \code{\link{TextReuseCorpus}}), for \link{tokenizers},
#' \link[=similarity-functions]{similarity functions}, and
#' \link[=lsh]{locality-sensitive hashing}.
#'
#' @references The sample data provided in the \code{extdata/ats} directory
#'   contains nineteenth-century American Tract Society publications gathered
#'   from the \href{https://archive.org/}{Internet Archive}.
#'
#'   The sample data provided in the \code{extdata/legal} directory is taken
#'   from the following nineteenth-century codes of civil procedure from
#'   California and New York.
#'
#'   \emph{Final Report of the Commissioners on Practice and Pleadings}, in 2
#'   \emph{Documents of the Assembly of New York}, 73rd Sess., No. 16, (1850):
#'   243-250, sections 597-613.
#'   \href{http://books.google.com/books?id=9HEbAQAAIAAJ&pg=PA243#v=onepage&q&f=false}{Google
#'    Books}.
#'
#'   \emph{An Act To Regulate Proceedings in Civil Cases}, 1851 \emph{California
#'   Laws} 51, 51-53 sections 4-17; 101, sections 313-316.
#'   \href{http://books.google.com/books?id=4PHEAAAAIAAJ&pg=PA51#v=onepage&q&f=false}{Google
#'    Books}.
#'
#' @useDynLib textreuse, .registration = TRUE
#' @importFrom Rcpp sourceCpp
#' @import RcppProgress
#' @import stringr
#' @import assertthat
#' @importFrom utils getTxtProgressBar setTxtProgressBar txtProgressBar
"_PACKAGE"

if (getRversion() >= "2.15.1") {
  utils::globalVariables(c("doc.x", "doc.y", "up", "dn", "a", "b", ".data", "band", "hash"))
}


================================================
FILE: R/token_index.R
================================================
#' Build an index of tokens and documents
#'
#' Build an inverted index from tokens to the documents that contain them. This
#' is useful for finding document pairs that share one or more n-grams without
#' comparing every document pair. The corpus must be created with
#' \code{keep_tokens = TRUE}.
#'
#' @param corpus A \code{\link{TextReuseCorpus}} with retained tokens.
#' @param min_doc_count Minimum number of documents a token must appear in to
#'   be retained. Increase this to remove rare tokens.
#' @param max_doc_count Maximum number of documents a token may appear in to be
#'   retained. Decrease this to remove very common tokens.
#' @return A \code{textreuse_token_index} data frame with columns \code{token},
#'   \code{docs}, and \code{n_docs}.
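#' @examples
#' # A minimal sketch, assuming the bundled legal sample texts and 5-gram
#' # tokens; these settings are illustrative, not recommended defaults.
#' # The tokens must be retained for the index to be built.
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir,
#'                           tokenizer = tokenize_ngrams, n = 5,
#'                           keep_tokens = TRUE)
#' index <- token_index(corpus)
#' token_index_candidates(index)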
#' @export
token_index <- function(corpus, min_doc_count = 2, max_doc_count = Inf) {
  assert_that(is.TextReuseCorpus(corpus),
              is.count(min_doc_count),
              is.number(max_doc_count),
              all(vapply(tokens(corpus), Negate(is.null), logical(1))))

  entries <- lapply(names(corpus), function(doc_id) {
    tibble::tibble(token = unique(tokens(corpus[[doc_id]])), doc = doc_id)
  })

  index <- dplyr::bind_rows(entries) %>%
    dplyr::group_by(.data$token) %>%
    dplyr::summarize(docs = list(sort(.data$doc)),
                     n_docs = dplyr::n(),
                     .groups = "drop") %>%
    dplyr::filter(.data$n_docs >= min_doc_count,
                  .data$n_docs <= max_doc_count) %>%
    dplyr::arrange(.data$token)

  class(index) <- c("textreuse_token_index", class(index))

  index
}

#' Extract candidate document pairs from a token index
#'
#' @param index A \code{textreuse_token_index} object returned by
#'   \code{\link{token_index}}.
#' @return A \code{textreuse_candidates} data frame.
#' @export
token_index_candidates <- function(index) {
  assert_that(inherits(index, "textreuse_token_index"))

  pair_matrices <- lapply(index$docs, function(doc_ids) {
    if (length(doc_ids) < 2) return(NULL)
    t(utils::combn(doc_ids, 2))
  })
  pair_matrices <- Filter(Negate(is.null), pair_matrices)

  if (length(pair_matrices) == 0) {
    candidates <- tibble::tibble(a = character(), b = character(),
                                 score = numeric())
  } else {
    candidates <- do.call(rbind, pair_matrices) %>%
      as.data.frame(stringsAsFactors = FALSE) %>%
      tibble::as_tibble() %>%
      stats::setNames(c("a", "b")) %>%
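      # mutate() evaluates sequentially, so keep the minimum in a temporary
      # column; otherwise the reassigned `a` would feed into `b`.
      dplyr::mutate(first = pmin(.data$a, .data$b),
                    b = pmax(.data$a, .data$b),
                    a = .data$first) %>%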
      dplyr::distinct(.data$a, .data$b) %>%
      dplyr::arrange(.data$a, .data$b) %>%
      dplyr::mutate(score = NA_real_)
  }

  class(candidates) <- c("textreuse_candidates", class(candidates))

  candidates
}


================================================
FILE: R/tokenize.R
================================================
#' Recompute the tokens for a document or corpus
#'
#' Given a \code{\link{TextReuseTextDocument}} or a
#' \code{\link{TextReuseCorpus}}, this function recomputes the tokens and hashes
#' with the functions specified. Optionally, it can also recompute the minhash signatures.
#'
#' @param x A \code{\link{TextReuseTextDocument}} or
#'   \code{\link{TextReuseCorpus}}.
#' @param tokenizer A function to split the text into tokens. See
#'   \code{\link{tokenizers}}.
#' @param ... Arguments passed on to the \code{tokenizer}.
#' @param hash_func A function to hash the tokens. See
#'   \code{\link{hash_string}}.
#' @param minhash_func A function to create minhash signatures. See
#'   \code{\link{minhash_generator}}.
#' @param keep_tokens Should the tokens be saved in the document that is
#'   returned or discarded?
#' @param keep_text Should the text be saved in the document that is returned or
#'   discarded?
#'
#' @return The modified \code{\link{TextReuseTextDocument}} or
#'   \code{\link{TextReuseCorpus}}.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
#' corpus <- tokenize(corpus, tokenize_ngrams)
#' head(tokens(corpus[[1]]))
#' @export
tokenize <- function(x, tokenizer, ..., hash_func = hash_string,
                     minhash_func = NULL, keep_tokens = TRUE,
                     keep_text = TRUE) {
  UseMethod("tokenize", x)
}

#' @export
tokenize.TextReuseTextDocument <- function(x, tokenizer, ...,
                                           hash_func = hash_string,
                                           minhash_func = NULL,
                                           keep_tokens = TRUE,
                                           keep_text = TRUE) {
  assert_that(has_content(x),
              is.function(tokenizer),
              is.function(hash_func))
  x$tokens <- tokenizer(x$content, ...)
  x$hashes <- hash_func(x$tokens)
  x$meta$tokenizer <- as.character(substitute(tokenizer))
  x$meta$hash_func <- as.character(substitute(hash_func))
  if (!is.null(minhash_func)) {
    # Compute the minhashes before the tokens are (optionally) discarded.
    x$minhashes <- minhash_func(x$tokens)
    x$meta$minhash_func <- as.character(substitute(minhash_func))
  } else {
    # If tokens are redone, minhashes are invalid, so delete them if they are
    # not also recomputed.
    x$minhashes <- NULL
    x$meta$minhash_func <- NULL
  }
  if (!keep_tokens) x$tokens <- NULL
  if (!keep_text) x$content <- NULL
  x
}

#' @export
tokenize.TextReuseCorpus <- function(x, tokenizer, ..., hash_func = hash_string,
                                     minhash_func = NULL, keep_tokens = TRUE,
                                     keep_text = TRUE) {
  apply_func <- get_apply_function()
  x$documents <- apply_func(x$documents, tokenize, tokenizer, ...,
                            hash_func = hash_func, minhash_func = minhash_func,
                            keep_tokens = keep_tokens, keep_text = keep_text)

  x$meta$tokenizer <- as.character(substitute(tokenizer))
  x$meta$hash_func <- as.character(substitute(hash_func))
  if (!is.null(minhash_func)) {
    x$meta$minhash_func <- as.character(substitute(minhash_func))
  } else {
    x$meta$minhash_func <- NULL
  }

  x

}


================================================
FILE: R/tokenizers.R
================================================
#' Split texts into tokens
#'
#' These functions each turn a text into tokens. The \code{tokenize_ngrams}
#' function returns shingled n-grams.
#'
#' @name tokenizers
#' @param string A character vector of length 1 to be tokenized.
#' @param lowercase Should the tokens be made lower case?
#' @param n For n-gram tokenizers, the number of words in each n-gram.
#' @param k For the skip n-gram tokenizer, the maximum skip distance between
#'   words. The function will compute all skip n-grams between \code{0} and
#'   \code{k}.
#' @details These functions will strip all punctuation.
#' @return A character vector containing the tokens.
#' @examples
#' dylan <- "How many roads must a man walk down? The answer is blowin' in the wind."
#' tokenize_words(dylan)
#' tokenize_sentences(dylan)
#' tokenize_ngrams(dylan, n = 2)
#' tokenize_skip_ngrams(dylan, n = 3, k = 2)
NULL

#' @export
#' @rdname tokenizers
tokenize_words <- function(string, lowercase = TRUE) {
  assert_that(assertthat::is.string(string))
  out <- str_split(string, boundary("word"))[[1]]
  if (lowercase) str_to_lower(out) else out
}

#' @export
#' @rdname tokenizers
tokenize_sentences <- function(string, lowercase = TRUE) {
  assert_that(assertthat::is.string(string))
  out <- str_split(string, boundary("sentence", skip_word_none = FALSE))[[1]]
  out <- str_replace_all(out, "[[:punct:]]", " ")
  out <- str_replace_all(out, "\\s+", " ")
  out <- str_trim(out)
  if (lowercase) str_to_lower(out) else out
}

#' @export
#' @rdname tokenizers
tokenize_ngrams <- function(string, lowercase = TRUE, n = 3) {
  assert_that(is.count(n),
              assertthat::is.string(string))
  words <- tokenize_words(string, lowercase = lowercase)
  assert_that(n < length(words))
  shingle_ngrams(words, n = n)
}

#' @export
#' @rdname tokenizers
tokenize_skip_ngrams <- function(string, lowercase = TRUE, n = 3, k = 1) {
  assert_that(is.count(n),
              is.count(k) | k == 0,
              assertthat::is.string(string))
  words <- tokenize_words(string, lowercase = lowercase)
  assert_that(n + n * k - k <= length(words))
  skip_ngrams(words, n = n, k = k)
}


================================================
FILE: R/utils.R
================================================
# Take results of readLines and turn it into a character vector of length 1
as_string <- function(x) {
  x %>%
    str_c(collapse = "\n") %>%
    NLP::as.String()
}

# Pretty print the metadata for a document
pretty_print_metadata <- function(doc) {
  lapply(names(doc$meta), function(x) cat(x, ":", doc$meta[[x]], "\n"))
}

# Check whether the number of minhashes is evenly divisible by number of bands
check_banding <- function(h, b) {
  h %% b == 0
}

assertthat::on_failure(check_banding) <- function(call, env) {
  "The number of hashes must be evenly divisible by the number of bands."
}

# Sequences for subsetting by bands in minhash
band_seq <- function(l, b) {
  assert_that(check_banding(l, b))
  r <- l / b
  starts <- seq.int(from = 1, to = l, by = r)
  lapply(starts, function(n) seq.int(n, n + r - 1, 1))
}

# Test that meta exists and that it has an ID value
has_id <- function(meta) {
  !is.null(meta$id)
}

assertthat::on_failure(has_id) <- function(call, env) {
  paste("When creating a document from a string instead of a file, the `id`",
        "field in the metadata list must be specified.")
}

# People might bind_rows() two of these data frames, so we can't rely just on
# the class.
is_lsh_buckets <- function(x) {
  identical(names(x), c("doc", "buckets")) & inherits(x, "data.frame")
}

assertthat::on_failure(is_lsh_buckets) <- function(call, env) {
  "Object is not a data frame of LSH buckets."
}

# People might run a candidates data frame through dplyr so that it loses its
# class.
is_candidates_df <- function(x) {
  class_check <- inherits(x, "textreuse_candidates")
  col_check <- all(c("a", "b", "score") %in% names(x)) & inherits(x, "data.frame")
  class_check | col_check
}

assertthat::on_failure(is_candidates_df) <- function(call, env) {
  "Object is not a candidates data frame."
}

is_integer_like <- function(x) {
  is.integer(x) | (is.scalar(x) & (x == as.integer(x)))
}

assertthat::on_failure(is_integer_like) <- function(call, env) {
 paste0(deparse(call$x), " is not a whole number.")
}

sort_meta <- function(meta) {
  meta[order(names(meta))]
}

sort_df_by_rows <- function(df) {
  assert_that(all(c("a", "b") %in% colnames(df)),
              is.data.frame(df))
  for (i in seq_len(nrow(df))) {
    ordered <- sort(c(df[i, "a"], df[i, "b"]))
    df[i, "a"] <- ordered[1]
    df[i, "b"] <- ordered[2]
  }
  df
}

sort_df_by_columns <- function(df) {
  assert_that(all(c("a", "b") %in% colnames(df)),
              is.data.frame(df))
  df <- df[with(df, order(a, b)), ]
  # rownames(df) <- NULL
  df
}

# Given a word, create a string with the same number of marker characters
mark_chars <- function(word, char) {
  str_c(rep(char, str_length(word)), collapse = "")
}


================================================
FILE: R/wordcount.R
================================================
#' Count words
#'
#' This function counts words in a text, for example, a character vector, a
#' \code{\link{TextReuseTextDocument}}, some other object that inherits from
#' \code{\link[NLP]{TextDocument}}, or all the documents in a
#' \code{\link{TextReuseCorpus}}.
#'
#' @param x The object containing a text.
#' @export
#' @return An integer vector for the word count.
wordcount <- function(x) UseMethod("wordcount", x)

#' @export
wordcount.default <- function(x) {
  assert_that(is.string(x))
  str_count(x, boundary("word"))
}

#' @export
wordcount.TextDocument <- function(x) wordcount(x$content)

#' @export
wordcount.TextReuseCorpus <- function(x) {
  vapply(x$documents, wordcount, integer(1))
}


================================================
FILE: README.Rmd
================================================
---
output: md_document
title: Detect Text Reuse and Document Similarity
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE, warning = FALSE, message = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  fig.path = "README-"
)
suppressWarnings(suppressPackageStartupMessages(library(dplyr)))
```

# textreuse

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/textreuse)](https://cran.r-project.org/package=textreuse)
[![CRAN_Downloads](https://cranlogs.r-pkg.org/badges/grand-total/textreuse)](https://cran.r-project.org/package=textreuse)
[![Coverage Status](https://img.shields.io/codecov/c/github/ropensci/textreuse/master.svg)](https://app.codecov.io/github/ropensci/textreuse?branch=master)
[![rOpenSci badge](https://badges.ropensci.org/20_status.svg)](https://github.com/ropensci/software-review/issues/20)

## Overview

This [R](https://www.r-project.org/) package provides a set of functions for measuring similarity among documents and detecting passages which have been reused. It implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language. It is useful, for example, for detecting duplicate documents in a corpus prior to text analysis or for identifying borrowed passages between texts. The classes provided by this package follow the model of other natural language processing packages for R, especially the [NLP](https://cran.r-project.org/package=NLP) and [tm](https://cran.r-project.org/package=tm) packages. (However, this package has no dependency on Java, which should make it easier to install.)

### Citation

If you use this package for scholarly research, I would appreciate a citation.

```{r}
citation("textreuse")
```

## Installation

To install this package from CRAN:

```{r eval=FALSE}
install.packages("textreuse")
```

To install the development version from GitHub, use [devtools](https://github.com/r-lib/devtools).  

```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("ropensci/textreuse", build_vignettes = TRUE)
```

## Examples

There are three main approaches that one may take when using this package: pairwise comparisons, minhashing/locality sensitive hashing, and extracting matching passages through text alignment.

See the [introductory vignette](https://docs.ropensci.org/textreuse/articles/textreuse-introduction.html) for a description of the classes provided by this package.

```{r eval = FALSE}
vignette("textreuse-introduction", package = "textreuse")
```

### Pairwise comparisons

In this example we will load a tiny corpus of three documents. These documents are drawn from Kellen Funk's [research](https://kellenfunk.org/field-code/) into the propagation of legal codes of civil procedure in the nineteenth-century United States.

```{r}
library(textreuse)
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, meta = list(title = "Civil procedure"),
                          tokenizer = tokenize_ngrams, n = 7)
```

We have loaded the three documents into a corpus, which involves tokenizing the text and hashing the tokens. We can inspect the corpus as a whole or the individual documents that make it up.

```{r}
corpus
names(corpus)
corpus[["ca1851-match"]]
```

Now we can compare each of the documents to one another. The `pairwise_compare()` function applies a comparison function (in this case, `jaccard_similarity()`) to every pair of documents. The result is a matrix of scores. As we would expect, some documents are similar and others are not.

```{r}
comparisons <- pairwise_compare(corpus, jaccard_similarity)
comparisons
```
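
As a rough illustration of what a Jaccard score measures, the sketch below computes the ratio of shared to distinct tokens for two small word vectors in base R. The function name `jaccard_sketch()` and the sample vectors are hypothetical and are not part of the package; the package's own `jaccard_similarity()` should be used on real documents.

```{r eval=FALSE}
# Hypothetical helper: Jaccard similarity of two token sets,
# i.e. size of the intersection divided by size of the union.
jaccard_sketch <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

x <- c("the", "quick", "brown", "fox")
y <- c("the", "lazy", "brown", "dog")

# Two shared tokens ("the", "brown") out of six distinct tokens.
jaccard_sketch(x, y)
```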

We can convert that matrix to a data frame of pairs and scores if we prefer.

```{r}
pairwise_candidates(comparisons)
```

See the [pairwise vignette](https://docs.ropensci.org/textreuse/articles/textreuse-pairwise.html) for a fuller description.

```{r eval=FALSE}
vignette("textreuse-pairwise", package = "textreuse")
```

### Minhashing and locality sensitive hashing

Pairwise comparisons can be very time-consuming because their number grows quadratically with the size of the corpus. (A corpus with 10 documents would require 45 comparisons; a corpus with 100 documents would require 4,950 comparisons; a corpus with 1,000 documents would require 499,500 comparisons.) That's why this package implements the minhash and locality sensitive hashing algorithms, which can detect candidate pairs much faster than pairwise comparisons in corpora of any significant size.
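
The counts quoted above are the number of unordered pairs of documents, n(n - 1)/2. A minimal base R check, using a hypothetical vector name `corpus_sizes`:

```{r eval=FALSE}
# Number of unordered document pairs for several corpus sizes.
corpus_sizes <- c(10, 100, 1000)
n_pairs <- choose(corpus_sizes, 2)
n_pairs
# 45 4950 499500

# The same values from the closed form n * (n - 1) / 2.
corpus_sizes * (corpus_sizes - 1) / 2
```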

For this example we will load a small corpus of eight documents published by the American Tract Society. We will also create a minhash function, which represents an entire document (regardless of length) by a fixed number of integer hashes. When we create the corpus, the documents will each have a minhash signature.

```{r}
dir <- system.file("extdata/ats", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
ats <- TextReuseCorpus(dir = dir,
                       tokenizer = tokenize_ngrams, n = 5,
                       minhash_func = minhash)
```

Now we can calculate potential matches, extract the candidates, and apply a comparison function to just those candidates.

```{r}
buckets <- lsh(ats, bands = 50, progress = FALSE)
candidates <- lsh_candidates(buckets)
scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
scores
```
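
To get a feel for what `bands = 50` implies with 200 minhashes, the sketch below uses the standard banding approximation from the locality sensitive hashing literature: with `b` bands of `r` rows each, a pair with Jaccard similarity `s` becomes a candidate with probability `1 - (1 - s^r)^b`, and the similarity at which detection becomes likely is roughly `(1 / b)^(1 / r)`. This is a back-of-the-envelope calculation in base R under that assumption, not output from the package, and the names `candidate_prob` and `threshold` are hypothetical.

```{r eval=FALSE}
h <- 200      # number of minhashes per document
b <- 50       # number of bands
r <- h / b    # rows (minhashes) per band

# Approximate similarity above which pairs are likely to be detected.
threshold <- (1 / b)^(1 / r)
threshold     # roughly 0.376

# Approximate probability that a pair with similarity s is a candidate.
candidate_prob <- function(s) 1 - (1 - s^r)^b
candidate_prob(c(0.1, 0.4, 0.7))
```

Using more bands lowers the threshold and catches weaker matches at the cost of more candidate pairs to score; using fewer bands does the opposite.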

For details, see the [minhash vignette](https://docs.ropensci.org/textreuse/articles/textreuse-minhash.html).

```{r eval=FALSE}
vignette("textreuse-minhash", package = "textreuse")
```

### Text alignment

We can also extract the optimal alignment between two documents with a version of the [Smith-Waterman](https://en.wikipedia.org/wiki/Smith-Waterman_algorithm) algorithm, used for protein sequence alignment, adapted for natural language. The longest matching substring according to scoring values will be extracted, and variations in the alignment will be marked.

```{r}
a <- "'How do I know', she asked, 'if this is a good match?'"
b <- "'This is a match', he replied."
align_local(a, b)
```
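
The idea behind the alignment score can be sketched in base R. The hypothetical function `sw_score_sketch()` below fills a Smith-Waterman scoring matrix over lowercased words and returns the best local score. The scoring values (2 for a match, -1 for a mismatch or a gap) and the simple word splitting are assumptions made for this illustration; this is not the package's implementation, and it does not recover the aligned text.

```{r eval=FALSE}
# Hypothetical sketch: best local alignment score between two strings,
# comparing word by word.
sw_score_sketch <- function(a, b, match = 2L, mismatch = -1L, gap = -1L) {
  # Split into lowercase words, dropping punctuation and empty strings.
  words <- function(x) {
    w <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
    w[nzchar(w)]
  }
  x <- words(a)
  y <- words(b)

  # H[i + 1, j + 1] holds the best score of an alignment ending at
  # word i of x and word j of y; scores never drop below zero.
  H <- matrix(0L, nrow = length(x) + 1L, ncol = length(y) + 1L)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      s <- if (x[i] == y[j]) match else mismatch
      H[i + 1L, j + 1L] <- max(0L,
                               H[i, j] + s,         # match or mismatch
                               H[i, j + 1L] + gap,  # gap in y
                               H[i + 1L, j] + gap)  # gap in x
    }
  }
  max(H)
}

a <- "'How do I know', she asked, 'if this is a good match?'"
b <- "'This is a match', he replied."
sw_score_sketch(a, b)
```

Under these assumed scoring values, four matching words and one gap give 2 + 2 + 2 - 1 + 2 = 7.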

For details, see the [text alignment vignette](https://docs.ropensci.org/textreuse/articles/textreuse-alignment.html).

```{r eval=FALSE}
vignette("textreuse-alignment", package = "textreuse")
```

### Parallel processing

Loading the corpus and creating tokens benefit from using multiple cores, if available. (This works only on non-Windows machines.) To use multiple cores, set `options("mc.cores" = 4L)`, where the number is how many cores you wish to use.
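
One way to choose the number of cores is to ask the `parallel` package, which ships with R, how many are available and leave one free. This is only a sketch; the variable name `n_cores` is hypothetical, and the right number depends on your machine and workload.

```{r eval=FALSE}
# Detect available cores; detectCores() can return NA on some systems.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Leave one core free, but always use at least one.
options(mc.cores = max(1L, n_cores - 1L))
getOption("mc.cores")
```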

### Contributing and acknowledgments

Please note that this project is released with a [Contributor Code of Conduct](https://github.com/ropensci/textreuse/blob/master/CONDUCT.md). By participating in this project you agree to abide by its terms.

Thanks to [Noam Ross](https://www.noamross.net/) for his thorough [peer review](https://github.com/ropensci/software-review/issues/20) of this package for [rOpenSci](https://ropensci.org/).

------------------------------------------------------------------------

[![rOpenSci logo](https://ropensci.org/public_images/github_footer.png)](https://ropensci.org)


================================================
FILE: README.md
================================================
<!-- README.md is generated from README.Rmd. Please edit that file -->

# textreuse

[![CRAN\_Status\_Badge](https://www.r-pkg.org/badges/version/textreuse)](https://cran.r-project.org/package=textreuse)
[![CRAN\_Downloads](https://cranlogs.r-pkg.org/badges/grand-total/textreuse)](https://cran.r-project.org/package=textreuse)
[![Coverage
Status](https://img.shields.io/codecov/c/github/ropensci/textreuse/master.svg)](https://app.codecov.io/github/ropensci/textreuse?branch=master)
[![rOpenSci
badge](https://badges.ropensci.org/20_status.svg)](https://github.com/ropensci/software-review/issues/20)

## Overview

This [R](https://www.r-project.org/) package provides a set of functions
for measuring similarity among documents and detecting passages which
have been reused. It implements shingled n-gram, skip n-gram, and other
tokenizers; similarity/dissimilarity functions; pairwise comparisons;
minhash and locality sensitive hashing algorithms; and a version of the
Smith-Waterman local alignment algorithm suitable for natural language.
It is useful, for example, for detecting duplicate documents in a
corpus prior to text analysis or for identifying borrowed passages
between texts. The classes provided by this package follow the model of
other natural language processing packages for R, especially the
[NLP](https://cran.r-project.org/package=NLP) and
[tm](https://cran.r-project.org/package=tm) packages. (However, this
package has no dependency on Java, which should make it easier to
install.)

### Citation

If you use this package for scholarly research, I would appreciate a
citation.

    citation("textreuse")
    #> To cite package 'textreuse' in publications use:
    #> 
    #>   Mullen L, Li Y (2026). _textreuse: Detect Text Reuse and Document
    #>   Similarity_. R package version 1.0.1,
    #>   https://github.com/ropensci/textreuse,
    #>   <https://docs.ropensci.org/textreuse/>.
    #> 
    #> A BibTeX entry for LaTeX users is
    #> 
    #>   @Manual{,
    #>     title = {textreuse: Detect Text Reuse and Document Similarity},
    #>     author = {Lincoln Mullen and Yaoxiang Li},
    #>     year = {2026},
    #>     note = {R package version 1.0.1, 
    #> https://github.com/ropensci/textreuse},
    #>     url = {https://docs.ropensci.org/textreuse/},
    #>   }

## Installation

To install this package from CRAN:

    install.packages("textreuse")

To install the development version from GitHub, use
[devtools](https://github.com/r-lib/devtools).

    # install.packages("devtools")
    devtools::install_github("ropensci/textreuse", build_vignettes = TRUE)

## Examples

There are three main approaches that one may take when using this
package: pairwise comparisons, minhashing/locality sensitive hashing,
and extracting matching passages through text alignment.

See the [introductory
vignette](https://docs.ropensci.org/textreuse/articles/textreuse-introduction.html)
for a description of the classes provided by this package.

    vignette("textreuse-introduction", package = "textreuse")

### Pairwise comparisons

In this example we will load a tiny corpus of three documents. These
documents are drawn from Kellen Funk’s
[research](https://kellenfunk.org/field-code/) into the propagation of
legal codes of civil procedure in the nineteenth-century United States.

    library(textreuse)
    dir <- system.file("extdata/legal", package = "textreuse")
    corpus <- TextReuseCorpus(dir = dir, meta = list(title = "Civil procedure"),
                              tokenizer = tokenize_ngrams, n = 7)

We have loaded the three documents into a corpus, which involves
tokenizing the text and hashing the tokens. We can inspect the corpus as
a whole or the individual documents that make it up.

    corpus
    #> TextReuseCorpus
    #> Number of documents: 3 
    #> hash_func : hash_string 
    #> title : Civil procedure 
    #> tokenizer : tokenize_ngrams
    names(corpus)
    #> [1] "ca1851-match"   "ca1851-nomatch" "ny1850-match"
    corpus[["ca1851-match"]]
    #> TextReuseTextDocument
    #> file : C:/Users/Bach/AppData/Local/R/win-library/4.4/textreuse/extdata/legal/ca1851-match.txt 
    #> hash_func : hash_string 
    #> id : ca1851-match 
    #> minhash_func : 
    #> tokenizer : tokenize_ngrams 
    #> content : § 4. Every action shall be prosecuted in the name of the real party
    #> in interest, except as otherwise provided in this Act.
    #> 
    #> § 5. In the case of an assignment of a thing in action, the action by
    #> the as

Now we can compare each of the documents to one another. The
`pairwise_compare()` function applies a comparison function (in this
case, `jaccard_similarity()`) to every pair of documents. The result is
a matrix of scores. As we would expect, some documents are similar and
others are not.

    comparisons <- pairwise_compare(corpus, jaccard_similarity)
    comparisons
    #>                ca1851-match ca1851-nomatch ny1850-match
    #> ca1851-match             NA              0    0.3842549
    #> ca1851-nomatch           NA             NA    0.0000000
    #> ny1850-match             NA             NA           NA

We can convert that matrix to a data frame of pairs and scores if we
prefer.

    pairwise_candidates(comparisons)
    #> # A tibble: 3 × 3
    #>   a              b              score
    #> * <chr>          <chr>          <dbl>
    #> 1 ca1851-match   ca1851-nomatch 0    
    #> 2 ca1851-match   ny1850-match   0.384
    #> 3 ca1851-nomatch ny1850-match   0

See the [pairwise
vignette](https://docs.ropensci.org/textreuse/articles/textreuse-pairwise.html)
for a fuller description.

    vignette("textreuse-pairwise", package = "textreuse")

### Minhashing and locality sensitive hashing

Pairwise comparisons can be very time-consuming because their number
grows quadratically with the size of the corpus. (A corpus with 10
documents would require 45 comparisons; a corpus with 100 documents
would require 4,950 comparisons; a corpus with 1,000 documents would
require 499,500 comparisons.) That’s why this package implements the
minhash and locality sensitive hashing algorithms, which can detect
candidate pairs much faster than pairwise comparisons in corpora of any
significant size.

For this example we will load a small corpus of eight documents published
by the American Tract Society. We will also create a minhash function,
which represents an entire document (regardless of length) by a fixed
number of integer hashes. When we create the corpus, the documents will
each have a minhash signature.

    dir <- system.file("extdata/ats", package = "textreuse")
    minhash <- minhash_generator(200, seed = 235)
    ats <- TextReuseCorpus(dir = dir,
                           tokenizer = tokenize_ngrams, n = 5,
                           minhash_func = minhash)

Now we can calculate potential matches, extract the candidates, and
apply a comparison function to just those candidates.

    buckets <- lsh(ats, bands = 50, progress = FALSE)
    candidates <- lsh_candidates(buckets)
    scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
    scores
    #> # A tibble: 1 × 3
    #>   a              b                      score
    #>   <chr>          <chr>                  <dbl>
    #> 1 remember00palm remembermeorholy00palm 0.701

For details, see the [minhash
vignette](https://docs.ropensci.org/textreuse/articles/textreuse-minhash.html).

    vignette("textreuse-minhash", package = "textreuse")

### Text alignment

We can also extract the optimal alignment between two documents with a
version of the
[Smith-Waterman](https://en.wikipedia.org/wiki/Smith-Waterman_algorithm)
algorithm, used for protein sequence alignment, adapted for natural
language. The longest matching substring according to scoring values
will be extracted, and variations in the alignment will be marked.

    a <- "'How do I know', she asked, 'if this is a good match?'"
    b <- "'This is a match', he replied."
    align_local(a, b)
    #> TextReuse alignment
    #> Alignment score: 7 
    #> Document A:
    #> this is a good match
    #> 
    #> Document B:
    #> This is a #### match

For details, see the [text alignment
vignette](https://docs.ropensci.org/textreuse/articles/textreuse-alignment.html).

    vignette("textreuse-alignment", package = "textreuse")

### Parallel processing

Loading the corpus and creating tokens benefit from using multiple
cores, if available. (This works only on non-Windows machines.) To use
multiple cores, set `options("mc.cores" = 4L)`, where the number is how
many cores you wish to use.

### Contributing and acknowledgments

Please note that this project is released with a [Contributor Code of
Conduct](https://github.com/ropensci/textreuse/blob/master/CONDUCT.md).
By participating in this project you agree to abide by its terms.

Thanks to [Noam Ross](https://www.noamross.net/) for his thorough [peer
review](https://github.com/ropensci/software-review/issues/20) of this
package for [rOpenSci](https://ropensci.org/).

------------------------------------------------------------------------

[![rOpenSci
logo](https://ropensci.org/public_images/github_footer.png)](https://ropensci.org)


================================================
FILE: _pkgdown.yml
================================================
url: https://docs.ropensci.org/textreuse/

template:
  bootstrap: 5
  bootswatch: united

authors:
  Yaoxiang Li:
    href: "https://github.com/yaoxiangli"


================================================
FILE: appveyor.yml
================================================
# DO NOT CHANGE the "init" and "install" sections below

# Download script file from GitHub
init:
  ps: |
        $ErrorActionPreference = "Stop"
        Invoke-WebRequest http://raw.github.com/krlmlr/r-appveyor/master/scripts/appveyor-tool.ps1 -OutFile "..\appveyor-tool.ps1"
        Import-Module '..\appveyor-tool.ps1'

install:
  ps: Bootstrap

# Adapt as necessary starting from here

build: off

build_script:
  - travis-tool.sh install_deps

test_script:
  - travis-tool.sh run_tests

on_failure:
  - travis-tool.sh dump_logs

artifacts:
  - path: '*.Rcheck\**\*.log'
    name: Logs

  - path: '*.Rcheck\**\*.out'
    name: Logs

  - path: '*.Rcheck\**\*.fail'
    name: Logs

  - path: '*.Rcheck\**\*.Rout'
    name: Logs

  - path: '\*_*.tar.gz'
    name: Bits

  - path: '\*_*.zip'
    name: Bits


================================================
FILE: cran-comments.md
================================================
This is a new release with bug fixes, documentation refreshes, and helper
functions added after a long maintenance interval.

This resubmission fixes a moved URL in README.md:
https://github.com/hadley/devtools was replaced with
https://github.com/r-lib/devtools.

This is also a maintainer change release. The maintainer has changed from
Lincoln Mullen <lincoln@lincolnmullen.com> to Yaoxiang Li
<liyaoxiang@outlook.com>. The previous maintainer, Lincoln Mullen, has
confirmed by email that he supports the maintainer transition. I can provide
the email thread if requested.

## Test environments

* local Windows 11 install: R 4.4.2

## R CMD check results

There were no ERRORs or WARNINGs.

Local checks were run with:

`R CMD check --no-manual textreuse_1.0.1.tar.gz`

`R CMD check --as-cran --no-manual textreuse_1.0.1.tar.gz`

The `--as-cran` check reported three NOTEs:

* This release changes the maintainer from Lincoln Mullen to Yaoxiang Li.
* The local Windows check was unable to verify the current time.
* The local Windows check reported that README.md or NEWS.md could not be
  checked without pandoc. README.md was regenerated locally with rmarkdown,
  and the pkgdown site was built locally with RStudio Pandoc before release.

There were no invalid URL NOTEs.

## Downstream dependencies

There are no known downstream dependency issues.


================================================
FILE: inst/extdata/ats/calltounconv00baxt.txt
================================================

Glass. 



Book,____._._ 



(ttmmtixc 



mm¥m %m(m>m 



v'OJj 




A 

CALL 

TO 

THE TTNCONVEETS9. 



BY REV. RICHARD BAXTER. 



AN INTRODUCTORY ESSAY, 

BY RCV. THOMAS CHALMERS, D. D. 



PDBUSHED BY THE 

AMERICAN TRACT SOCIETY, 

150 NASSAU-STREET, KEWYORK. 



D. Fanshaw, Printer. 



DR. CHALMERS' 

|f INTRODUCTORY ESSAY, 

ABRIDGED. 



The " Call to the Unconverted " by Richard Bax • 
ter, is characterized by all that solemn earnestness, 
and urgency of appeal, for winch the writings of this 
much-admired author are so peculiarly distinguished. 
He seems to look upon mankind solely with the eyes 
of the Spirit, and exclusively to recognize them in 
their spiritual relations, and in the great and essential 
elements of their immortal being. Their future des- 
tiny is the all-important concern which fills and en- 
grosses his mind, and he regards nothing of any mag- 
nitude but what has a distinct bearing on their spiri- 
tual and eternal condition. His business, therefore, is 
always with the conscience, to which he makes the 
most forcible appeals, ami which he plies with all 
those arguments which are fitted to awaken the sinner 
to a deep sense of the necessity and importance of im- 
mediate repentance. He endeavors to move him by 
the most touching of all representations, the tender- 
ness of a beseeching God waiting to be gracious, and 
not willing that any should perish ; and while he em- 
ploys every form of entreaty, which tenderness and 
compassion can suggest, to allure the sinner to "turn 
and live," he does not shrink from forcing on his con- 
victions those considerations which are fitted to alarm 
his fears, the terrors of the Lord, and the wrath, not 
merely of an offended Lawgiver, but of a God of love, 
whose threatemngs he disregards, whose grace he des* 



4 INTRODUCTION. 

pises, and whose mercy he rejects. And aware of the 
deceitfulness of sin in hardening the heart, and in be- 
traying the sinner into a neglect of his spiritual inte- 
rests, he divests him of every refuge, and strips him of 
every plea for postponing his preparation for eternity. 
He forcibly exposes the delusion of convenient seasons, 
and the awful infatuation and hazard of delay, and 
Knowing the magnitude of the stake at issue, he urges 
the sinner to immediate repentance, as if the fearful 
and almost absolute alternative were "Now or Never." 
And to secure the commencement of such an important 
work against all the dangers to which procrastination 
might expose it, he endeavors to arrest the sinner in 
his career of guilt and unconcern, and resolutely to fix 
his determination on " turning to God this day with- 
out delay." 

There are two very prevalent delusions on this sub- 
ject, which we should like to expose ; the one regards 
the nature, and the other the season of repentance; 
both of which are pregnant with mischief to the minds 
of men. With regard to the first, much mischief has 
arisen from mistakes respecting the meaning of tho 
tenn repentance. The word repentance occurs with 
two different meanings in the New Testament ; and 
it is to be regretted, that two different words could not 
have been devised to express these. This is charge- 
able upon the poverty of our language; for it is to be 
observed, that in the original Greek the distinction in 
the meanings is pointed out by a distinction in tho 
words. The employment of one term to denote two 
different things has the effect of confounding and mis- 
leading the understanding ; and it is much to be 
wished, that every ambiguity of this kind were clear- 
ed away from that most interesting point in the pro- 



INTRODUCTION. 5 

cess of a human soul, ax which it turns from sin unto 
righteousness, and from the power of Satan unto God. 

When in common language, a man says, " I repent 
of such an action," he is understood to say, " I am sorry 
for having done it." The feeling is familiar to all of 
us. How often does the man of dissipation prove this 
sense of the word repentance, when he awakes in the 
monung, and, oppressed by the languor of ins ex- 
hausted faculties, looks back with remorse on the fol- 
lies and profligacies of the night that is past? How 
often does the man of unguarded conversation prove it, 
when he thinks of the friends whose feelings he has 
wounded by some hasty utterance which he cannot 
recall 1 How often is it proved by the man of business, 
when he reflects on the rash engagement which ties 
him down to a losing speculation? All these people 
would be perfectly understood wiien they say, " We 
repent of these doings." The word repentance so 
applied is about equivalent to the word regret. There 
are several passages in the New Testament where 
this is the undoubted sense of the word repentance. 
In Matt. 27: 3. the wretched Judas repented himself 
of his treachery ; and surely, wlien we think of the 
awful denunciation uttered by our Savior against the 
man who should betray him, that it were better for 
him if he had not been born, we shall never confound 
the repentance which Judas experienced with that 
repentance winch is unto salvation. 

Now here lies the danger to practical Christianity. 
In the above-cited passage, to repent is just to regret, 
or to be sorry for ; and tins we conceive .o be by fai 
the most prevailing sense of the term in the English 
language. But there are other places where the same 
term is employed to denote that which is urged upon 



6 INTRODUCTION. 

us as a duty — that which is preached for the remis- 
sion of sins — that which is so indispensable to sinners, 
as to call forth the declaration from our Savior, thai 
unless we have it, we shall all likewise perish. Now, 
though repentance, in all these cases, is expressed by 
the same term in our translation as the repentance ot 
mere regret, it is expressed by a different term in the 
original record of our faith. This surely might lead 
us to suspect a difference of meaning, and should cau- 
tion us against taking up with that, as sufficient for 
the business of our salvation, which is short of saving 
and scriptural repentance. There may be an alterna- 
tion of wilful sin, and of deep-felt sorrow, up to th© 
very end of our history — there may be a presumptu- 
ous sin committed every day, and a sorrow regularly 
succeeding it. Sorrow may imbitter every act of sin — 
sorrow may darken every interval of sinful indul- 
gence — and sorrow may give an unutterable anguish 
to the pains and the prospects of a deathbed. Couple 
all this with the circumstance that sorrow passes, in 
the common currency of our language, for repentance , 
and that repentance is made, by our Bible, to lie at 
the turning point from a state of condemnation to a 
state of acceptance with God; and it is difficult not to 
conceive thai much danger may have arisen from this, 
leading to indistinct views of the nature of repentance, 
and to slender and superficial conceptions of the migh- 
ty change which is implied in it. 

We are far from saying that the eye of Christiana 
is not open to this danger — and that the vigilant care 
of Christian authors has not been employed in avert- 
ing it. Where will we get a better definition of re- 
pentance unto life than in our Shorter Catechism? by 
which the sinner is represented not merely as grieving. 



INTRODUCTION. 7 

but, along with his grief and hatred of sin, aa turning 
from it unto God with full purpose of, and endeavor 
after new obedience. But the mischief is, that the 
word repent has a common meaning, different from 
the theological ; that wherever it is used, this common 
meaning is apt to intrude itself, and exert a kind of 
habitual imposition upon the understanding — that the 
influence of the single word carries it over the influ- 
ence of the lengthened explanation — and thus it is 
that, for a steady progress in the obedience of the 
gospel, many persevere, to the end of their days, in a 
wretched course of sinning and cf sorrowing, without 
fruit and without amendment. 

To save the practically mischievous effect arising 
from the application of one term to two different things, 
one distinct and appropriate tenn has been suggested 
for the saving repentance of the New Testament. 
The term repentance itself has been restricted to the 
repentance of mere sorrow, and is made equivalent to 
regret ; and for the other, able translators have 
adopted the word reformation. The one is expressive 
of sorrow for our past conduct ; the other is expressive 
of our renouncing it. It denotes an actual turning 
from the habits of life that we are sorry for. Give us, 
say they, a change from bad deeds to good deeds, 
from bad habits to good habits, from a life of wicked- 
ness to a life of conformity to the requirements of 
heaven, and you give us reformation. 

Now there is often nothing more unprofitable than 
a dispute about words ; but if a word has got into com- 
mon use, a common and generally understood mean- 
ing is attached to it ; and if this meaning does not 
just come up to the thing which we want to express 
by it, the application of that word to that thing has 



3 INTRODUCTION- 

the same misleading effects as in the case already 
alluded to. Now, we have much the same kind of 
exception to allege against the term reformation, that 
we have alleged against the term repentance. The 
term repentance is inadequate — and why? because, 
in the common use of it, it is equivalent to regret, and 
regret is short of the saving change that is spoken of 
in the New Testament. On the very same principle, 
we count the term reformation to be inadequate. We 
think that, in common language, a man would receive 
the appellation of a reformed man upon the mere 
change of his outward habits, without any reference 
to the change of mind and of principle which gave 
rise to it. Let the drunkard give up his excesses — 
let the backbiter give up his evil speakings — let the 
extortioner give up his unfair charges — and we would 
apply to one and all of them, upon the mere change 
of their external doings, the character of reformed 
men. Now, it is evident that the drunkard may give 
up his drunkenness, because checked by a serious im- 
pression of the injury he has been doing to his health 
and his circumstances. The backbiter may give up 
his evil speaking, on being made to perceive that the 
hateful practice has brought upon him the contempt 
and alienation of his neighbors. The extortioner may 
give up his unfair charges, upon taking it into calcu- 
lation that his business is likely to suffer by the deser- 
tion of his customers. Now, it is evident, that though 
in each of these cases there has been what the world 
would call reformation, there has not been scriptural 
repentance. The deficiency of the former term con- 
gists in its having been employed to denote a mere 
change in the deeds or in the habits of the outward 
man • and if employed as equivalent to repentance, it 



INTRODUCTION. y 

may delude us into the idea that the change by which 
we are made meet for a happy eternity is a far more 
slender and superficial thing than it really is. It is 
of little importance to be told that the translator means 
it only in the sense of a reformed conduct, proceeding 
from the influence of a new and a right principle 
within. The common meaning of the word will, as 
in the former instance, be ever and anon intruding 
itself, and get the better of all the formal cautions, and 
all the qualifying clauses of our Bible commentators. 
But, will not the original word itself throw some 
light upon this important question? The repentance 
which is enjoined as a duty — the repentance which 
is unto salvation — the repentance which sinners un- 
dergo when they pass to a state of acceptance with 
God from a state of enmity against him — these are 
all one and the same thing, and are expressed by one 
and the same word in the original language of the 
New Testament. It is different from the word which 
expresses the repentance of sorrow ; and if translated 
according to the parts of which it is composed, it sig- 
nifies neither more nor less than a change of mind. 
This of itself is sufficient to prove the inadequacy ot 
the term reformation^a term which is often applied 
to a man upon the mere change of his conduct, with- 
out ever adverting to the state of his mind, or to the 
kind of change in motive and in principle which it 
has undergone. It is true, that there can be no change 
in the conduct without some change in the inward 
principle. A reformed drunkard, before careless about 
health or fortune, may be so far changed as to become 
impressed with these considerations; but this change 
is evidently short of that which the Bible calls repent- 
ance toward God. It is a change that may, and has
taken place in many a mind, when there was no 
effectual sense of the God who is above us, and of the 
eternity which is before us. It is a change, brought 
about by the prospect and the calculation of worldly 
advantages ; and, in the enjoyment of these advan- 
tages it hath its sole reward. But it is not done unto 
God, and God will not accept of it as done unto him. 
Reformation may signify nothing more than the mere 
surface-dressing of those decencies, and proprieties, 
and accomplishments, and civil and prudential duties, 
which, however fitted to secure a man's acceptance 
in society, may, one and all of them, consist with a 
heart alienated from God, and having every principle 
and affection of the inner man away from him. True, 
it is such a change as the man will reap benefit from, 
as his friends will rejoice in, as the world will call 
reformation ; but it is not such a change as will make 
him meet for heaven; nor is it, in its import, what our 
Savior speaks of, when he says, " I tell you nay, ex- 
cept ye repent, ye shall all likewise perish." 

There is no single word in the English language 
which occurs to us as fully equal to the faithful rendering
of the term in the original. Renewedness of
mind, however awkward a phrase this may be, is 
perhaps the most nearly expressive of it. Certain it 
is, that it harmonizes with those other passages of the 
Bible where the process is described by which saving 
repentance is brought about. We read of being
transformed by the renewing of our minds, of the re- 
newing of the Holy Ghost, of being renewed in the 
spirit of our minds. Scriptural repentance, therefore, 
is that deep and radical change whereby a soul turns
from the idols of sin and of self unto God, and devotes
every movement of the inner and the outer man
to the captivity of his obedience. This is the change
which, whether it be expressed by one word or not in 
the English language, we would have you well to 
understand ; and reformation or change in the out- 
ward conduct, instead of being saving and scriptural 
repentance, is what, in the language of John the 
Baptist, we would call a fruit meet for it. But if 
mischief is likely to arise, from the want of an adequate
word in our language, to that repentance which
is unto salvation, there is one effectual preservative 
against it — a firm and consistent exhibition of the 
whole counsel and revelation of God. A man who is 
well read in his New Testament, and reads it with 
docility, will dismiss all his meagre conceptions of 
repentance when he comes to the following state- 
ments: — "Except a man be born again he cannot 
see the kingdom of God." " Except ye be converted, 
and become as little children, ye shall not enter into 
the kingdom of heaven." " If any man have not the 
Spirit of Christ he is none of his." " The carnal 
mind is enmity against God ; and if ye live after the 
flesh ye shall die; but if ye, through the Spirit, do 
mortify the deeds of the body, ye shall live." " Be not 
then conformed to this world, but be ye transformed 
by the renewing of your minds." Such are the terms 
employed to describe the process by which the soul 
of man is renewed unto repentance ; and, with your 
hearts familiarized to the mighty import of these 
terms, you will carry with you an effectual guarantee 
against those false and flimsy impressions, which are 
so current in the world, about the preparation of a 
sinner for eternity. ***** 

We should like, moreover, to reduce every man to 
the feeling of repentance now or the alternative of
repentance never. We should like to flash it upon
your convictions, that, by putting the call away from 
you now, you put your eternity away from you. We 
should like to expose the whole amount of that accursed
infatuation which lies in delay. We should like to
arouse every soul out of its lethargies, and give no quarter
to the plea of a little more sleep, and a little more
slumber. We should like you to feel as if the whole of 
your future destiny hinged on the very first movement 
to which you turned yourselves. The work of repent- 
ance must have a beginning; and we should like you 
to know that, if not begun to-day, the chance will be 
less of its being begun to-morrow. And if the greater 
chance has failed, what hope can we build upon the 
smaller? — and a chance, too, that is always getting
smaller. Each day, as it revolves over the sinner's 
head, finds him a harder, and a more obstinate, and
a more helplessly enslaved sinner, than before. It
was this consideration which gave Richard Baxter 
such earnestness and such urgency in his " Call." He 
knew that the barrier in the way of the sinner's return 
was strengthened by every act of resistance to the call 
which urges it. That the refusal of this moment 
hardened the man against the next attack of a Gos- 
pel argument that is brought to bear upon him. That 
if he attempted you now, and he failed, when he came
back upon you he would find himself working on a
more obstinate and uncomplying subject than ever. 
And therefore it is that he ever feels as if the present 
were his only opportunity. That he is now upon his 
vantage ground, and he gives every energy of his 
soul to the great point of making the most of it. He 
will put up with none of your evasions. He will 
consent to none of your postponements. He will pay
respect to none of your more convenient seasons. He 
tells you, that the matter with which he is charged 
has all the urgency of a matter in hand. He speaks
to you with as much earnestness as if he knew that 
you were going to step into eternity in half an hour. 
He delivers his message with as much solemnity as if 
he knew that this was your last meeting on earth,
and that you were never to see each other till you 
stood together at the judgment-seat. He knew that 
some mighty change must take place in you ere you 
be fit for entering into the presence of God ; and that 
the time in which, on every plea of duty and of interest,
you should bestir yourselves to secure this, is the
present time. This is the distinct point he assigns to 
himself; and the whole drift of his argument is to 
urge an instantaneous choice of the better part, by 
telling you how you multiply every day the obstacles 
to your future repentance, if you begin not the work 
of repentance now. 

Before bringing our Essay to a close we shall make 
some observations on the mistakes concerning repent- 
ance, which we have endeavored to expose, and ad- 
duce some arguments for urging on the consciences of 
our readers the necessity and importance of immediate
repentance.

1. The work of repentance is a work which must 
be done ere we die ; for, unless we repent, we shall all 
likewise perish. Now, the easier this work is in our 
conception, we shall think it the less necessary to enter 
upon it immediately. We shall look upon it as a
work that may be done at any time, and therefore put 
it off a little longer, and a little longer. We shall, 
perhaps, look forward to that retirement from the 
world and its temptations which we figure old age to
bring along with it, and falling in with the too common
idea, that the evening of life is the appropriate
season of preparation for another world, we shall 
think that the author is bearing too closely and too 
urgently upon us, when, in the language of the Bible, 
he speaks of " to-day," while it is called to-day, and 
will let us off with no other repentance than repent- 
ance "now," seeing that now only is the accepted 
time, and now only the day of salvation, which he 
has a warrant to proclaim to us. This dilatory way 
of it is very much favored by the mistaken and very 
defective view of repentance which we have attempt- 
ed to expose. We have some how or other got into 
the delusion that repentance is nothing but sorrow; 
and were we called to fix upon the scene where this 
sorrow is likely to be felt in the degree that it is deep- 
est and most overwhelming, we would point to the 
chamber of the dying man. It is awful to think that, 
generally speaking, this repentance of mere sorrow is 
the only repentance of a death-bed. Yes ! we shall 
meet with sensibility deep enough and painful enough 
there — with regret in all its bitterness — with terror 
mustering up its images of despair, and dwelling 
upon them in all the gloom of an affrighted imagina- 
tion ; and this is mistaken, not merely for the drapery 
of repentance, but for the very substance of it. We 
look forward, and we count upon this — that the sins 
of a life are to be expunged by the sighing and sorrowing
of the last days of it. We should give up this
wretchedly superficial notion of repentance, and cease,
from this moment, to be led astray by it. The mind 
may sorrow over its corruptions at the very time that 
it is under the power of them. A man may weep
most bitterly over the perversities of his moral constitution;
but to change that constitution, under the
workings of the Holy Spirit, is a different affair.
Now, this is the mighty work of repentance. He who
has undergone it is no longer the servant of sin. He 
dies unto sin, he lives unto God. A sense of the au- 
thority of God is ever present with him, to wield the 
ascendancy of a great master-principle over all his 
movements — to call forth every purpose, and to carry 
it forward, through all the opposition of sin and of 
Satan, into accomplishment. This is the grand revolution
in the state of the mind which repentance
brings along with it. To grieve because this work is 
not done, is a very different thing from the doing of it. 
A deathbed is the very best scene for acting the first;
but it is the very worst for acting the second. The re- 
pentance of Judas has often been acted there. We 
ought to think of the work in all its magnitude, and 
not to put it off to that awful period when the soul is
crowded with other things, and has to maintain its 
weary struggle with the pains, and the distresses, 
and the shiverings, and the breathless agonies of a
deathbed. 

2. There are two views that may be taken of the 
way in which repentance is brought about, and which- 
ever of them is adopted, delay carries along with it 
the saddest infatuation. It may be looked upon as 
a step taken by man as a voluntary agent, and we 
would ask you, upon your experience of the powers 
and the performances of humanity, if a deathbed is 
the time for taking such a step? Is this a time for a 
voluntary being exercising a vigorous control over his 
own movements? When racked with pain, and borne 
down by the pressure of a sore and overwhelming
calamity ? Surely the greater the work of repentance
is, the more ease, the more time, the more freedom
from suffering, is necessary for carrying it on; and,
therefore, addressing you as voluntary beings, as 
beings who will and who do, we call upon you to seek 
God early that you may find him— to haste, and make 
no delay in keeping his commandments. 

The other view is, that repentance is not a self- 
originating work in man, but the work of the Holy 
Spirit in him as the subject of its influences. This 
view is not opposite to the former. It is true that man 
wills and does at every step in the business of his sal- 
vation; and it is as true that God works in him so to 
will and to do. Take this last view of it then. Look 
on repentance as the work of God's Spirit in the soul 
of man, and we are furnished with a more impressive 
argument than ever, and set on higher vantage for 
urging you to stir yourselves, and set about it im- 
mediately. What is it that you propose ? To keep 
by your present habits, and your present indulgences, 
and build yourselves up all the while in the confidence 
that the Spirit will interpose with his mighty power 
of conversion upon you, at the very point of time that 
you have fixed upon as convenient and agreeable? 
And how do you conciliate the Spirit's answer to your 
call then? Why, by doing all you can to grieve, and 
to quench, and to provoke him to abandon you now. 
Do you feel a motion toward repentance at this mo- 
ment? If you keep it alive, and act upon it, good and 
well. But if you smother and suppress this motion, 
you resist the Spirit — you stifle his movements within 
you ; it is what the impenitent do day after day, and 
year after year — and is this the way for securing the 
influences of the Spirit at the time that you would 
like them best? When you are done with the world,
and are looking forward to eternity because you can- 
not help it? God says, "My Spirit shall not always 
strive with man." A good and a free Spirit he un- 
doubtedly is, and, as a proof of it, he is now saying, 
"Let whosoever will, come and take of the water of 
life freely." He says so now, but we do not promise 
that he will say so with effect upon your deathbeds, 
if you refuse him now. You look forward then for a 
powerful work of conversion being done upon you, and 
yet you employ yourselves all your life long in raising 
and multiplying obstacles against it. You count upon
a miracle of grace before you die, and the way you 
take to make yourselves sure of it, is to grieve and 
offend him while you live, who alone can perform the 
miracle. O what cruel deceits will sin land us in ! 
and how artfully it pleads for a " little more sleep, and 
a little more slumber; a little more folding of the 
hands to sleep." We should hold out no longer, nor 
make such an abuse of the forbearance of God : we 
shall treasure up wrath against the day of wrath if 
we do so. The genuine effect of his goodness is to
lead us to repentance ; let not its effect upon us be to 
harden and encourage ourselves in the ways of sin. 
We should cry now for the clean heart and the right 
spirit; and such is the exceeding freeness of the Spirit 
of God, that we shall be listened to. If we put off the
cry till then, the same God may laugh at our calam- 
ity, and mock when our fear cometh. 

3. Our next argument for immediate repentance is, 
that we cannot bring forward, at any future period of
your history, any considerations of a more prevailing 
or more powerfully moving influence than those we 
may bring forward at this moment. We can tell you 
now of the terrors of the Lord, we can tell you now
of the solemn mandates which have issued from his 
throne — and the authority of which is upon one and 
all of you. We can tell you now, that, though, in 
this dead and darkened world, sin appears but a 
very trivial affair — for every body sins, and it is
shielded from execration by the universal countenance 
of an entire species lying in wickedness — yet it holds 
true of God, what is so emphatically said of him, that 
he cannot be mocked, nor will he endure it that you 
should riot in the impunity of your wilful resistance
to him and to his warnings. We can tell you now, 
that he is a God of vengeance ; and though, for a 
season, he is keeping back all the thunder of it from a 
world that he would reclaim unto himself, yet, if you 
put all his expostulations away from you, and will not 
be reclaimed, these thunders will be let loose upon 
you, and they will fall on your guilty heads, armed 
with tenfold energy, because you have not only defied 
his threats, but turned your back on his offers of re- 
conciliation. These are the arguments by which we 
would try to open our way to your consciences, and to 
awaken up your fears, and to put the inspiring activity 
of hope into your bosoms, by laying before you those 
invitations which are addressed to the sinner, through 
the peace-speaking blood of Jesus, and, in the name 
of a beseeching God, to win your acceptance of them. 
At no future period can we address arguments more 
powerful and more affecting than these. If these ar- 
guments do not prevail upon you, we know of none 
others by which a victory over the stubborn and un- 
complying will can be accomplished, or by which we 
can ever hope to beat in that sullen front of resistance 
wherewith you now so impregnably withstand us. 
We feel that, if any stout-hearted sinner shall rise
from the perusal of this "Call to the Unconverted" 
with an unawakened conscience, and give himself up 
to wilful disobedience — we feel as if, in reference to 
him, we had made our last discharge, and it fell 
powerless as water spilt on the ground, that cannot be 
gathered up again. Therefore it is that we speak to 
you now as if this was our last hold of you. We feel 
as if on your present purpose hung all the prepara- 
tions of your future life, and all the rewards or all the 
horrors of your coming eternity. We will not let you 
off with any other repentance than repentance now ; 
and if this be refused now, we cannot, with our eyes 
open to the consideration we have now urged, that 
the instrument we can make to bear upon you here- 
after is not more powerful than we are wielding now, 
coupled with another consideration which we shall
insist upon, that the subject on which the instrument 
worketh, even the heart of man, gathers, by every 
act of resistance, a more uncomplying obstinacy than 
before ; we cannot, with these two thoughts in our 
mind, look forward to your future history, without 
seeing spread over the whole path of it the iron of a 
harder impenitency — the sullen gloom of a deeper 
and more determined alienation. 

4. Another argument, therefore, for immediate re- 
pentance is, that the mind which resists a present call 
or a present reproof, undergoes a progressive hardening
toward all those considerations which arm the
call of repentance with all its energy. It is not enough 
to say, that the instrument by which repentance is 
brought about, is not more powerful to-morrow than 
it is to-day ; it lends a most tremendous weight to the 
argument, to say further, that the subject on which 
this instrument is putting forth its efficiency, will oppose
a firmer resistance to-morrow than it does to-day.
It is this which gives a significancy so powerful to the 
call of "To-day while it is to-day, harden not your 
hearts ;" and to the admonition of " Knowest thou not, 
O man, that the goodness of God leadeth thee to re- 
pentance; but after, thy hardness and impenitent 
heart treasurest up wrath against the day of wrath 
and revelation of the righteous judgments of God?"
It is not said, either in the one or in the other of these 
passages, that, by the present refusal, you cut your- 
self off from a future invitation. The invitation may 
be sounded in your hearing to the last half hour of 
your earthly existence, engraved in all those charac- 
ters of free and gratuitous kindness which mark the 
beneficent religion of the New Testament. But the 
present refusal hardens you against the power and 
tenderness of the future invitation. This is the fact 
in human nature to which these passages seem to 
point, and it is the fact through which the argument 
for immediate repentance receives such powerful aid 
from the wisdom of experience. It is this which forms 
the most impressive proof of the necessity of plying 
the young with all the weight and all the tenderness 
of earnest admonition, that the now susceptible mind 
might not turn into a substance harder and more un- 
complying than the rock which is broken in pieces 
by the powerful application of the hammer of the 
word of God. 

The metal of the human soul, so to speak, is like 
some material substances. If the force you lay upon 
it do not break it, or dissolve it, it will beat it into 
hardness. If the moral argument by which it is plied 
now, do not so soften the mind as to carry and to over- 
power its purposes, then, on another day, the argu-
ment may be put forth in terms as impressive — but it 
falls on a harder mind, and, therefore, with a more 
slender efficiency. If the threat, that ye who persist 
in sin shall have to dwell with the devouring fire, and
to lie down amid everlasting burnings, do not alarm 
you out of your iniquities from this very moment, then 
the same threat may be again cast out, and the same
appalling circumstances of terror be thrown around it, 
but it is all discharged on a soul hardened by its inure- 
ment to the thunder of denunciations already uttered, 
and the urgency of menacing threatenings already 
poured forth without fruit and without efficacy. If 
the voice of a beseeching God do not win upon you 
now, and charm you out of your rebellion against him, 
by the persuasive energy of kindness, then let that 
voice be lifted in your hearing on some future day, 
and though armed with all the power of tenderness 
it ever had, how shall it find its entrance into a heart 
sheathed by the operation of habit, that universal law,
in more impenetrable obstinacy? If, with the earliest
dawn of your understanding, you have been offered 
the hire of the morning laborer and have refused it, 
then the parable does not say that you are the person 
who at the third, or sixth, or ninth, or eleventh hour, 
will get the offer repeated to you. It is true, that the 
offer is unto all and upon all who are within reach of 
the hearing of it. But there is all the difference in 
the world between the impression of a new offer, and 
of an offer that has already been often heard and as 
often rejected — an offer which comes upon you with 
all the familiarity of a well-known sound that you 
have already learned how to dispose of, and how to 
shut your every feeling against the power of its gra- 
cious invitations — an offer which, if discarded from 
your hearts at the present moment, may come back
upon you, but which will have to maintain a more 
unequal contest than before, with an impenitency ever
strengthening, and ever gathering new hardness from 
each successive act of resistance. And thus it is that 
the point for which we are contending is not to carry
you at some future period of your lives, but to carry 
you at this moment. It is to work in you the instan- 
taneous purpose of a firm and a vigorously sustained 
repentance ; it is to put into you all the freshness of
an immediate resolution, and to stir you up to all the 
readiness of an immediate accomplishment — it is to 
give direction to the very first footstep you are now 
to take, and lead you to take it as the commencement 
of that holy career in which all old things are done 
away, and all things become new — it is to press it 
upon you, that the state of the alternative, at this mo- 
ment, is "now or never" — it is to prove how fearful 
the odds are against you, if now you suffer the call of
repentance to light upon your consciences, and still 
keep by your determined posture of careless, and 
thoughtless, and thankless unconcern about God. You 
have resisted to-day, and by that resistance you have 
acquired a firmer metal of resistance against the 
power of every future warning that may be brought 
to bear upon you. You have stood your ground 
against the urgency of the most earnest admonitions, 
and against the dreadfulness of the most terrifying 
menaces. On that ground you have fixed yourself
more immovably than before ; and though on some 
future day the same spiritual thunder be made to play 
around you, it will not shake you out of the obstinacy 
of your determined rebellion. 

It is the universal law of habit, that the feelings are 
always getting more faintly and feebly impressed by 
every repetition of the cause which excited them, and
that the mind is always getting stronger in its active
resistance to the impulse of these feelings, by every 
new deed of resistance which it performs ; and thus it 
is, that if you refuse us now, we have no other pros- 
pect before us than that your course is every day 
getting more desperate and more irrecoverable, your 
souls are getting more hardened, the Spirit is getting 
more provoked to abandon those who have so long
persisted in their opposition to his movements. God, 
who says that his Spirit shall not always strive with
man, is getting more offended. The tyranny of habit 
is getting every day a firmer ascendancy over you; 
Satan is getting you more helplessly involved among 
his wiles and his entanglements; the world, with all 
the inveteracy of those desires which are opposite to
the will of the Father, is more and more lording it 
over your every affection. And what, we would ask, 
what is the scene in which you are now purposing to 
contest it, with all this mighty force of opposition you 
are now so busy in raising up against you ? What is 
the field of combat to which you are now looking 
forward, as the place where you are to accomplish a 
victory over all those formidable enemies whom you 
are at present arming with such a weight of hostility, 
as, we say, within a single hairbreadth of certainty, 
you will find to be irresistible? O the bigness of such 
a misleading infatuation! The proposed scene in
which this battle for eternity is to be fought, and this
victory for the crown of glory is to be won, is a deathbed.
It is when the last messenger stands by the
couch of the dying man, and shakes at him the ter- 
rors of his grisly countenance, that the poor child of 
infatuation thinks he is to struggle and prevail against 
all his enemies; against the unrelenting tyranny of 
habit — against the obstinacy of his own heart, which
he is now doing so much to harden — against the
Spirit of God who perhaps long ere now has pronounced
the doom upon him, " He will take his own
way, and walk in his own counsel ; I shall cease from 
striving, and let him alone "—against Satan, to whom 
every day of his life he has given some fresh advan- 
tage over him, and who will not be willing to lose 
the victim on whom he has practised so many wiles,
and plied with success so many delusions. And such
are the enemies whom you, who wretchedly calculate 
on the repentance of the eleventh hour, are every day 
mustering up in greater force and formidableness
against you ; and how can we think of letting you 
go with any other repentance than the repentance of 
the precious moment that is now passing over you, 
when we look forward to the horrors of that impressive 
scene on which you propose to win the prize of im- 
mortality, and to contest it singlehanded and alone, 
with all the weight of opposition which you have 
accumulated against yourselves — a deathbed — a lan- 
guid, breathless, tossing, and agitated deathbed; that 
scene of feebleness, when the poor man cannot help 
himself to a single mouthful — when he must have 
attendants to sit around him, and watch his every 
wish, and interpret his every signal, and turn him to
every posture where he may find a moment's ease,
and wipe away the cold sweat that is running over
him — and ply him with cordials for thirst, and sick- 
ness, and insufferable languor. And this is the time,
when occupied with such feelings, and beset with
such agonies as these, you propose to crowd within
the compass of a few wretched days the work of
winding up the concerns of a neglected eternity! 

5. But it may be said, "If repentance be what you 
represent it, a thing of such mighty import, and such
impracticable performance, as a change of mind, in 
what rational way can it be made the subject of a 
precept or injunction? you would not call upon the 
Ethiopian to change his skin — you would not call 
upon the leopard to change his spots; and yet you call
upon us to change our minds. You say, " Repent ;" 
and that too in the face of the undeniable doctrine, that 
man is without strength for the achievement of so 
mighty an enterprise. Can you tell us any plain and 
practicable thing that you would have us to perform,
and that we may perform, to help on this business?" 
This is the very question with which the hearers of 
John the Baptist came back upon him, after he had 
told them in general terms to repent, and to bring forth 
fruits meet for repentance. He may not have resolved 
the difficulty, but he pointed the expectation of his 
countrymen to a greater than he for the solution of it. 
Now that Teacher has already come, and we live 
under the full and the finished splendor of his revela- 
tion. O that the greatness and difficulty of the work 
of repentance had the effect of shutting you up into 
the faith of Christ ! Repentance is not a paltry, super- 
ficial reformation. It reaches deep into the inner man, 
but not too deep for the searching influences of that 
Spirit which is at his giving, and which worketh 
mightily in the hearts of believers. You should go 
then under a sense of your difficulty to Him. Seek 
to be rooted in the Savior, that you may be nourished 
out of his fulness, and strengthened by his might.
The simple cry for a clean heart, and a right spirit, 
which is raised from the mouth of a believer, brings 
down an answer from on high which explains all the 
difficulty and overcomes it. And if what we have 
said of the extent and magnitude of repentance, should
have the effect to give a deeper feeling than before of
the wants under which you labor ; and shall dispose 
you to seek after a closer and more habitual union
with Him who alone can supply them, then will our 
call to repent have indeed fulfilled upon you the ap- 
pointed end of a preparation for the Savior. But re- 
collect now is your time, and now is your opportunity, 
for entering on the road of preparation that leads to 
heaven. We charge you to enter this road at this 
moment, as you value your deliverance from hell, and 
your possession of that blissful place where you shall 
be for ever with the Lord — we charge you not to 
parry and to delay this matter, no not for a single 
hour — we call on you by all that is great in eternity — 
by all that is terrifying in its horrors — by all that is
alluring in its rewards — by all that is binding in the 
authority of God — by all that is condemning in the 
severity of his violated law, and by all that can aggravate
this condemnation in the insulting contempt of
his rejected gospel ; — we call on you by one and all
of these considerations, not to hesitate, but to flee — 
not to purpose a return for to-morrow, but to make 
an actual return this very day — to put a decisive end 
to every plan of wickedness on which you may have
entered — to cease your hands from all that is forbidden —
to turn them to all that is required — to betake
yourselves to the appointed Mediator, and receive 
through him, by the prayer of faith, such constant
supplies of the washing of regeneration and renewing 
of the Holy Ghost, that, from this moment, you may 
be carried forward from one degree of grace unto 
another, and from a life devoted to God here, to the 
elevation of a triumphant, and the joys of a blissful
eternity hereafter. T. C.

St. Andrew's, October, 1825.



CONTENTS. 

The Text opened. 

Doctrine I. — It is the unchangeable law of God, that 
wicked men must turn or die — Proved. 

God will not be so unmerciful as to damn us — 
Answered. 

The Use. 

Who are wicked men, and what conversion is; and 
how we may know whether we are wicked or 
converted. 
Applied. 

Doct. II. — It is the promise of God that the wicked 
shall live, if they will but turn; unfeignedly and 
thoroughly turn — Proved. 

Doct. III. — God taketh pleasure in men's conversion 
and salvation, but not in their death or damnation. 
He had rather they would turn and live, than go on 
and die — Expounded — Proved. 

Doct. IV. — The Lord hath confirmed it to us by his 
oath, That he has no pleasure in the death of the 
wicked, but rather that he turn and live; that he 
may leave man no pretence to question the truth 
of it. 

Use. — Who is it, then, that takes pleasure in men's 
sin and death? — Not God, nor ministers, nor any 
good men. 

Doct. V. — So earnest is God for the conversion of 
sinners, that he doubleth his commands and 
exhortations with vehemency, "Turn ye, Turn ye," — 
Applied. 

Some motives to obey God's call, and turn. 

Doct. VI. — The Lord condescendeth to reason the 
case with unconverted sinners, and ask them, Why 
they will die? 

A strange disputation; — 1. For the question. 2. 
The disputants. 

Wicked men will die or destroy themselves. 
Use. — The sinner's case is certainly unreasonable. 
Their seeming reasons confuted. 

Question. — Why are men so unreasonable, and loath 
to turn, and will destroy themselves? — Answered. 

Doct. VII. — If after all this, men will not turn, it is 
not God's fault that they are condemned, but their 
own, even their own wilfulness. They die because 
they will; that is, because they will not turn. 

Use 1. — How unfit the wicked are to charge God 
with their damnation. It is not because God is 
unmerciful, but because they are cruel and 
merciless to themselves. 
Object. — We cannot convert ourselves, nor have 
we Free-will — Answered. 

Use 2. — The subtlety of Satan, the deceitfulness of 
sin, and the folly of sinners manifested. 

Use 3. — No wonder if the wicked would hinder the 
conversion and salvation of others. 

Use 4. — Man is the greatest enemy to himself. 

Man's destruction is of himself — Proved. 

The heinous aggravations of self-destroying. 

The concluding exhortation. 

Ten Directions for those who had rather turn than 
die. 



THE GREAT SUCCESS WHICH ATTENDED THE 
CALL WHEN FIRST PUBLISHED. 

It may be proper to prefix an account of this book given 
by Mr. Baxter himself, which was found in his study, after 
his death, in his own words: 

" I published a short treatise on conversion, entitled, A 
Call to the Unconverted. The occasion of this was my 
converse with Bishop Usher while I was at London; who, 
approving my method and directions for Peace of Con- 
science, was importunate with me to write directions 
suited to the various states of Christians, and also against 
particular sins. I reverenced the man, but disregarded 
these persuasions, supposing I could do nothing but what 
is done better already: but when he was dead, his words 
went deeper to my mind, and I purposed to obey his coun- 
sel; yet, so as that to the first sort of men, the ungodly, 
I thought vehement persuasions meeter than directions 
only, and so for such I published this little book, which 
God hath blessed with unexpected success, beyond all the 
rest that I have written, except The Saint's Rest. In a 
little more than a year there were about twenty thousand 
of them printed by my own consent, and about ten thou- 
sand since, beside many thousands by stolen impressions, 
which poor men stole for lucre's sake. Through God's 
mercy I have information of almost whole households 
converted by this small book which I set so light by; and, 
as if all this in England, Scotland, and Ireland, were not 
mercy enough to me, God, since I was silenced, hath sent 
it over on his message to many beyond the seas ; for when 
Mr. Elliot had printed all the Bible in the Indian language, 
he next translated this my Call to the Unconverted, as he 
wrote to us here. And yet God would make some farther 
use of it ; for Mr. Stoop, the pastor of the French Church 
in London, being driven hence by the displeasure of his 
superiors, was pleased to translate it into French. I hope 
it will not be unprofitable there; nor in Germany, where 
also it has been printed." 

It may be proper further to mention Dr. Bates' account 
of the author, and of this useful treatise. In his sermon 
at Mr. Baxter's funeral, he thus says: "His books of 
practical divinity have been effectual for more conver- 
sions of sinners to God than any printed in our time : and 
while the church remains on earth, will be of continual 
efficacy to recover lost souls. There is a vigorous pulse 
in them, that keeps the reader awake and attentive. His 
Call to the Unconverted, how small in bulk, but how 
powerful in virtue ! Truth speaks in it with that authority 
and efficacy, that it makes the reader to lay his hand upon 
his heart, and find that he has a soul and a conscience, 
though he lived before as if he had none. He told some 
friends, that six brothers were converted by reading that 
Call; and that every week he received letters of some 
converted by his books. This he spake with most humble 
thankfulness, that God was pleased to use him as an 
instrument for the salvation of souls." 



A CALL 
TO THE UNCONVERTED. 



EZEKIEL, XXXIII. 11. 

Say unto them, As I live, saith the Lord God, I have 
no pleasure in the death of the wicked; but that 
the wicked turn from his way and live: turn ye, 
turn ye from your evil ways; for why will ye die, 
O house of Israel? 

It hath been the astonishing wonder of many a 
man as well as me, to read in the Holy Scriptures how 
few will be saved, and that the greatest part even of 
those that are called, will be everlastingly shut out of 
the kingdom of heaven, and be tormented with the 
devils in eternal fire. Infidels believe not this when 
they read it, and therefore they must feel it ; those 
that do believe it are forced to cry out with Paul, 
(Rom. 11. 33,) " O the depth of the riches both of the 
wisdom and knowledge of God ! How unsearchable 
are his judgments, and his ways past finding out !" 
But nature itself doth teach us all to lay the blame 
of evil works upon the doers ; and therefore when we 
see any heinous thing done, a principle of justice doth 
provoke us to inquire after him that did it, that the 
evil of the work may return the evil of shame upon 
the author. If we saw a man killed and cut in pieces 
by the way, we would presently ask, Oh ! who did 
this cruel deed? If the town was wilfully set on fire, 
you would ask, what wicked wretch did this? So 
when we read that many souls will be miserable in 
hell for ever, we must needs think with ourselves, how 
comes this to pass? and whose fault is it? Who is it 
that is so cruel as to be the cause of such a thing as 
this? and we can meet with few that will own the 
guilt. It is indeed confessed by all, that Satan is the 
cause; but that doth not resolve the doubt, because 
he is not the principal cause. He doth not force men 
to sin, but tempts them to it, and leaves it to their 
own wills whether they will do it or not. He doth not 
carry men to an alehouse and force open their mouths 
and pour in the drink ; nor doth he hold them that 
they cannot go to God's service ; nor doth he force 
their hearts from holy thoughts. It lieth therefore 
between God himself and the sinner ; one of them 
must needs be the principal cause of all this misery, 
whichever it is, for there is no other to lay it upon; 
and God disclaimeth it ; he will not take it upon him ; 
and the wicked disclaim it usually, and they will not 
take it upon them, and this is the controversy that is 
here managing in my text. 

The Lord complaineth of the people ; and the peo- 
ple think it is the fault of God. The same controversy 
is handled, chap. 18. 25: they plainly say, " that the 
way of the Lord is not equal." So here they say, 
verse 19, " If our transgressions and our sins be upon 
us, and we pine away in them, how shall we then 
live?" As if they should say, if we must die, and be 
miserable, how can we help it ? as if it were not their 
fault, but God's. But God, in my text, doth clear 
himself of it, and telleth them how they may help it 
if they will, and persuadeth them to use the means, 
and if they will not be persuaded, he lets them know 
that it is the fault of themselves ; and if this will not 
satisfy them, he will not forbear to punish them. It is 
he that will be the Judge, and he will judge them 
according to their ways ; they are no judge of him 
or of themselves, as wanting authority, and wisdom, 
and impartiality ; nor is it the cavilling and quarrelling 
with God that shall serve their turn, or save them 
from the execution of justice, at which they murmur. 
The words of this verse contain, 1. God's purgation 
or clearing himself from the blame of their destruction. 
This he doth not by disowning his law, that the 
wicked shall die, nor by disowning his judgments and 
execution according to that law, or giving them any 
hope that the law shall not be executed ; but by pro- 
fessing that it is not their death that he takes pleasure 
in, but their returning rather, that they may live ; and 
this he confirmeth to them by his oath. 2. An express 
exhortation to the wicked to return; wherein 
God doth not only command, but persuade and con- 
descend also to reason the case with them; Why will 
they die ? The direct end of this exhortation is, that 
they may turn and live. The secondary or reserved 
ends, upon supposition that this is not attained, are 
these two : First, To convince them by the means 
which he used, that it is not the fault of God if they 
be miserable. Secondly, To convince them from 
their manifest wilfulness in rejecting all his commands 
and persuasions, that it is the fault of themselves, and 
they die, even because they will die. 

The substance of the text doth lie in these observa- 
tions following : — 

Doctrine 1. It is the unchangeable law of God, that 
wicked men must turn or die. 

Doctrine 2. It is the promise of God, that the wicked 
shall live, if they will but turn. 

Doctrine 3. God takes pleasure in men's conversion 
and salvation, but not in their death or damnation: he 
had rather they would return and live, than go on 
and die. 

Doctrine 4. This is a most certain truth, which 
because God would not have men to question, he hath 
confirmed it to them solemnly by his oath. 

Doctrine 5. The Lord doth redouble his commands 
and persuasions to the wicked to turn. 

Doctrine 6. The Lord condescendeth to reason the 
case with them ; and asketh the wicked why they 
will die? 

Doctrine 7. If after all this the wicked will not turn, 
it is not the fault of God that they perish, but of them- 
selves; their own wilfulness is the cause of their 
own damnation ; they therefore die because they 
will die. 

Having laid the text open in these propositions, I 
shall next speak somewhat of each of them in order, 
though briefly. 

DOCTRINE I. 

It is the unchangeable law of God, that wicked 
men must turn, or die. 

If you will believe God, believe this : there is but 
one of these two ways for every wicked man, either 
conversion or damnation. I know the wicked will 
hardly be persuaded either of the truth or equity of 
this. No wonder if the guilty quarrel with the law. 
Few men are apt to believe that which they would 
not have to be true, and fewer would have that to be 
true which they apprehended to be against them. But 
it is not quarrelling with the law, or with the judge, 
that will save the malefactor. Believing and regard- 
ing the law, might have prevented his death ; but 
denying and accusing it will but hasten it. If it were 
not so, a hundred would bring their reason against the 
law, for one that would bring his reason to the law, 
and men would rather choose to give their reasons 
why they should not be punished, than to hear the 
commands and reasons of their governors which re- 
quire them to obey. The law was not made for you to 
judge, but that you might be ruled and judged by it. 

But if there be any so blind as to venture to ques- 
tion either the truth or the justice of this law of God, 
I shall briefly give you that evidence of both which 
methinks, should satisfy a reasonable man. 

And first, if you doubt whether this be the word of 
God, or not, besides a hundred other texts, you may 
be satisfied by these few:— Matt. 18: 3. "Verily I 
say unto you, except ye be converted and become as 
little children, ye cannot enter into the kingdom of 
God." John 3:3. " Verily, verily, I say unto you, 
except a man be born again he cannot see the kingdom 
of God." 2 Cor. 5: 17. " If any man be in Christ, 
he is a new creature ; old things are passed away ; 
behold, all things are become new." Col. 3: 9, 10. 
"Ye have put off the old man with his deeds, and 
have put on the new man, which is renewed in knowledge 
after the image of him that created him." Heb. 
12: 14. " Without holiness no man shall see the 
Lord." Rom. 8: 8, 9. "So then they that are in the 
flesh cannot please God. Now if any man have not the 
spirit of Christ, he is none of his." Gal. 6: 15. "For 
in Christ Jesus neither circumcision availeth any 
thing, nor uncircumcision, but a new creature." 1 Pet. 
1: 3. " According to his abundant grace he hath begotten 
us to a lively hope." Ver. 23. "Being born 
again, not of corruptible seed, but of incorruptible, by 
the word of God, which liveth and abideth for ever." 
1 Pet. 2: 1, 2. "Wherefore laying aside all malice, 
and all guile, and hypocrisies, and envies, and evil 
speaking, as new born babes, desire the sincere milk 
of the word, that ye may grow thereby." Psalm 9: 
17. " The wicked shall be turned into hell, and all the 
nations that forget God." Psalm 11: 4. "And the 
Lord loveth the righteous, but the wicked his soul 
hateth." 

As I need not stay to open these texts which are 
so plain, so I think I need not add any more of that 
multitude which speak the like. If thou be a man 
that dost believe the Word of God, here is already 
enough to satisfy thee that the wicked must be con- 
verted or condemned. You are already brought so 
far, that you must either confess that this is true, or 
say plainly, you will not believe the word of God. 
And if once you be come to that pass, there is but 
small hopes of you : look to yourself as well as you 
can, for it is like you will not be long out of hell. You 
would be ready to fly in the face of him that should 
give you the lie ; and yet dare you give the lie to 
God? But if you tell God plainly you will not believe 
him, blame him not if he never warn you more, or if 
he forsake you, and give you up as hopeless ; for to 
what purpose should he warn you, if you will not be- 
lieve him ? Should he send an angel from heaven to 
you, it seems you would not believe. For an angel 
can speak but the word of God ; and if an angel should 
bring you any other gospel, you are not to receive it 
but to hold him accursed. Gal. 1 : 8. And surely there 
is no angel to be believed before the Son of God, who 
came from the Father to bring us this doctrine. If He 
be not to be believed, then all the angels in heaven 
are not to be believed. And if you stand on these 
terms with God, I shall leave you till he deal with you 
in a more convincing way. God hath a voice that 
will make you hear. Though he entreat you to hear 
the voice of his gospel, he will make you hear the 
voice of his condemning sentence, without entreaty. 
We cannot make you believe against your wills ; but 
God will make you feel against your wills. 

But let us hear what reason you have why you will 
not believe this word of God, which tells us that the 
wicked must be converted, or condemned. I know 
your reason ; it is because that you judge it unlikely 
that God should be so unmerciful : you think it cruelty 
to damn men everlastingly for so small a thing as a 
sinful life. And this leads us to the second thing, 
which is to justify the equity of God in his laws and 
judgments. 

And first, I think you will not deny that it is most 
suitable to an immortal soul to be ruled by laws that 
promise an immortal reward, and threaten an endless 
punishment. Otherwise the law should not be suited 
to the nature of the subject, who will not be fully 
ruled by any lower means than the hopes or fears of 
everlasting things : as it is in cases of temporal punishment, 
if a law were now made that the most heinous 
crimes shall be punished with a hundred years' 
captivity, this might be of some efficacy, as being 
equal to our lives. But, if there had been no other 
penalties before the flood, when men lived eight or 
nine hundred years, it would not have been sufficient, 
because men would know that they might have so 
many hundred years impunity afterward. So it is 
in our present case. 

2. I suppose that you will confess, that the promise 
of an endless and inconceivable glory is not so unsuitable 
to the wisdom of God, or the case of man : and 
why then should you not think so of the threatening 
of an endless and unspeakable misery ! 

3. When you find it in the word of God that so it 
is, and so it will be, do ye think yourselves fit to con- 
tradict this word ? Will you call your Maker to the 
bar, and examine his word upon the accusation of 
falsehood? Will you sit upon him and judge him by 
the law of your conceits ? Are you wiser, and better, 
and more righteous than he? Must the God of heaven 
come to school to you to learn wisdom ? Must Infinite 
Wisdom learn of folly, and Infinite Goodness be cor- 
rected by a sinner that cannot keep himself an hour 
clean? Must the Almighty stand at the bar of a 
worm? O horrid arrogancy of senseless dust! shall 
ever mole, or clod, or dunghill, accuse the sun of dark- 
ness, and undertake to illuminate the world ? Where 
were you when the Almighty made the laws, that 
he did not call you to his counsel ? Surely he made 
them before you were born, without desiring your 
advice ; and you came into the world too late to re- 
verse them, if you could have done so great a work. 
You should have stepped out of your nothingness and 
have contradicted Christ when he was on earth, or 
Moses before him, or have saved Adam and his sinful 
progeny from the threatened death, that so there 
might have been no need of Christ. And what if 
God withdraw his patience and sustaining power, and 
let you drop into hell while you are quarrelling with 
his word, will you then believe that there is a hell ? 

4. If sin be such an evil that it requireth the death 
of Christ for its expiation, no wonder if it deserve our 
everlasting misery. 

5. And if the sin of the devils deserved an endless 
torment, why not also the sin of man ? 

6. And methinks you should perceive that it is not 
possible for the best of men, much less for the wicked, 
to be competent judges of the desert of sin. Alas ! we 
are both blind and partial. You can never know fully 
the desert of sin, till you fully know the evil of sin; 
and you can never fully know the evil of sin, till you 
fully know, 1. The excellency of the soul which it 
deformeth. 2. And the excellency of holiness which 
it obliterates. 3. The reason and excellency of the 
law which it violates. 4. The excellency of the 
glory which it despises. 5. The excellency and of- 
fice of reason which it treadeth down. 6. No, nor till 
you know the infinite excellency, almightiness and 
holiness of that God against whom it is committed. 
When you fully know all these, you shall fully know 
the desert of sin besides. You know that the offender 
is too partial to judge the law, or the proceeding of 
his judge. We judge by feeling which blinds our 
reason. We see, in common worldly things, that most 
men think the cause is right which is their own, and 
that all is wrong that is done against them ; and let 
the most wise or just impartial friends persuade them 
to the contrary, and it is all in vain. There are few 
children but think the father is unmerciful, or dealeth 
hardly with them if he whip them. There is scarce 
the vilest wretch but thinketh the church doth wrong 
him if they excommunicate him : or scarce a thief or 
murderer that is hanged, but would accuse the law 
and judge of cruelty, if that would serve their turn. 

7. Can you think that an unholy soul is fit for 
heaven? Alas, they cannot love God here, nor do him 
any service which he can accept. They are contrary 
to God, they loathe that which he most loveth, and 
love that which he abhorreth. They are incapable 
of that imperfect communion with Him which his 
saints here partake of. How then can they live in 
that perfect love of him, and full delight and communion 
with him, which is the blessedness of heaven? 
You do not accuse yourselves of unmercifulness, if 
you make not your enemy your bosom counsellor ; or 
if you take not your swine to bed and board with you : 
no, nor if you take away his life though he never sin- 
ned ; and yet you will blame the absolute Lord, the 
most wise and gracious Sovereign of the world, if he 
condemn the unconverted to perpetual misery. 

Use. — I beseech you now, all that love your souls, 
that, instead of quarrelling with God and with his 
word, you will presently receive it, and use it for your 
good. All you that are yet unconverted, take this as the 
undoubted truth of God : — You must, ere long, be con- 
verted or condemned ; there is no other way but to 
turn, or die. When God, that cannot lie, hath told 
you this; when you hear it from the Maker and 
Judge of the world, it is time for him that hath ears, 
to hear. By this time you may see what you have 
to trust to. You are but dead and damned men, ex- 
cept you will be converted. Should I tell you other- 
wise, I should deceive you with a lie. Should I hide 
this from you, I should undo you, and be guilty of your 
blood, as the verses before my text assure me. — Verse 
8. " When I say to the wicked man, O wicked man, 
thou shalt surely die ; if thou dost not speak to warn 
the wicked from his way, that wicked man shall die in 
his iniquity; but his blood will I require at thine 
hand." You see then, though this be a rough and 
unwelcome doctrine, it is such as we must preach, and 
you must hear. It is easier to hear of hell than feel 
it. If your necessities did not require it, we would 
not gall your tender ears with truths that seem so 
harsh and grievous. Hell would not be so full, if peo- 
ple were but willing to know their case, and to hear 
and think of it. The reason why so few escape it, is 
because they strive not to enter in at the strait gate of 
conversion, and go the narrow way of holiness, while 
they have time : and they strive not, because they are 
not awakened to a lively feeling of the danger they 
are in ; and they are not awakened because they are 
loth to hear or think of it : and that is partly through 
foolish tenderness and carnal self-love, and partly be- 
cause they do not well believe the word that threat- 
ened it. If you will not thoroughly believe this truth, 
methinks the weight of it should force you to remember 
it, and it should follow you, and give you no rest 
till you are converted. If you had but once heard 
this word by the voice of an angel, " Thou must be 
converted, or condemned : turn, or die :" would it not 
stick in your mind, and haunt you night and day? so 
that in your sinning you would remember it, as if the 
voice were still in your ears, " Turn, or die !" O hap- 
py were your soul if it might thus work with you and 
never be forgotten, or let you alone till it have driven 
home your heart to God. But if you will cast it out 
by forgetfulness or unbelief, how can it work to your 
conversion and salvation ? But take this with you to 
your sorrow, though you may put this out of your 
mind, you cannot put it out of the Bible, but there 
it will stand as a sealed truth, which you shall expe- 
rimentally know for ever, that there is no other way 
but, "turn, or die." 

What is the matter then that the hearts of sinners 
are not pierced with such a weighty truth ? A 
man would think now, that every unconverted soul 
that hears these words should be pricked to the heart, 
and think with himself, ' This is my own case,' and 
never be quiet till he found himself converted. Believe 
it, this drowsy careless temper will not last long. Conversion 
and condemnation are both of them awakening 
things, and one of them will make you feel ere 
long. I can foretell it as truly as if I saw it with my 
eyes, that either grace or hell will shortly bring these 
matters to the quick, and make you say, " What have 
I done? what a foolish wicked course have I taken?" 
The scornful and the stupid state of sinners will last 
but a little while : as soon as they either turn or die, 
the presumptuous dream will be at an end, and then 
their wits and feeling will return. 

But I foresee there are two things that are likely to 
harden the unconverted, and make me lose all my 
labor, except they can be taken out of the way ; and 
that is the misunderstanding on those two words, the 
wicked and turn. Some will think to themselves, 
' It is true, the wicked must turn or die ; but what is 
that to me, I am not wicked ; though I am a sinner, 
all men are.' Others will think, ' It is true that we 
must turn from our evil ways, but I am turned long 
ago ; I hope this is not now to do.' And thus while 
wicked men think they are not wicked, but are al- 
ready converted, we lose all our labor in persuading 
them to turn. I shall therefore, before I go any further, 
tell you here who are meant by the wicked ; 
and who they are that must turn or die; and also 
what is meant by turning, and who they are that are 
truly converted. And this I have purposely reserved 
for this place, preferring the method that fits my end. 

And here you may observe, that in the sense of the 
text, a wicked man and a converted man are contra- 
ries. No man is a wicked man that is converted ; and 
no man is a converted man that is wicked ; so that to 
be a wicked man and to be an unconverted man, is 
all one ; and therefore in opening one, we shall open 
both. 

Before I can tell you what either wickedness or conversion 
is, I must go to the bottom, and fetch up the 
matter from the beginning. 

It pleased the great Creator of the world to make 
three sorts of living creatures. Angels he made pure 
spirits without flesh, and therefore he made them only 
for heaven, and not to dwell on earth. Brutes were 
made flesh, without immortal souls, and therefore 
they were made only for earth, and not for heaven. 
Man is of a middle nature, between both, as partak- 
ing of both flesh and spirit, and therefore he was made 
both for heaven and earth. But as his flesh is made 
to be but a servant to his spirit, so is he made for earth 
but as his passage or way to heaven, and not that this 
should be his home or happiness. The blessed state 
that man was made for, was to behold the glorious 
majesty of the Lord, and to praise him among his 
Holy Angels, and to love him, and to be filled with 
his love for ever. And as this was the end that man 
was made for, so God did give him means that were 
fitted to the attaining of it. These means were prin- 
cipally two : First, the right inclination and disposi- 
tion of the mind of man. Secondly, The right order- 
ing of his life and practice. For the first, God suited 
the disposition of man unto his end, giving him such 
knowledge of God as was fit for his present state, and 
a heart disposed and inclined to God in holy love. But 
yet he did not fix or confirm him in this condition, but, 
having made him a free agent, he left him in the 
hands of his own free will. For the second, God did 
that which belonged to him ; that is, he gave him a 
perfect law, required him to continue in the love of 
God, and perfectly to obey him. By the wilful breach 
of this law, man did not only forfeit his hopes of ever- 
lasting life, but also turned his heart from God, and 
fixed it on these lower fleshly things, and hereby blot- 
ted out the spiritual image of God from his soul ; so 
that man did both fall short of the glory of God, which 
was his end, and put himself out of the way by which 
he should have attained it, and this both as to the 
frame of his heart, and of his life. The holy inclina- 
tion and love of his soul to God, he lost, and instead 
of it he contracted an inclination and love to the plea- 
sing of his flesh, or carnal self, by earthly things ; 
growing strange to God and acquainted with the 
creature. And the course of this life was suited to 
the bent and inclination of his heart ; he lived to his 
carnal self, and not to God ; he sought the creature, 
for the pleasing of his flesh, instead of seeking to please 
the Lord. With this nature or corrupt inclination, 
we are all now born into the world ; " for who can 
bring a clean thing out of an unclean ?" Job, 14 : 4. 
As a lion hath a fierce and cruel nature before he doth 
devour; and an adder hath a venomous nature before 
she sting, so in our infancy we have those sinful na- 
tures or inclinations, before we think, or speak, or do 
amiss. And hence springeth all the sin of our lives; 
and not only so, but when God hath, of his mercy, pro- 
vided us a remedy, even the Lord Jesus Christ, to be 
the Savior of our souls, and bring us back to God 
again, we naturally love our present state, and are 
loth to be brought out of it, and therefore are set 
against the means of our recovery: and though cus- 
tom hath taught us to thank Christ for his good-will, 
yet carnal self persuades us to refuse his remedies, and 
to desire to be excused when we are commanded to 
take the medicines which he offers, and are called to 
forsake all and follow him to God and glory. 

I pray you read over this leaf again, and mark it ; 
for in these few words you have a true description of 
our natural state, and consequently of wicked man ; 
for every man that is in the state of corrupted nature 
is a wicked man, and in a state of death. 

By this also you are prepared to understand what 
it is to be converted : to which end you must further 
know, that the mercy of God, not willing that man 
should perish in his sin, provided a remedy, by caus- 
ing his Son to take our nature, and being, in one per- 
son, God and man, to become a mediator between 
God and man ; and by dying for our sins on the cross, 
to ransom us from the curse of God and the power of 
the devil. And having thus redeemed us, the Father 
hath delivered us into his hands as his own. Here- 
upon the Father and the Mediator do make a new 
law and covenant for man, not like the first, which 
gave life to none but the perfectly obedient, and con- 
demned man for every sin ; but Christ hath made a 
law of grace, or a promise of pardon and everlasting 
life to all that, by true repentance, and by faith in 
Christ, are converted unto God ; like an act of oblivion, 
which is made by a prince to a company of rebels, on 
condition they will lay down their arms and come in 
and be loyal subjects for the time to come. 

But, because the Lord knoweth that the heart of 
man is grown so wicked, that, for all this, men will 
not accept of the remedy if they be left to themselves, 
therefore the Holy Ghost hath undertaken it as his 
office to inspire the Apostles, and seal the Scriptures 
by miracles and wonders, and to illuminate and con- 
vert the souls of the elect. 

So by this much you see, that as there are three 
persons in the Trinity, the Father, the Son, and the 
Holy Ghost, so each of these persons have their several 
works, which are eminently ascribed to them. 

The Father's works were, to create us, to rule us, 
as his rational creatures, by the law of nature, and 
judge us thereby; and in mercy to provide us a Re- 
deemer when we were lost ; and to send his Son, and 
accept his ransom. 

The works of the Son for us were these : to ransom 
and redeem us by his suffering and righteousness ; to 
give out the promise or law of grace, and rule and 
judge the world as their Redeemer, on terms of grace : 
and to make intercession for us, that the benefits of his 
death may be communicated ; and to send the Holy 
Ghost, which the Father also doth by the Son. 

The works of the Holy Ghost, for us, are these : to 
indite the Holy Scriptures, by inspiring and guiding 
the Apostles, and sealing the word, by his miraculous 
gifts and works, and the illuminating and exciting the 
ordinary ministers of the gospel, and so enabling them 
and helping them to publish that word; and by the 
same word illuminating and converting the souls of 
men. So that as you could not have been reasonable 
creatures, if the Father had not created you, nor have 
had any access to God, if the Son had not died, so 
neither can you have a part in Christ, or be saved, 
except the Holy Ghost do sanctify you. 

So that by this time you may see the several causes 
of this work. The Father sendeth the Son : the Son 
redeemeth us and maketh the promise of grace : the 
Holy Ghost inditeth and sealeth this Gospel: the 
Apostles are the secretaries of the Spirit to write it: 
the preachers of the Gospel to proclaim it, and per- 
suade men to open it : and the Holy Ghost doth make 
their preaching effectual, by opening the hearts of 
men to entertain it. And all this to repair the image 
of God upon the soul, and to set the heart upon God 
again, and take it off the creature and carnal self to 
which it is revolted, and so to turn the current of the 
life into a heavenly course, which before was earthly ; 
and through this, embracing Christ by faith, who is 
the Physician of the soul. 

By what I have said, you may see what it is to be 
wicked, and what it is to be converted ; which, I think, 
will yet be plainer to you, if I describe them as consisting 
of their several parts. And for the first, a wicked 
man may be known by these three things : 

First, He is one who placeth his chief affections on 
earth, and loveth the creature more than God, and 
his fleshly prosperity above the heavenly felicity. He 
savoreth the things of the flesh, but neither discerneth 
nor savoreth the things of the Spirit; though he 
will say, that heaven is better than earth, yet he doth 
not really so esteem it to himself. If he might be sure 
of earth, he would let go heaven, and had rather stay 
here than be removed thither. A life of perfect holiness 
in the sight of God, and in his love and praises 
for ever in heaven, doth not find such liking with his 
heart as a life of health, and wealth, and honor here 
upon earth. And though he falsely profess that he 
loves God above all, yet indeed he never felt the power 
of divine love within him, but his mind is more set on 
worldly or fleshly pleasures than on God. In a word, 
whoever loves earth above heaven, and fleshly pros- 
perity more than God, is a wicked unconverted man. 

On the other hand, a converted man is illuminated 
to discern the loveliness of God, and so far believeth 
the glory that is to be had with God, that his heart 
is taken up with it and set more upon it than any 
thing in this world. He had rather see the face of 
God, and live in his everlasting love and praises, than 
have all the wealth or pleasures of the world. He 
seeth that all things else are vanity, and nothing but 
God can fill the soul ; and therefore let the world go 
which way it will, he layeth up his treasures and 
hopes in heaven, and for that he is resolved to let go 
all. As the fire doth mount upward, and the needle 
that is touched with the loadstone still turns to the 
north, so the converted soul is inclined unto God. No- 
thing else can satisfy him : nor can he find any con- 
tent and rest but in his love. In a word, all that are 
converted do esteem and love God better than all the 
world, and the heavenly felicity is dearer to them 
than their fleshly prosperity. The proof of what I 
have said you may find in these places of Scriptures: 
Phil. 3: 18, 21. Matt. 6 : 19, 20, 21. Col. 3 : 1, 4. 
Rom. 8 : 5, 9, 18, 23. Psalm 73 : 25, 26. 

Secondly, A wicked man is one that makes it the 
principal business of his life to prosper in the world, 
and attain his fleshly ends. And though he may read, 
and hear, and do much in the outward duties of religion, 
and forbear disgraceful sins, yet this is all but 
by-the-by, and he never makes it the principal busi- 
ness of his life to please God, and attain everlast- 
ing glory, and puts off God with the leavings of the 
world, and gives him no more service than the flesh 
can spare, for he will not part with all for heaven. 

On the contrary, a converted man is one that makes 
it the principal care and business of his life to please 
God, and to be saved, and takes all the blessings of 
this life but as accommodations in his journey toward 
another life, and useth the creature in subordination 
to God ; he loves a holy life, and longs to be more 
holy ; he hath no sin but what he hateth, and longeth, 
and prayeth, and striveth to be rid of. The drift and 
bent of his life is for God, and if he sin, it is contrary 
to the very bent of his heart and life ; and therefore he 
riseth again and lamenteth it, and dares not wilfully 
live in any known sin. There is nothing in this world 
so dear to him but he can give it up to God, and forsake 
it for him and the hopes of glory. All this you 
may see in Col. 3 : 1, 5. Matt. 6 : 20, 33. Luke, 18 : 
22, 23, 29. Luke, 14 : 18, 24, 26, 27. Rom. 8 : 13. 
Gal. 5 : 24. Luke 12 : 21, &c. 

Thirdly, The soul of a wicked man did never truly 
discern and relish the mystery of redemption, nor 
thankfully entertain an offered Savior, nor is he taken 
up with the love of the Redeemer, nor willing to be 
ruled by him as the Physician of his soul, that he may 
be saved from the guilt and power of his sins, and re- 
covered to God ; but his heart is insensible of this un- 
speakable benefit, and is quite against the healing 
means by which he should be recovered. Though he 
may be willing to be outwardly religious, yet he never 
resigns up his soul to Christ, and to the motions and 
conduct of his word and Spirit. 

On the contrary, the converted soul having felt 
himself undone by sin, and perceiving that he hath 
lost his peace with God and hopes of heaven, and is in 
danger of everlasting misery, doth thankfully entertain 
the tidings of redemption, and believing in the 
Lord Jesus as his only Savior, resigns himself up to 
him for wisdom, righteousness, sanctification, and re- 
demption. He takes Christ as the life of his soul, and 
lives by him, and uses him as a salve for every sore, 
admiring the wisdom and love of God in this wonder- 
ful work of man's redemption. In a word, Christ doth 
even dwell in his heart by faith, and the life that he 
now liveth, is by the faith of the Son of God, that 
loved him, and gave himself for him ; yea, it is not so 
much he that liveth, as Christ in him. For these, 
see Job, 1 : 11, 12; and 3 : 19, 20. Rom. 8 : 9. Phil. 
3 : 7, 10. Gal. 2 : 20. Job, 15 : 2, 3, 4. 1 Cor. 1 : 20. 
2:2. 

You see now, in plain terms from the Word of God, 
who are the wicked and who are the converted. Igno- 
rant people think, that if a man be no swearer, nor 
curser, nor railer, nor drunkard, nor fornicator, nor ex- 
tortioner, nor wrong any body in his dealings, and if 
he come to church and say his prayers, he cannot be 
a wicked man. Or if a man that hath been guilty 
of drunkenness, swearing, or gaming, or the like vices, 
do but forbear them for the time to come, they think 
that this is a converted man. Others think if a man 
that hath been an enemy, and scorner at godliness, 
do but approve it, and be hated for it by the wicked, 
as the godly are, that this must needs be a converted 
man. And some are so foolish as to think that they 
are converted by taking up some new opinion, and 
falling into some dividing party. And some think, 
if they have but been affrighted by the fears of hell, 
and had convictions of conscience, and thereupon 
have purposed and promised amendment, and take up 
a life of civil behavior and outward religion, that this 
must needs be true conversion. And these are the 
poor deluded souls that are like to lose the benefit of 
all our persuasions ; and when they hear that the 
wicked must turn or die, they think that this is not 
spoken to them, for they are not wicked, but are turned 
already. And therefore it is that Christ told some of 
the rulers of the Jews who were greater and more 
civil than the common people, that " publicans and 
harlots go into the kingdom of Christ before them." 
Matt. 21 : 31. Not that a harlot, or gross sinner can 
be saved without conversion ; but because it was easier 
to make these gross sinners perceive their sin and mi- 
sery, and the necessity of a change, than the more 
civil sort, who delude themselves by thinking that 
they are converted already, when they are not. 

O sirs, conversion is another kind of work than most 
are aware of. It is not a small matter to bring an 
earthly mind to heaven, and to show man the amiable 
excellence of God, till he be taken up in such love to 
him that can never be quenched ; to break the heart 
for sin, and make him fly for refuge to Christ, and 
thankfully embrace him as the life of his soul ; to have 
the very drift and bent of the heart and life changed ; 
so that a man renounceth that which he took for his 
felicity, and placeth his felicity where he never did 
before, and lives not to the same end, and drives not 
on the same design in the world, as he formerly did. 
In a word, he that is in Christ is a " new creature : 
old things are passed away : behold, all things are 
become new." 2 Cor. 5 : 17. He hath a new under- 
standing, a new will and resolution, new sorrows, and 
desires, and love, and delight; new thoughts, new 
speeches, new company, (if possible,) and a new conversation. 
Sin, that before was a jesting matter with 
him, is now so odious and terrible to him that he flies 
from it as from death. The world, that was so lovely 
in his eyes, doth now appear but as vanity and vexa- 
tion : God, that was before neglected, is now the only 
happiness of his soul : before he was forgotten, and 
every lust preferred before him, but now he is set next 
the heart, and all things must give place to him ; the 
heart is taken u
gitextract_vbmxaw27/

├── .Rbuildignore
├── .gitignore
├── .travis.yml
├── CONDUCT.md
├── DESCRIPTION
├── LICENSE
├── Makefile
├── NAMESPACE
├── NEWS.md
├── R/
│   ├── RcppExports.R
│   ├── TextReuseCorpus.R
│   ├── TextReuseTextDocument.R
│   ├── align_local.R
│   ├── conversion-functions.R
│   ├── filenames.R
│   ├── lsh.R
│   ├── lsh_candidates.R
│   ├── lsh_compare.R
│   ├── lsh_probability.R
│   ├── lsh_query.R
│   ├── lsh_subset.R
│   ├── minhash.R
│   ├── pairwise_candidates.R
│   ├── pairwise_compare.R
│   ├── parallel.R
│   ├── rehash.R
│   ├── similarity.R
│   ├── textreuse-package.r
│   ├── token_index.R
│   ├── tokenize.R
│   ├── tokenizers.R
│   ├── utils.R
│   └── wordcount.R
├── README.Rmd
├── README.md
├── _pkgdown.yml
├── appveyor.yml
├── cran-comments.md
├── inst/
│   └── extdata/
│       ├── ats/
│       │   ├── calltounconv00baxt.txt
│       │   ├── gospeltruth00whit.txt
│       │   ├── lifeofrevrichard00baxt.txt
│       │   ├── memoirjamesbrai00ricegoog.txt
│       │   ├── practicalthought00nev.txt
│       │   ├── remember00palm.txt
│       │   ├── remembermeorholy00palm.txt
│       │   └── thoughtsonpopery00nevi.txt
│       └── legal/
│           ├── ca1851-match.txt
│           ├── ca1851-nomatch.txt
│           └── ny1850-match.txt
├── man/
│   ├── TextReuseCorpus.Rd
│   ├── TextReuseTextDocument-accessors.Rd
│   ├── TextReuseTextDocument.Rd
│   ├── align_local.Rd
│   ├── as.matrix.textreuse_candidates.Rd
│   ├── filenames.Rd
│   ├── hash_string.Rd
│   ├── lsh.Rd
│   ├── lsh_add.Rd
│   ├── lsh_candidates.Rd
│   ├── lsh_compare.Rd
│   ├── lsh_probability.Rd
│   ├── lsh_query.Rd
│   ├── lsh_subset.Rd
│   ├── minhash_generator.Rd
│   ├── pairwise_candidates.Rd
│   ├── pairwise_compare.Rd
│   ├── reexports.Rd
│   ├── rehash.Rd
│   ├── similarity-functions.Rd
│   ├── textreuse-package.Rd
│   ├── token_index.Rd
│   ├── token_index_candidates.Rd
│   ├── tokenize.Rd
│   ├── tokenizers.Rd
│   └── wordcount.Rd
├── pkgdown/
│   └── extra.css
├── src/
│   ├── RcppExports.cpp
│   ├── hash_string.cpp
│   ├── shingle_ngrams.cpp
│   ├── skip_ngrams.cpp
│   └── sw_matrix.cpp
├── tests/
│   ├── testthat/
│   │   ├── newman.txt
│   │   ├── test-TextReuseCorpus.R
│   │   ├── test-TextReuseTextDocument.R
│   │   ├── test-alignment.R
│   │   ├── test-candidate_pairs.R
│   │   ├── test-filenames.R
│   │   ├── test-hashing.R
│   │   ├── test-jaccard.R
│   │   ├── test-lsh.R
│   │   ├── test-minhash.R
│   │   ├── test-pairwise_cf.R
│   │   ├── test-ratio_of_matches.R
│   │   ├── test-token_index.R
│   │   ├── test-tokenizers.R
│   │   ├── test-utils.R
│   │   └── test-wordcount.R
│   └── testthat.R
└── vignettes/
    ├── textreuse-alignment.Rmd
    ├── textreuse-introduction.Rmd
    ├── textreuse-minhash.Rmd
    └── textreuse-pairwise.Rmd
SYMBOL INDEX (9 symbols across 5 files)

FILE: src/RcppExports.cpp
  function RcppExport (line 15) | RcppExport SEXP _textreuse_hash_string(SEXP xSEXP) {
  function RcppExport (line 26) | RcppExport SEXP _textreuse_shingle_ngrams(SEXP wordsSEXP, SEXP nSEXP) {
  function RcppExport (line 38) | RcppExport SEXP _textreuse_skip_ngrams(SEXP wordsSEXP, SEXP nSEXP, SEXP ...
  function RcppExport (line 51) | RcppExport SEXP _textreuse_sw_matrix(SEXP mSEXP, SEXP aSEXP, SEXP bSEXP,...
  function RcppExport (line 75) | RcppExport void R_init_textreuse(DllInfo *dll) {

FILE: src/hash_string.cpp
  function IntegerVector (line 13) | IntegerVector hash_string(std::vector < std::string > x) {

FILE: src/shingle_ngrams.cpp
  function CharacterVector (line 6) | CharacterVector shingle_ngrams(CharacterVector words, int n) {

FILE: src/skip_ngrams.cpp
  function CharacterVector (line 8) | CharacterVector skip_ngrams(CharacterVector words, int n, int k) {

FILE: src/sw_matrix.cpp
  function IntegerMatrix (line 7) | IntegerMatrix sw_matrix(IntegerMatrix m, CharacterVector a, CharacterVec...
Condensed preview — 102 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (3,293K chars).
[
  {
    "path": ".Rbuildignore",
    "chars": 204,
    "preview": "^.*\\.Rproj$\n^\\.Rproj\\.user$\n^\\.git$\n^\\.r-lib$\n^README\\.Rmd$\n^README-*\\.png$\n^data-raw$\n^\\.travis\\.yml$\nwordnet\n^appveyor"
  },
  {
    "path": ".gitignore",
    "chars": 82,
    "preview": ".Rproj\n*.Rproj\n.Rproj.user\n.Rhistory\n.RData\n.Ruserdata\nsrc/*.o\nsrc/*.so\nsrc/*.dll\n"
  },
  {
    "path": ".travis.yml",
    "chars": 391,
    "preview": "language: r\nr:\n  - oldrel\n  - release\n  - devel\nsudo: false\ncache: packages\n\nafter_success:\n  - Rscript -e 'covr::codeco"
  },
  {
    "path": "CONDUCT.md",
    "chars": 1387,
    "preview": "# Contributor Code of Conduct\n\nAs contributors and maintainers of this project, we pledge to respect all people who \ncon"
  },
  {
    "path": "DESCRIPTION",
    "chars": 1394,
    "preview": "Package: textreuse\nType: Package\nTitle: Detect Text Reuse and Document Similarity\nVersion: 1.0.1\nDate: 2026-05-06\nAuthor"
  },
  {
    "path": "LICENSE",
    "chars": 60,
    "preview": "YEAR: 2026\nCOPYRIGHT HOLDER: Yaoxiang Li and Lincoln Mullen\n"
  },
  {
    "path": "Makefile",
    "chars": 216,
    "preview": ".PHONY : docs deploy-docs\n\ndocs :\n\tRscript -e \"pkgdown::clean_site(); pkgdown::build_site(run_dont_run = TRUE)\"\n\ndeploy-"
  },
  {
    "path": "NAMESPACE",
    "chars": 3227,
    "preview": "# Generated by roxygen2: do not edit by hand\n\nS3method(\"[\",TextReuseCorpus)\nS3method(\"[[\",TextReuseCorpus)\nS3method(\"con"
  },
  {
    "path": "NEWS.md",
    "chars": 3346,
    "preview": "# textreuse 1.0.1\n\nThis release brings together several years of maintenance and feature work to\nmake textreuse easier t"
  },
  {
    "path": "R/RcppExports.R",
    "chars": 750,
    "preview": "# Generated by using Rcpp::compileAttributes() -> do not edit by hand\n# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD"
  },
  {
    "path": "R/TextReuseCorpus.R",
    "chars": 8945,
    "preview": "#' TextReuseCorpus\n#'\n#' This is the constructor function for a \\code{TextReuseCorpus}, modeled on the\n#' virtual S3 cla"
  },
  {
    "path": "R/TextReuseTextDocument.R",
    "chars": 11127,
    "preview": "#' TextReuseTextDocument\n#'\n#' This is the constructor function for \\code{TextReuseTextDocument} objects.\n#' This class "
  },
  {
    "path": "R/align_local.R",
    "chars": 10565,
    "preview": "#' Local alignment of natural language texts\n#'\n#' This function takes two texts, either as strings or as\n#' \\code{TextR"
  },
  {
    "path": "R/conversion-functions.R",
    "chars": 1557,
    "preview": "#' Convert candidates data frames to other formats\n#'\n#' These functions convert a \\code{textreuse_candidates} object to"
  },
  {
    "path": "R/filenames.R",
    "chars": 820,
    "preview": "#' Filenames from paths\n#'\n#' This function takes a character vector of paths and returns just the file\n#' name, by defa"
  },
  {
    "path": "R/lsh.R",
    "chars": 6690,
    "preview": "#'Locality sensitive hashing for minhash\n#'\n#'Locality sensitive hashing (LSH) discovers potential matches among a corpu"
  },
  {
    "path": "R/lsh_candidates.R",
    "chars": 1294,
    "preview": "#' Candidate pairs from LSH comparisons\n#'\n#' Given a data frame of LSH buckets returned from \\code{\\link{lsh}}, this\n#'"
  },
  {
    "path": "R/lsh_compare.R",
    "chars": 2656,
    "preview": "#' Compare candidates identified by LSH\n#'\n#' The \\code{\\link{lsh_candidates}} only identifies potential matches, but\n#'"
  },
  {
    "path": "R/lsh_probability.R",
    "chars": 2222,
    "preview": "#' Probability that a candidate pair will be detected with LSH\n#'\n#' Functions to help choose the correct parameters for"
  },
  {
    "path": "R/lsh_query.R",
    "chars": 1397,
    "preview": "#' Query a LSH cache for matches to a single document\n#'\n#' This function retrieves the matches for a single document fr"
  },
  {
    "path": "R/lsh_subset.R",
    "chars": 828,
    "preview": "#' List of all candidates in a corpus\n#'\n#' @param candidates A data frame of candidate pairs from\n#'   \\code{\\link{lsh_"
  },
  {
    "path": "R/minhash.R",
    "chars": 2597,
    "preview": "#' Generate a minhash function\n#'\n#' A minhash value is calculated by hashing the strings in a character vector to\n#' in"
  },
  {
    "path": "R/pairwise_candidates.R",
    "chars": 1406,
    "preview": "#' Candidate pairs from pairwise comparisons\n#'\n#' Converts a comparison matrix generated by \\code{\\link{pairwise_compar"
  },
  {
    "path": "R/pairwise_compare.R",
    "chars": 2886,
    "preview": "#' Pairwise comparisons among documents in a corpus\n#'\n#' Given a \\code{\\link{TextReuseCorpus}} containing documents of "
  },
  {
    "path": "R/parallel.R",
    "chars": 413,
    "preview": "# Check if the option `mc.cores` has been set. If it has, return `mclapply`\n# instead of `lapply`. But in no circumstanc"
  },
  {
    "path": "R/rehash.R",
    "chars": 2119,
    "preview": "#' Recompute the hashes for a document or corpus\n#'\n#' Given a \\code{\\link{TextReuseTextDocument}} or a\n#' \\code{\\link{T"
  },
  {
    "path": "R/similarity.R",
    "chars": 6157,
    "preview": "#' Measure similarity/dissimilarity in documents\n#'\n#' A set of functions which take two sets or bag of words and measur"
  },
  {
    "path": "R/textreuse-package.r",
    "chars": 2099,
    "preview": "#' @details\n#' The best place to begin with this package in the introductory vignette.\n#'\n#' \\code{vignette(\"textreuse-i"
  },
  {
    "path": "R/token_index.R",
    "chars": 2765,
    "preview": "#' Build an index of tokens and documents\n#'\n#' Build an inverted index from tokens to the documents that contain them. "
  },
  {
    "path": "R/tokenize.R",
    "chars": 3216,
    "preview": "#' Recompute the tokens for a document or corpus\n#'\n#' Given a \\code{\\link{TextReuseTextDocument}} or a\n#' \\code{\\link{T"
  },
  {
    "path": "R/tokenizers.R",
    "chars": 2132,
    "preview": "#' Split texts into tokens\n#'\n#' These functions each turn a text into tokens. The \\code{tokenize_ngrams}\n#' functions r"
  },
  {
    "path": "R/utils.R",
    "chars": 2720,
    "preview": "# Take results of readLines and turn it into a character vector of length 1\nas_string <- function(x) {\n  x %>%\n    str_c"
  },
  {
    "path": "R/wordcount.R",
    "chars": 708,
    "preview": "#' Count words\n#'\n#' This function counts words in a text, for example, a character vector, a\n#' \\code{\\link{TextReuseTe"
  },
  {
    "path": "README.Rmd",
    "chars": 7396,
    "preview": "---\noutput: md_document\ntitle: Detect Text Reuse and Document Similarity\n---\n\n<!-- README.md is generated from README.Rm"
  },
  {
    "path": "README.md",
    "chars": 9205,
    "preview": "<!-- README.md is generated from README.Rmd. Please edit that file -->\n\n# textreuse\n\n[![CRAN\\_Status\\_Badge](https://www"
  },
  {
    "path": "_pkgdown.yml",
    "chars": 156,
    "preview": "url: https://docs.ropensci.org/textreuse/\n\ntemplate:\n  bootstrap: 5\n  bootswatch: united\n\nauthors:\n  Yaoxiang Li:\n    hr"
  },
  {
    "path": "appveyor.yml",
    "chars": 807,
    "preview": "# DO NOT CHANGE the \"init\" and \"install\" sections below\n\n# Download script file from GitHub\ninit:\n  ps: |\n        $Error"
  },
  {
    "path": "cran-comments.md",
    "chars": 1357,
    "preview": "This is a new release with bug fixes, documentation refreshes, and helper\nfunctions added after a long maintenance inter"
  },
  {
    "path": "inst/extdata/ats/calltounconv00baxt.txt",
    "chars": 734218,
    "preview": "\nGlass. \n\n\n\nBook,____._._ \n\n\n\n(ttmmtixc \n\n\n\nmm¥m %m(m>m \n\n\n\nv'OJj \n\n\n\n\nA \n\nCALL \n\nTO \n\nTHE TTNCONVEETS9. \n\n\n\nBY REV. RIC"
  },
  {
    "path": "inst/extdata/ats/gospeltruth00whit.txt",
    "chars": 86701,
    "preview": "\n\n\nCvCC \n\n\n\na c: CIS: \n\n■ <£ <tC. <L_C \n\nc cc\" «C C \n\na cc c~;c \n\nccc <:;cr \n\nc «c o \n\nC <CI C \n\nC C <ZT \n\nc c <n< \n\nc c"
  },
  {
    "path": "inst/extdata/ats/lifeofrevrichard00baxt.txt",
    "chars": 251406,
    "preview": "T H E L I F E \n\n\n\nor \n\n\n\n\nREV. RICHARD BAXTER. \n\n\n\nODEFLT COMPILED FROM HIS OWN WRITIKOS. \n\n\n\nPUBLISHED BY THE \n\nAMERICA"
  },
  {
    "path": "inst/extdata/ats/memoirjamesbrai00ricegoog.txt",
    "chars": 727439,
    "preview": "Google \n\n\n\nThis is a digital copy of a book that was preserved for generations on library shelves before it was carefull"
  },
  {
    "path": "inst/extdata/ats/practicalthought00nev.txt",
    "chars": 684042,
    "preview": "LIBEAEY \n\n^biological £eminarg, \n\n\n\nPRINCETON, N. J. \n:ion . \n\n\n\nNo. Case, \nNo. Shelf ^ \nNo. Book, \n\n\n\n\n- \n\n\n\n\n/077¥ \n\n\n"
  },
  {
    "path": "inst/extdata/ats/remember00palm.txt",
    "chars": 64971,
    "preview": "Remember \nBy \nRat Palmer. \nBoston: \n\nTHE AMERICAN TRACT SOCI] \n\nDepositories, 28 Cornhill, Boston ; and 13 Biblb House, "
  },
  {
    "path": "inst/extdata/ats/remembermeorholy00palm.txt",
    "chars": 66508,
    "preview": "//Wf \n\n\n\n'/^y /L.-*^ \n\n\n\nJ?i.{^, \n\n\n\nZHf^ \n\n\n\nt/V. \n\n\n\nRemember Me; \n\n\n\nOR, \n\n\n\n\n\n\nBy \n\nRa2^ Palmer. \n\n\n\nJS js 1 n : \n\nI"
  },
  {
    "path": "inst/extdata/ats/thoughtsonpopery00nevi.txt",
    "chars": 357596,
    "preview": "L I B H -A. n \"sr \n\nPBINCETOK. y. J. \nThe Stephen Collins Donation. \n\n\n\nNo. Casc^ \n\n\n\nDivj^ip \n\n\n\nNo. ^^^^(A_Sjecti0nJk "
  },
  {
    "path": "inst/extdata/legal/ca1851-match.txt",
    "chars": 4019,
    "preview": "§ 4. Every action shall be prosecuted in the name of the real party\nin interest, except as otherwise provided in this A"
  },
  {
    "path": "inst/extdata/legal/ca1851-nomatch.txt",
    "chars": 1246,
    "preview": "§ 313. If the judgment be rendered upon the right of the person so\nalleged to be entitled, and the same be in favor of "
  },
  {
    "path": "inst/extdata/legal/ny1850-match.txt",
    "chars": 4437,
    "preview": "§ 597. Every action must be prosecuted in the name\nof the real party in interest, except as otherwise provided in secti"
  },
  {
    "path": "man/TextReuseCorpus.Rd",
    "chars": 3579,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/TextReuseCorpus.R\n\\name{TextReuseCorpus}\n\\"
  },
  {
    "path": "man/TextReuseTextDocument-accessors.Rd",
    "chars": 713,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/TextReuseTextDocument.R\n\\name{TextReuseTex"
  },
  {
    "path": "man/TextReuseTextDocument.Rd",
    "chars": 3821,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/TextReuseTextDocument.R\n\\name{TextReuseTex"
  },
  {
    "path": "man/align_local.Rd",
    "chars": 3783,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/align_local.R\n\\name{align_local}\n\\alias{al"
  },
  {
    "path": "man/as.matrix.textreuse_candidates.Rd",
    "chars": 654,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/conversion-functions.R\n\\name{as.matrix.tex"
  },
  {
    "path": "man/filenames.Rd",
    "chars": 793,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/filenames.R\n\\name{filenames}\n\\alias{filena"
  },
  {
    "path": "man/hash_string.Rd",
    "chars": 426,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/RcppExports.R\n\\name{hash_string}\n\\alias{ha"
  },
  {
    "path": "man/lsh.Rd",
    "chars": 3249,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/lsh.R\n\\name{lsh}\n\\alias{lsh}\n\\title{Locali"
  },
  {
    "path": "man/lsh_add.Rd",
    "chars": 1119,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/lsh.R\n\\name{lsh_add}\n\\alias{lsh_add}\n\\titl"
  },
  {
    "path": "man/lsh_candidates.Rd",
    "chars": 796,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/lsh_candidates.R\n\\name{lsh_candidates}\n\\al"
  },
  {
    "path": "man/lsh_compare.Rd",
    "chars": 1716,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/lsh_compare.R\n\\name{lsh_compare}\n\\alias{ls"
  },
  {
    "path": "man/lsh_probability.Rd",
    "chars": 1997,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/lsh_probability.R\n\\name{lsh_probability}\n\\"
  },
  {
    "path": "man/lsh_query.Rd",
    "chars": 1044,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/lsh_query.R\n\\name{lsh_query}\n\\alias{lsh_qu"
  },
  {
    "path": "man/lsh_subset.Rd",
    "chars": 878,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/lsh_subset.R\n\\name{lsh_subset}\n\\alias{lsh_"
  },
  {
    "path": "man/minhash_generator.Rd",
    "chars": 1925,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/minhash.R\n\\name{minhash_generator}\n\\alias{"
  },
  {
    "path": "man/pairwise_candidates.Rd",
    "chars": 1112,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/pairwise_candidates.R\n\\name{pairwise_candi"
  },
  {
    "path": "man/pairwise_compare.Rd",
    "chars": 2198,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/pairwise_compare.R\n\\name{pairwise_compare}"
  },
  {
    "path": "man/reexports.Rd",
    "chars": 568,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/TextReuseTextDocument.R\n\\docType{import}\n\\"
  },
  {
    "path": "man/rehash.Rd",
    "chars": 1262,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/rehash.R\n\\name{rehash}\n\\alias{rehash}\n\\tit"
  },
  {
    "path": "man/similarity-functions.Rd",
    "chars": 4048,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/similarity.R\n\\name{similarity-functions}\n\\"
  },
  {
    "path": "man/textreuse-package.Rd",
    "chars": 2744,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/textreuse-package.R\n\\docType{package}\n\\nam"
  },
  {
    "path": "man/token_index.Rd",
    "chars": 963,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/token_index.R\n\\name{token_index}\n\\alias{to"
  },
  {
    "path": "man/token_index_candidates.Rd",
    "chars": 478,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/token_index.R\n\\name{token_index_candidates"
  },
  {
    "path": "man/tokenize.Rd",
    "chars": 1450,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/tokenize.R\n\\name{tokenize}\n\\alias{tokenize"
  },
  {
    "path": "man/tokenizers.Rd",
    "chars": 1282,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/tokenizers.R\n\\name{tokenizers}\n\\alias{toke"
  },
  {
    "path": "man/wordcount.Rd",
    "chars": 525,
    "preview": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/wordcount.R\n\\name{wordcount}\n\\alias{wordco"
  },
  {
    "path": "pkgdown/extra.css",
    "chars": 444,
    "preview": ":root {\n  --bs-body-font-size: 16px;\n  --bs-body-line-height: 1.5;\n}\n\nbody,\n.contents,\n.template-home,\n.template-article"
  },
  {
    "path": "src/RcppExports.cpp",
    "chars": 3186,
    "preview": "// Generated by using Rcpp::compileAttributes() -> do not edit by hand\n// Generator token: 10BE3573-1514-4C36-9D1C-5A225"
  },
  {
    "path": "src/hash_string.cpp",
    "chars": 654,
    "preview": "#include <Rcpp.h>\n#include <boost/functional/hash.hpp>\nusing namespace Rcpp;\n\n//' Hash a string to an integer\n//' @param"
  },
  {
    "path": "src/shingle_ngrams.cpp",
    "chars": 618,
    "preview": "#include <Rcpp.h>\nusing namespace Rcpp;\n\n// Create shingled n-grams\n// [[Rcpp::export]]\nCharacterVector shingle_ngrams(C"
  },
  {
    "path": "src/skip_ngrams.cpp",
    "chars": 1496,
    "preview": "#include <Rcpp.h>\nusing namespace Rcpp;\n\n// Skip n-grams\n// @param n = number of words in an n-gram\n// @param k = max nu"
  },
  {
    "path": "src/sw_matrix.cpp",
    "chars": 908,
    "preview": "#include <progress.hpp>\n#include <Rcpp.h>\nusing namespace Rcpp;\n\n// [[Rcpp::depends(RcppProgress)]]\n// [[Rcpp::export]]\n"
  },
  {
    "path": "tests/testthat/newman.txt",
    "chars": 2015,
    "preview": "And now that I am about to trace, as far as I can, the course of that \ngreat revolution of mind, which led me to leave m"
  },
  {
    "path": "tests/testthat/test-TextReuseCorpus.R",
    "chars": 5539,
    "preview": "context(\"TextReuseCorpus\")\n\nny <- system.file(\"extdata/legal/ny1850-match.txt\", package = \"textreuse\")\nca1 <- system.fil"
  },
  {
    "path": "tests/testthat/test-TextReuseTextDocument.R",
    "chars": 4411,
    "preview": "context(\"TextReuseTextDocument\")\n\ndoc <- TextReuseTextDocument(file = \"newman.txt\", keep_tokens = TRUE)\ntest_meta <- lis"
  },
  {
    "path": "tests/testthat/test-alignment.R",
    "chars": 1635,
    "preview": "context(\"Alignment\")\n\ntest_that(\"returns correct results with edits properly marked\", {\n  a <- \"How can we tell if this "
  },
  {
    "path": "tests/testthat/test-candidate_pairs.R",
    "chars": 1075,
    "preview": "context(\"Candidate pairs\")\ndir <- system.file(\"extdata/legal\", package = \"textreuse\")\ncorpus <- TextReuseCorpus(dir = di"
  },
  {
    "path": "tests/testthat/test-filenames.R",
    "chars": 338,
    "preview": "context(\"Filenames\")\n\npaths <- c(\"corpus/one.txt\", \"deep/corpus/two.R\", \"~/home/three.markdown\",\n           \"/corpus/fou"
  },
  {
    "path": "tests/testthat/test-hashing.R",
    "chars": 562,
    "preview": "context(\"Hashing\")\n\nlines  <- system.file(\"extdata/legal/ny1850-match.txt\", package = \"textreuse\") %>%\n  readLines()\nngr"
  },
  {
    "path": "tests/testthat/test-jaccard.R",
    "chars": 1186,
    "preview": "context(\"Jaccard coefficients\")\n\ntest_that(\"calculates the similarity coefficient correctly\", {\n  expect_equal(jaccard_s"
  },
  {
    "path": "tests/testthat/test-lsh.R",
    "chars": 2887,
    "preview": "context(\"LSH\")\n\ndir <- system.file(\"extdata/legal\", package = \"textreuse\")\nminhash <- minhash_generator(200, seed = 9228"
  },
  {
    "path": "tests/testthat/test-minhash.R",
    "chars": 1088,
    "preview": "context(\"Minhash\")\n\nmhash <- minhash_generator()\nfile <- system.file(\"extdata/legal/ny1850-match.txt\", package = \"textre"
  },
  {
    "path": "tests/testthat/test-pairwise_cf.R",
    "chars": 756,
    "preview": "context(\"Pairwise comparison\")\n\ndir <- system.file(\"extdata/legal\", package = \"textreuse\")\ncorpus <- TextReuseCorpus(dir"
  },
  {
    "path": "tests/testthat/test-ratio_of_matches.R",
    "chars": 1771,
    "preview": "context(\"Ratio of matches\")\n\ntest_that(\"calculates the value correctly\", {\n  expect_equal(ratio_of_matches(1:4, 3:5), 2/"
  },
  {
    "path": "tests/testthat/test-token_index.R",
    "chars": 1018,
    "preview": "context(\"Token index\")\n\ntexts <- c(a = \"one two three four\",\n           b = \"one two three five\",\n           c = \"six se"
  },
  {
    "path": "tests/testthat/test-tokenizers.R",
    "chars": 3035,
    "preview": "context(\"Tokenizers\")\n\nsentence <- \"This is a sentence which has a number of words in it; also some\n             tricky "
  },
  {
    "path": "tests/testthat/test-utils.R",
    "chars": 955,
    "preview": "context(\"Utils\")\n\ntest_that(\"as_string returns the correct type\", {\n  s <- as_string(c(\"First\", \"Second\"))\n  expect_is(s"
  },
  {
    "path": "tests/testthat/test-wordcount.R",
    "chars": 517,
    "preview": "context(\"Word counts\")\n\ndir <- system.file(\"extdata/legal\", package = \"textreuse\")\ncorpus <- TextReuseCorpus(dir = dir)\n"
  },
  {
    "path": "tests/testthat.R",
    "chars": 62,
    "preview": "library(testthat)\nlibrary(textreuse)\n\ntest_check(\"textreuse\")\n"
  },
  {
    "path": "vignettes/textreuse-alignment.Rmd",
    "chars": 2126,
    "preview": "---\ntitle: \"Text Alignment\"\nauthor:\n  - \"Lincoln Mullen\"\n  - \"Yaoxiang Li\"\ndate: \"`r Sys.Date()`\"\noutput: rmarkdown::htm"
  },
  {
    "path": "vignettes/textreuse-introduction.Rmd",
    "chars": 8488,
    "preview": "---\ntitle: \"Introduction to the textreuse package\"\nauthor:\n  - \"Lincoln Mullen\"\n  - \"Yaoxiang Li\"\ndate: \"`r Sys.Date()`\""
  },
  {
    "path": "vignettes/textreuse-minhash.Rmd",
    "chars": 6516,
    "preview": "---\ntitle: \"Minhash and locality-sensitive hashing\"\nauthor:\n  - \"Lincoln Mullen\"\n  - \"Yaoxiang Li\"\ndate: \"`r Sys.Date()`"
  },
  {
    "path": "vignettes/textreuse-pairwise.Rmd",
    "chars": 2248,
    "preview": "---\ntitle: \"Pairwise comparisons for document similarity\"\nauthor:\n  - \"Lincoln Mullen\"\n  - \"Yaoxiang Li\"\ndate: \"`r Sys.D"
  }
]

About this extraction

This page contains the full source code of the ropensci/textreuse GitHub repository, extracted and formatted as plain text. The extraction includes 102 files (3.0 MB), approximately 801.6k tokens, and a symbol index with 9 extracted functions, classes, methods, constants, and types.

