Repository: ropensci/textreuse
Branch: master
Commit: 6f8cbe380295
Files: 102
Total size: 3.0 MB
Directory structure:
gitextract_vbmxaw27/
├── .Rbuildignore
├── .gitignore
├── .travis.yml
├── CONDUCT.md
├── DESCRIPTION
├── LICENSE
├── Makefile
├── NAMESPACE
├── NEWS.md
├── R/
│ ├── RcppExports.R
│ ├── TextReuseCorpus.R
│ ├── TextReuseTextDocument.R
│ ├── align_local.R
│ ├── conversion-functions.R
│ ├── filenames.R
│ ├── lsh.R
│ ├── lsh_candidates.R
│ ├── lsh_compare.R
│ ├── lsh_probability.R
│ ├── lsh_query.R
│ ├── lsh_subset.R
│ ├── minhash.R
│ ├── pairwise_candidates.R
│ ├── pairwise_compare.R
│ ├── parallel.R
│ ├── rehash.R
│ ├── similarity.R
│ ├── textreuse-package.r
│ ├── token_index.R
│ ├── tokenize.R
│ ├── tokenizers.R
│ ├── utils.R
│ └── wordcount.R
├── README.Rmd
├── README.md
├── _pkgdown.yml
├── appveyor.yml
├── cran-comments.md
├── inst/
│ └── extdata/
│   ├── ats/
│   │ ├── calltounconv00baxt.txt
│   │ ├── gospeltruth00whit.txt
│   │ ├── lifeofrevrichard00baxt.txt
│   │ ├── memoirjamesbrai00ricegoog.txt
│   │ ├── practicalthought00nev.txt
│   │ ├── remember00palm.txt
│   │ ├── remembermeorholy00palm.txt
│   │ └── thoughtsonpopery00nevi.txt
│   └── legal/
│     ├── ca1851-match.txt
│     ├── ca1851-nomatch.txt
│     └── ny1850-match.txt
├── man/
│ ├── TextReuseCorpus.Rd
│ ├── TextReuseTextDocument-accessors.Rd
│ ├── TextReuseTextDocument.Rd
│ ├── align_local.Rd
│ ├── as.matrix.textreuse_candidates.Rd
│ ├── filenames.Rd
│ ├── hash_string.Rd
│ ├── lsh.Rd
│ ├── lsh_add.Rd
│ ├── lsh_candidates.Rd
│ ├── lsh_compare.Rd
│ ├── lsh_probability.Rd
│ ├── lsh_query.Rd
│ ├── lsh_subset.Rd
│ ├── minhash_generator.Rd
│ ├── pairwise_candidates.Rd
│ ├── pairwise_compare.Rd
│ ├── reexports.Rd
│ ├── rehash.Rd
│ ├── similarity-functions.Rd
│ ├── textreuse-package.Rd
│ ├── token_index.Rd
│ ├── token_index_candidates.Rd
│ ├── tokenize.Rd
│ ├── tokenizers.Rd
│ └── wordcount.Rd
├── pkgdown/
│ └── extra.css
├── src/
│ ├── RcppExports.cpp
│ ├── hash_string.cpp
│ ├── shingle_ngrams.cpp
│ ├── skip_ngrams.cpp
│ └── sw_matrix.cpp
├── tests/
│ ├── testthat/
│ │ ├── newman.txt
│ │ ├── test-TextReuseCorpus.R
│ │ ├── test-TextReuseTextDocument.R
│ │ ├── test-alignment.R
│ │ ├── test-candidate_pairs.R
│ │ ├── test-filenames.R
│ │ ├── test-hashing.R
│ │ ├── test-jaccard.R
│ │ ├── test-lsh.R
│ │ ├── test-minhash.R
│ │ ├── test-pairwise_cf.R
│ │ ├── test-ratio_of_matches.R
│ │ ├── test-token_index.R
│ │ ├── test-tokenizers.R
│ │ ├── test-utils.R
│ │ └── test-wordcount.R
│ └── testthat.R
└── vignettes/
  ├── textreuse-alignment.Rmd
  ├── textreuse-introduction.Rmd
  ├── textreuse-minhash.Rmd
  └── textreuse-pairwise.Rmd
================================================
FILE CONTENTS
================================================
================================================
FILE: .Rbuildignore
================================================
^.*\.Rproj$
^\.Rproj\.user$
^\.git$
^\.r-lib$
^README\.Rmd$
^README-*\.png$
^data-raw$
^\.travis\.yml$
wordnet
^appveyor\.yml$
^CONDUCT\.md$
^cran-comments\.md$
^Makefile$
^_pkgdown\.yml$
^pkgdown$
docs/
================================================
FILE: .gitignore
================================================
.Rproj
*.Rproj
.Rproj.user
.Rhistory
.RData
.Ruserdata
src/*.o
src/*.so
src/*.dll
================================================
FILE: .travis.yml
================================================
language: r
r:
- oldrel
- release
- devel
sudo: false
cache: packages
after_success:
- Rscript -e 'covr::codecov()'
notifications:
  email:
    on_success: change
    on_failure: change
  slack:
    secure: gxP5b9VO52sKP72YB1iFwt5U73s6O1nq9o1vH6ddrvEIRgpzSQO7lIH8/KYfjj+eFRXCIWtFnrkar2kw2sfGJVERnJ9R13XtVDc23tApkZjacTxHUov39WbS4zI03Tb9pX86ywUNcs0rhVKok3CD9V80fybd3nFy8Vy/ugSBp7s=
================================================
FILE: CONDUCT.md
================================================
# Contributor Code of Conduct
As contributors and maintainers of this project, we pledge to respect all people who
contribute through reporting issues, posting feature requests, updating documentation,
submitting pull requests or patches, and other activities.
We are committed to making participation in this project a harassment-free experience for
everyone, regardless of level of experience, gender, gender identity and expression,
sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
Examples of unacceptable behavior by participants include the use of sexual language or
imagery, derogatory comments or personal attacks, trolling, public or private harassment,
insults, or other unprofessional conduct.
Project maintainers have the right and responsibility to remove, edit, or reject comments,
commits, code, wiki edits, issues, and other contributions that are not aligned to this
Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed
from the project team.
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by
opening an issue or contacting one or more of the project maintainers.
This Code of Conduct is adapted from the Contributor Covenant
(http://contributor-covenant.org), version 1.0.0, available at
http://contributor-covenant.org/version/1/0/0/
================================================
FILE: DESCRIPTION
================================================
Package: textreuse
Type: Package
Title: Detect Text Reuse and Document Similarity
Version: 1.0.1
Date: 2026-05-06
Authors@R: c(
    person("Lincoln", "Mullen", role = "aut",
           comment = c(ORCID = "0000-0001-5103-6917")
    ),
    person("Yaoxiang", "Li", role = c("aut", "cre"),
           email = "liyaoxiang@outlook.com",
           comment = c(ORCID = "0000-0001-9200-1016")))
Description: Tools for measuring similarity among documents and detecting
    passages which have been reused. Implements shingled n-gram, skip n-gram,
    and other tokenizers; similarity/dissimilarity functions; pairwise
    comparisons; minhash and locality sensitive hashing algorithms; and a
    version of the Smith-Waterman local alignment algorithm suitable for
    natural language.
License: MIT + file LICENSE
LazyData: TRUE
URL: https://docs.ropensci.org/textreuse/,
    https://github.com/ropensci/textreuse
BugReports: https://github.com/ropensci/textreuse/issues
VignetteBuilder: knitr
Depends:
    R (>= 3.1.1)
Imports:
    assertthat (>= 0.1),
    digest (>= 0.6.8),
    dplyr (>= 0.8.0),
    NLP (>= 0.1.8),
    Matrix,
    Rcpp (>= 0.12.0),
    RcppProgress (>= 0.1),
    stringr (>= 1.0.0),
    tibble (>= 3.0.1),
    tidyr (>= 1.0.0)
Suggests:
    testthat (>= 0.11.0),
    knitr (>= 1.11),
    rmarkdown (>= 0.8),
    covr
LinkingTo: BH, Rcpp, RcppProgress
RoxygenNote: 7.3.2
Encoding: UTF-8
================================================
FILE: LICENSE
================================================
YEAR: 2026
COPYRIGHT HOLDER: Yaoxiang Li and Lincoln Mullen
================================================
FILE: Makefile
================================================
.PHONY : docs deploy-docs
docs :
	Rscript -e "pkgdown::clean_site(); pkgdown::build_site(run_dont_run = TRUE)"
deploy-docs :
	@echo "Documentation is published by rOpenSci at https://docs.ropensci.org/textreuse/"
================================================
FILE: NAMESPACE
================================================
# Generated by roxygen2: do not edit by hand
S3method("[",TextReuseCorpus)
S3method("[[",TextReuseCorpus)
S3method("content<-",TextReuseTextDocument)
S3method("hashes<-",TextReuseTextDocument)
S3method("meta<-",TextReuseCorpus)
S3method("meta<-",TextReuseTextDocument)
S3method("minhashes<-",TextReuseTextDocument)
S3method("names<-",TextReuseCorpus)
S3method("tokens<-",TextReuseTextDocument)
S3method(align_local,TextReuseTextDocument)
S3method(align_local,default)
S3method(as.character,TextReuseTextDocument)
S3method(as.matrix,textreuse_candidates)
S3method(content,TextReuseTextDocument)
S3method(count_matches,TextReuseTextDocument)
S3method(count_matches,default)
S3method(hashes,TextReuseCorpus)
S3method(hashes,TextReuseTextDocument)
S3method(jaccard_bag_similarity,TextReuseTextDocument)
S3method(jaccard_bag_similarity,default)
S3method(jaccard_dissimilarity,default)
S3method(jaccard_similarity,TextReuseTextDocument)
S3method(jaccard_similarity,default)
S3method(length,TextReuseCorpus)
S3method(lsh,TextReuseCorpus)
S3method(lsh,TextReuseTextDocument)
S3method(matching_tokens,TextReuseTextDocument)
S3method(matching_tokens,default)
S3method(meta,TextReuseCorpus)
S3method(meta,TextReuseTextDocument)
S3method(minhashes,TextReuseCorpus)
S3method(minhashes,TextReuseTextDocument)
S3method(names,TextReuseCorpus)
S3method(print,TextReuseCorpus)
S3method(print,TextReuseTextDocument)
S3method(print,textreuse_alignment)
S3method(ratio_of_matches,TextReuseTextDocument)
S3method(ratio_of_matches,default)
S3method(rehash,TextReuseCorpus)
S3method(rehash,TextReuseTextDocument)
S3method(tokenize,TextReuseCorpus)
S3method(tokenize,TextReuseTextDocument)
S3method(tokens,TextReuseCorpus)
S3method(tokens,TextReuseTextDocument)
S3method(wordcount,TextDocument)
S3method(wordcount,TextReuseCorpus)
S3method(wordcount,default)
export("content<-")
export("hashes<-")
export("meta<-")
export("minhashes<-")
export("tokens<-")
export(TextReuseCorpus)
export(TextReuseTextDocument)
export(align_local)
export(as_sparse_matrix)
export(content)
export(count_matches)
export(filenames)
export(has_content)
export(has_hashes)
export(has_minhashes)
export(has_tokens)
export(hash_string)
export(hashes)
export(is.TextReuseCorpus)
export(is.TextReuseTextDocument)
export(jaccard_bag_similarity)
export(jaccard_dissimilarity)
export(jaccard_similarity)
export(lsh)
export(lsh_add)
export(lsh_candidates)
export(lsh_compare)
export(lsh_probability)
export(lsh_query)
export(lsh_subset)
export(lsh_threshold)
export(matching_tokens)
export(meta)
export(minhash_generator)
export(minhashes)
export(pairwise_candidates)
export(pairwise_compare)
export(ratio_of_matches)
export(rehash)
export(skipped)
export(token_index)
export(token_index_candidates)
export(tokenize)
export(tokenize_ngrams)
export(tokenize_sentences)
export(tokenize_skip_ngrams)
export(tokenize_words)
export(tokens)
export(wordcount)
import(RcppProgress)
import(assertthat)
import(stringr)
importFrom(NLP,"content<-")
importFrom(NLP,"meta<-")
importFrom(NLP,content)
importFrom(NLP,meta)
importFrom(Rcpp,sourceCpp)
importFrom(utils,getTxtProgressBar)
importFrom(utils,setTxtProgressBar)
importFrom(utils,txtProgressBar)
useDynLib(textreuse, .registration = TRUE)
================================================
FILE: NEWS.md
================================================
# textreuse 1.0.1
This release brings together several years of maintenance and feature work to
make textreuse easier to use on current R installations and more practical for
larger document collections.
This is a CRAN resubmission that fixes a moved README URL reported by CRAN
incoming checks.
## Text input and corpus construction
- `TextReuseTextDocument()` and `TextReuseCorpus()` now accept an `encoding`
argument, making it easier to read source files whose text encoding is known
or differs from the platform default.
- `TextReuseCorpus()` now keeps skipped-document bookkeeping deterministic.
Skipped documents are reported consistently, and skip metadata is available
even when `skip_short = FALSE`.
- Very short documents are handled more predictably when skip n-grams are used,
avoiding assertion failures and making corpus construction easier to diagnose.
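
The short-document rule behind these changes can be sketched in base R. The snippet below only restates the arithmetic of the package's internal `short_document_word_minimum()` helper for illustration; the function name `min_words_needed()` and its `skip` flag are hypothetical and are not part of the textreuse API.

```r
# Hypothetical helper (not exported by textreuse): the minimum number of
# words a document needs before it is tokenized rather than skipped.
min_words_needed <- function(n = 3, k = 1, skip = FALSE) {
  if (skip) {
    # Skip n-grams: n words plus up to k skipped words in each of the
    # n - 1 gaps between them, i.e. n + k * (n - 1).
    return(n + n * k - k)
  }
  # Shingled n-grams: one extra word so that at least two n-grams exist.
  n + 1
}

min_words_needed(n = 5)                      # 6
min_words_needed(n = 3, k = 1, skip = TRUE)  # 5
```

When `skip_short = TRUE`, documents with fewer words than this minimum are dropped from the corpus and their IDs are returned by `skipped()`.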
## Alignment and match inspection
- `align_local()` now returns an empty local alignment instead of throwing an
error when two texts have no matching words. This makes batch alignment
workflows easier to run because no-match pairs can be represented directly.
- `align_local()` gains `preserve_punctuation`, allowing displayed alignments to
keep punctuation from the original texts when that context is useful.
- New `count_matches()` and `matching_tokens()` helpers expose absolute match
counts and the matched tokens themselves, so users can inspect what drove a
similarity score rather than relying only on a ratio.
## Candidate generation and comparison
- New token-index helpers find candidate document pairs from shared n-grams,
giving users another way to identify likely reuse pairs before running more
expensive comparisons.
- `pairwise_candidates()` and matrix conversion now preserve all document IDs,
including documents without returned candidate pairs.
- `as_sparse_matrix()` provides a sparse matrix representation of candidate
results, which is more convenient for downstream modeling, graph analysis, and
workflows with many documents.
## Locality-sensitive hashing
- `lsh_add()` can add new documents to an existing LSH bucket cache, so users can
extend an index without rebuilding it from scratch.
- `lsh_compare()` can run comparisons in parallel on non-Windows platforms when
`options(mc.cores)` is set.
- Long-running C++ hashing and n-gram loops now check for user interrupts, so
expensive jobs can be stopped more cleanly from R.
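
As a rough illustration of the `mc.cores` switch mentioned above, the following base R sketch picks between serial and forked execution. It is an assumption about how such a dispatcher can look, not the package's internal code; `choose_apply()` is a hypothetical name.

```r
# Hypothetical dispatcher (NOT textreuse's implementation): use forked
# workers only when "mc.cores" asks for more than one core and the
# platform is not Windows; otherwise fall back to plain lapply().
choose_apply <- function() {
  cores <- getOption("mc.cores", 1L)
  if (.Platform$OS.type != "windows" && cores > 1L) {
    function(X, FUN, ...) parallel::mclapply(X, FUN, ..., mc.cores = cores)
  } else {
    lapply
  }
}

options(mc.cores = 2L)  # request two cores for subsequent calls
apply_func <- choose_apply()
squares <- apply_func(1:4, function(i) i^2)
unlist(squares)         # 1 4 9 16
```

Both branches return a list in input order, so calling code does not need to know which one was chosen.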
## Compatibility and documentation
- Compatibility with current dplyr and tidyr releases has been refreshed.
- README, vignette, reference, and pkgdown examples were regenerated against
current package output.
- Stale external links and documentation badges were updated so package checks
and the public documentation site are cleaner.
# textreuse 0.1.4
- Preventative maintenance release to avoid failing tests when a new version of
BH is released.
# textreuse 0.1.3
- Preventative maintenance release to avoid failing tests when new versions of
the dplyr and testthat packages are released.
# textreuse 0.1.2
- Fix memory error in `shingle_ngrams()`
- Fix tests for retokenizing on Windows
- More informative error message if using `lsh()` on corpora without minhashes
# textreuse 0.1.1
- Fix progress bars in vignettes
# textreuse 0.1.0
- Initial release
================================================
FILE: R/RcppExports.R
================================================
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
#' Hash a string to an integer
#' @param x A character vector to be hashed.
#' @return A vector of integer hashes.
#' @examples
#' s <- c("How", "many", "roads", "must", "a", "man", "walk", "down")
#' hash_string(s)
#' @export
hash_string <- function(x) {
.Call(`_textreuse_hash_string`, x)
}
shingle_ngrams <- function(words, n) {
.Call(`_textreuse_shingle_ngrams`, words, n)
}
skip_ngrams <- function(words, n, k) {
.Call(`_textreuse_skip_ngrams`, words, n, k)
}
sw_matrix <- function(m, a, b, match, mismatch, gap, progress) {
.Call(`_textreuse_sw_matrix`, m, a, b, match, mismatch, gap, progress)
}
================================================
FILE: R/TextReuseCorpus.R
================================================
#' TextReuseCorpus
#'
#' This is the constructor function for a \code{TextReuseCorpus}, modeled on the
#' virtual S3 class \code{Corpus} from the \code{tm} package. The
#' object is a \code{TextReuseCorpus}, which is basically a list containing
#' objects of class \code{\link{TextReuseTextDocument}}. Arguments are passed
#' along to that constructor function. To create the corpus, you can pass either
#' a character vector of paths to text files using the \code{paths =} parameter,
#' a directory containing text files (with any extension) using the \code{dir =}
#' parameter, or a character vector of documents using the \code{text = }
#' parameter, where each element in the character vector is a document. If the
#' character vector passed to \code{text = } has names, then those names will be
#' used as the document IDs. Otherwise, IDs will be assigned to the documents.
#' Only one of the \code{paths}, \code{dir}, or \code{text} parameters should be
#' specified.
#'
#' @details If \code{skip_short = TRUE}, this function will skip very short or
#' empty documents. A very short document is one where there are too few words
#' to create at least two n-grams. For example, if five-grams are desired,
#' then a document must be at least six words long. If no value of \code{n} is
#' provided, then the function assumes a value of \code{n = 3}. A warning will
#' be printed with the document ID of each skipped document. Use
#' \code{skipped()} to get the IDs of skipped documents.
#'
#' This function will use multiple cores on non-Windows machines if the
#' \code{"mc.cores"} option is set. For example, to use four cores:
#' \code{options("mc.cores" = 4L)}.
#'
#' @param paths A character vector of paths to files to be opened.
#' @param dir The path to a directory of text files.
#' @param text A character vector (possibly named) of documents.
#' @param meta A list with named elements for the metadata associated with this
#' corpus.
#' @param progress Display a progress bar while loading files.
#' @param tokenizer A function to split the text into tokens. See
#' \code{\link{tokenizers}}. If value is \code{NULL}, then tokenizing and
#' hashing will be skipped.
#' @param ... Arguments passed on to the \code{tokenizer}.
#' @param hash_func A function to hash the tokens. See
#' \code{\link{hash_string}}.
#' @param minhash_func A function to create minhash signatures of the document.
#' See \code{\link{minhash_generator}}.
#' @param keep_tokens Should the tokens be saved in the documents that are
#' returned or discarded?
#' @param keep_text Should the text be saved in the documents that are returned
#' or discarded?
#' @param skip_short Should short documents be skipped? (See details.)
#' @param encoding Encoding to be used when reading files.
#'
#' @seealso \link[=TextReuseTextDocument-accessors]{Accessors for TextReuse
#' objects}.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir, meta = list("description" = "Field Codes"))
#' # Subset by position or file name
#' corpus[[1]]
#' names(corpus)
#' corpus[["ca1851-match"]]
#'
#' @export
TextReuseCorpus <- function(paths, dir = NULL, text = NULL, meta = list(),
progress = interactive(),
tokenizer = tokenize_ngrams, ...,
hash_func = hash_string,
minhash_func = NULL,
keep_tokens = FALSE,
keep_text = TRUE,
skip_short = TRUE,
encoding = "unknown") {
if (!is.null(tokenizer)) {
assert_that(is.function(tokenizer),
is.function(hash_func))
tokenizer_name <- as.character(substitute(tokenizer))
hash_func_name <- as.character(substitute(hash_func))
if (!is.null(minhash_func)) {
minhash_func_name <- as.character(substitute(minhash_func))
} else {
minhash_func_name <- NULL
}
loading_msg <- "Loading, tokenizing, and hashing "
} else {
tokenizer_name <- NULL
hash_func_name <- NULL
minhash_func_name <- NULL
loading_msg <- "Loading "
}
apply_func <- get_apply_function()
# If we get a character vector of documents, use that; otherwise load
# the files from disk.
if (!missing(text)) {
assert_that(missing(paths),
is.null(dir),
is.character(text))
if (progress) {
len <- length(text)
message(loading_msg, prettyNum(len, big.mark = ","), " documents.")
if (using_parallel())
progress <- FALSE
else
pb <- txtProgressBar(min = 0, max = len, style = 3)
}
if (is.null(names(text)))
names(text) <- str_c("doc-", seq_along(text))
docs <- apply_func(seq_along(text), function(i) {
d <- TextReuseTextDocument(text = text[i],
tokenizer = tokenizer, ...,
hash_func = hash_func,
minhash_func = minhash_func,
keep_tokens = keep_tokens,
keep_text = keep_text,
skip_short = skip_short,
encoding = encoding,
meta = list(id = names(text)[i],
tokenizer = tokenizer_name,
hash_func = hash_func_name,
minhash_func = minhash_func_name))
if (progress) setTxtProgressBar(pb, i)
d
})
if (progress) close(pb)
names(docs) <- names(text)
} else {
if (missing(paths) && !is.null(dir)) {
assert_that(is.dir(dir))
paths <- Sys.glob(str_c(dir, "/*"))
}
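# Fail early if any path is unreadable; a bare vapply() result would be discarded
assert_that(all(vapply(paths, is.readable, logical(1), USE.NAMES = FALSE)))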
if (progress) {
len <- length(paths)
message(loading_msg, prettyNum(len, big.mark = ","), " documents.")
if (using_parallel())
progress <- FALSE
else
pb <- txtProgressBar(min = 0, max = len, style = 3)
}
docs <- apply_func(seq_along(paths), function(i) {
d <- TextReuseTextDocument(file = paths[i], tokenizer = tokenizer, ...,
hash_func = hash_func,
minhash_func = minhash_func,
keep_tokens = keep_tokens,
keep_text = keep_text,
skip_short = skip_short,
encoding = encoding,
meta = list(tokenizer = tokenizer_name,
hash_func = hash_func_name,
minhash_func = minhash_func_name))
if (progress) setTxtProgressBar(pb, i)
d
})
if (progress) close(pb)
names(docs) <- filenames(paths)
}
skipped <- character()
# Filter documents that were skipped because they were too short
if (skip_short) {
skipped_docs <- vapply(docs, is.null, logical(1))
skipped <- names(docs)[skipped_docs]
docs <- docs[!skipped_docs]
if (length(skipped) > 0)
warning("Skipped ", length(skipped), " documents that were too short. ",
"Use `skipped()` to get their IDs.")
}
assert_that(is.list(meta))
meta$tokenizer <- tokenizer_name
meta$hash_func <- hash_func_name
meta$minhash_func <- minhash_func_name
if (!is.null(names(meta))) meta <- sort_meta(meta)
corpus <- list(documents = docs, meta = meta)
class(corpus) <- c("TextReuseCorpus", "Corpus")
attr(corpus, "skipped") <- skipped
corpus
}
#' @export
meta.TextReuseCorpus <- function(x, tag = NULL, ...) {
if (is.null(tag))
x$meta
else
x$meta[[tag]]
}
#' @export
`meta<-.TextReuseCorpus` <- function(x, tag = NULL, ..., value) {
if (is.null(tag)) {
assert_that(is.list(value))
x$meta <- value
} else {
x$meta[[tag]] <- value
}
x
}
#' @export
print.TextReuseCorpus <- function(x, ...) {
cat("TextReuseCorpus\n")
cat("Number of documents:", length(x), "\n")
pretty_print_metadata(x)
}
#' @export
length.TextReuseCorpus <- function(x) {
length(x$documents)
}
#' @export
`[.TextReuseCorpus` <- function(x, i) {
x$documents <- x$documents[i]
x
}
#' @export
`[[.TextReuseCorpus` <- function(x, i) {
x$documents[[i]]
}
#' @export
names.TextReuseCorpus <- function(x) {
names(x$documents)
}
#' @export
`names<-.TextReuseCorpus` <- function(x, value) {
names(x$documents) <- value
x
}
#' @param x An R object to check.
#' @export
#' @rdname TextReuseCorpus
is.TextReuseCorpus <- function(x) {
inherits(x, "TextReuseCorpus")
}
#' @export
#' @rdname TextReuseCorpus
skipped <- function(x) {
assert_that(is.TextReuseCorpus(x))
attr(x, "skipped", exact = TRUE)
}
================================================
FILE: R/TextReuseTextDocument.R
================================================
#' TextReuseTextDocument
#'
#' This is the constructor function for \code{TextReuseTextDocument} objects.
#' This class is used for comparing documents.
#'
#' @param text A character vector containing the text of the document. This
#' argument can be skipped if supplying \code{file}.
#' @param file The path to a text file, if \code{text} is not provided.
#' @param meta A list with named elements for the metadata associated with this
#' document. If a document is created using the \code{text} parameter, then
#' you must provide an \code{id} field, e.g., \code{meta = list(id =
#' "my_id")}. If the document is created using \code{file}, then the ID will
#' be created from the file name.
#' @param tokenizer A function to split the text into tokens. See
#' \code{\link{tokenizers}}. If value is \code{NULL}, then tokenizing and
#' hashing will be skipped.
#' @param ... Arguments passed on to the \code{tokenizer}.
#' @param hash_func A function to hash the tokens. See
#' \code{\link{hash_string}}.
#' @param minhash_func A function to create minhash signatures of the document.
#' See \code{\link{minhash_generator}}.
#' @param keep_tokens Should the tokens be saved in the document that is
#' returned or discarded?
#' @param keep_text Should the text be saved in the document that is returned or
#' discarded?
#' @param skip_short Should short documents be skipped? (See details.)
#' @param encoding Encoding to be used when reading files.
#'
#' @details This constructor function follows a three-step process. It reads in
#' the text, either from a file or from memory. It then tokenizes that text.
#' Then it hashes the tokens. Most of the comparison functions in this package
#' rely only on the hashes to make the comparison. By passing \code{FALSE} to
#' \code{keep_tokens} and \code{keep_text}, you can avoid saving those
#' objects, which can result in significant memory savings for large corpora.
#'
#' If \code{skip_short = TRUE}, this function will return \code{NULL} for very
#' short or empty documents. A very short document is one where there are too
#' few words to create at least two n-grams. For example, if five-grams are
#' desired, then a document must be at least six words long. If no value of
#' \code{n} is provided, then the function assumes a value of \code{n = 3}. A
#' warning will be printed with the document ID of a skipped document.
#'
#' @return An object of class \code{TextReuseTextDocument}. This object inherits
#' from the virtual S3 class \code{\link[NLP]{TextDocument}} in the NLP
#' package. It contains the following elements: \describe{ \item{content}{The
#' text of the document.} \item{tokens}{The tokens created from the text.}
#' \item{hashes}{Hashes created from the tokens.} \item{minhashes}{The minhash
#' signature of the document.} \item{metadata}{The document metadata,
#' including the filename (if any) in \code{file}.} }
#'
#' @seealso \link[=TextReuseTextDocument-accessors]{Accessors for TextReuse
#' objects}.
#'
#' @examples
#' file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
#' doc <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"))
#' print(doc)
#' meta(doc)
#' head(tokens(doc))
#' head(hashes(doc))
#' \dontrun{
#' content(doc)
#' }
#' @export
TextReuseTextDocument <- function(text, file = NULL, meta = list(),
tokenizer = tokenize_ngrams, ...,
hash_func = hash_string,
minhash_func = NULL,
keep_tokens = FALSE,
keep_text = TRUE,
skip_short = TRUE,
encoding = "unknown") {
if (!missing(text)) assert_that(has_id(meta))
if (!is.null(file)) {
assert_that(missing(text),
is.readable(file))
text <- as_string(readLines(file, encoding = encoding))
}
assert_that(is.character(text))
text <- as_string(text)
# Define document ID early
document_id <- ifelse(is.null(meta$id), filenames(file), meta$id)
# Check length of document
if (skip_short) {
min_words <- short_document_word_minimum(tokenizer, list(...))
if (wordcount(text) < min_words) {
warning("Skipping document with ID '", document_id,
"' because it has too few words ",
"to create tokens with the requested tokenizer.",
call. = FALSE, noBreaks. = TRUE)
return(NULL)
}
}
# Tokenize and hash
if (!is.null(tokenizer)) {
assert_that(is.function(tokenizer))
tokenizer_name <- as.character(substitute(tokenizer))
tokens <- tokenizer(text, ...)
assert_that(is.function(hash_func))
hash_func_name <- as.character(substitute(hash_func))
hashes <- hash_func(tokens)
# Also minhash if requested
if (!is.null(minhash_func)) {
assert_that(is.function(minhash_func))
minhash_func_name <- as.character(substitute(minhash_func))
minhashes <- minhash_func(tokens)
} else {
minhashes <- NULL
minhash_func_name <- NULL
}
} else {
tokens <- NULL
hashes <- NULL
minhashes <- NULL
tokenizer_name <- NULL
hash_func_name <- NULL
minhash_func_name <- NULL
}
if (!keep_tokens) tokens <- NULL
if (!keep_text) text <- NULL
if (missing(meta)) {
meta <- list(file = file,
id = document_id,
tokenizer = tokenizer_name,
hash_func = hash_func_name,
minhash_func = minhash_func_name)
}
assert_that(is.list(meta))
if (!is.null(file)) {
meta$file <- file
meta$id <- document_id
}
# Don't overwrite these when called from TextReuseCorpus
if (is.null(meta$tokenizer) && is.null(meta$hash_func) &&
is.null(meta$minhash_func)) {
meta$tokenizer <- tokenizer_name
meta$hash_func <- hash_func_name
meta$minhash_func <- minhash_func_name
}
meta <- sort_meta(meta)
doc <- list(
content = text,
tokens = tokens,
hashes = hashes,
minhashes = minhashes,
meta = meta
)
class(doc) <- c("TextReuseTextDocument", "TextDocument")
doc
}
short_document_word_minimum <- function(tokenizer, args) {
n <- args$n
if (is.null(n)) n <- 3
if (!is.null(tokenizer) && identical(tokenizer, tokenize_skip_ngrams)) {
k <- args$k
if (is.null(k)) k <- 1
return(n + n * k - k)
}
n + 1
}
#' @importFrom NLP meta
#' @export
NLP::meta
#' @importFrom NLP meta<-
#' @export
NLP::`meta<-`
#' @importFrom NLP content
#' @export
NLP::content
#' @importFrom NLP content<-
#' @export
NLP::`content<-`
#' @export
print.TextReuseTextDocument <- function(x, ...) {
cat("TextReuseTextDocument\n")
pretty_print_metadata(x)
cat("content", ":", str_sub(x$content, end = 200))
invisible(x)
}
#' @export
as.character.TextReuseTextDocument <- function(x, ...) {
as.character(x$content)
}
#' @export
#' @method content TextReuseTextDocument
content.TextReuseTextDocument <- function(x) {
x$content
}
#' @export
#' @method content<- TextReuseTextDocument
`content<-.TextReuseTextDocument` <- function(x, value) {
assert_that(is.character(value))
x$content <- value
x
}
#' @export
#' @method meta TextReuseTextDocument
meta.TextReuseTextDocument <- function(x, tag = NULL, ...) {
if (is.null(tag))
x$meta
else
x$meta[[tag]]
}
#' @export
#' @method meta<- TextReuseTextDocument
`meta<-.TextReuseTextDocument` <- function(x, tag = NULL, ..., value) {
if (is.null(tag)) {
assert_that(is.list(value))
x$meta <- value
} else {
x$meta[[tag]] <- value
}
x
}
#' Accessors for TextReuse objects
#'
#' Accessor functions to read and write components of
#' \code{\link{TextReuseTextDocument}} and \code{\link{TextReuseCorpus}}
#' objects.
#' @name TextReuseTextDocument-accessors
#' @param x The object to access.
#' @param value The value to assign.
#' @return Either a vector or a named list of vectors.
NULL
#' @export
#' @rdname TextReuseTextDocument-accessors
tokens <- function(x) UseMethod("tokens", x)
#' @export
tokens.TextReuseTextDocument <- function(x) x$tokens
#' @export
tokens.TextReuseCorpus <- function(x) {
corpus_names <- names(x)
l <- lapply(x$documents, function(i) tokens(i))
names(l) <- corpus_names
l
}
#' @export
#' @rdname TextReuseTextDocument-accessors
`tokens<-` <- function(x, value) UseMethod("tokens<-", x)
#' @export
`tokens<-.TextReuseTextDocument` <- function(x, value) {
x$tokens <- value
x
}
#' @export
#' @rdname TextReuseTextDocument-accessors
hashes <- function(x) UseMethod("hashes", x)
#' @export
hashes.TextReuseTextDocument <- function(x) x$hashes
#' @export
hashes.TextReuseCorpus <- function(x) {
corpus_names <- names(x)
l <- lapply(x$documents, function(i) hashes(i))
names(l) <- corpus_names
l
}
#' @export
#' @rdname TextReuseTextDocument-accessors
`hashes<-` <- function(x, value) UseMethod("hashes<-", x)
#' @export
`hashes<-.TextReuseTextDocument` <- function(x, value) {
x$hashes <- value
x
}
#' @export
#' @rdname TextReuseTextDocument-accessors
minhashes <- function(x) UseMethod("minhashes", x)
#' @export
minhashes.TextReuseTextDocument <- function(x) x$minhashes
#' @export
minhashes.TextReuseCorpus <- function(x) {
corpus_names <- names(x)
l <- lapply(x$documents, function(i) minhashes(i))
names(l) <- corpus_names
l
}
#' @export
#' @rdname TextReuseTextDocument-accessors
`minhashes<-` <- function(x, value) UseMethod("minhashes<-", x)
#' @export
`minhashes<-.TextReuseTextDocument` <- function(x, value) {
x$minhashes <- value
x
}
#' @param x An R object to check.
#' @export
#' @rdname TextReuseTextDocument
is.TextReuseTextDocument <- function(x) {
inherits(x, "TextReuseTextDocument")
}
#' @export
#' @rdname TextReuseTextDocument
has_content <- function(x) {
assert_that(is.TextReuseTextDocument(x))
!is.null(x$content)
}
assertthat::on_failure(has_content) <- function(call, env) {
"Document does not have text in its content field."
}
#' @export
#' @rdname TextReuseTextDocument
has_tokens <- function(x) {
assert_that(is.TextReuseTextDocument(x))
!is.null(x$tokens)
}
assertthat::on_failure(has_tokens) <- function(call, env) {
"Document does not have tokens."
}
#' @export
#' @rdname TextReuseTextDocument
has_hashes <- function(x) {
assert_that(is.TextReuseTextDocument(x))
!is.null(x$hashes)
}
assertthat::on_failure(has_hashes) <- function(call, env) {
"Document does not have hashes."
}
#' @export
#' @rdname TextReuseTextDocument
has_minhashes <- function(x) {
assert_that(is.TextReuseTextDocument(x))
!is.null(x$minhashes)
}
assertthat::on_failure(has_minhashes) <- function(call, env) {
"Document does not have a minhash signature."
}
has_minhashes_corpus <- function(x) {
assert_that(is.TextReuseCorpus(x))
all(vapply(minhashes(x), Negate(is.null), logical(1)))
}
assertthat::on_failure(has_minhashes_corpus) <- function(call, env) {
"Some documents in the corpus do not have a minhash signature."
}
================================================
FILE: R/align_local.R
================================================
#' Local alignment of natural language texts
#'
#' This function takes two texts, either as strings or as
#' \code{TextReuseTextDocument} objects, and finds the optimal local alignment
#' of those texts. A local alignment finds the best matching subset of the two
#' documents. This function adapts the
#' \href{https://en.wikipedia.org/wiki/Smith-Waterman_algorithm}{Smith-Waterman
#' algorithm}, used for genetic sequencing, for use with natural language. It
#' compares the texts word by word (the comparison is case-insensitive) and
#' scores them according to a set of parameters. These parameters define the
#' score for a \code{match}, and the penalties for a \code{mismatch} and for
#' opening a \code{gap} (i.e., the first mismatch in a potential sequence). The
#' function then reports the optimal local alignment. Only the subset of the
#' documents that is a match is included. Insertions or deletions in the text
#' are reported with the \code{edit_mark} character.
#'
#' @param a A character vector of length one, or a
#' \code{\link{TextReuseTextDocument}}.
#' @param b A character vector of length one, or a
#' \code{\link{TextReuseTextDocument}}.
#' @param match The score to assign a matching word. Should be a positive
#' integer.
#' @param mismatch The score to assign a mismatching word. Should be a negative
#' integer or zero.
#' @param gap The penalty for opening a gap in the sequence. Should be a
#' negative integer or zero.
#' @param edit_mark A single character used for displaying
#' insertions/deletions in the documents.
#' @param preserve_punctuation Preserve punctuation in the displayed alignment.
#' The alignment still compares tokens after stripping punctuation.
#' @param progress Display a progress bar and messages while computing the
#' alignment.
#'
#' @return A list with the class \code{textreuse_alignment}. This list contains
#' several elements: \itemize{ \item \code{a_edits} and \code{b_edits}:
#' Character vectors of the sequences with edits marked. \item \code{score}:
#' The score of the optimal alignment. }
#'
#' @details
#'
#' The compute time of this function is proportional to the product of the
#' lengths of the two documents. Thus, longer documents will take considerably
#' more time to compute. This function has been tested with pairs of documents
#' containing about 25 thousand words each.
#'
#' If the function reports that there were multiple optimal alignments, then it
#' is likely that there is no strong match in the document.
#'
#' The score reported for the local alignment depends on the size of the
#' documents, on the strength of the match, and on the parameters for match,
#' mismatch, and gap penalties. Scores from different pairs of documents or
#' from different parameter settings are therefore not directly comparable.
#'
#' @references For a useful description of the algorithm, see
#' \href{http://etherealbits.com/2013/04/string-alignment-dynamic-programming-dna/}{this
#' post}. For the application of the Smith-Waterman algorithm to natural
#' language, see David A. Smith, Ryan Cordell, and Elizabeth Maddock Dillon,
#' "Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers,"
#' IEEE International Conference on Big Data, 2013.
#'
#' @examples
#' align_local("The answer is blowin' in the wind.",
#' "As the Bob Dylan song says, the answer is blowing in the wind.")
#'
#' # Example of matching documents from a corpus
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir, progress = FALSE)
#' alignment <- align_local(corpus[["ca1851-match"]], corpus[["ny1850-match"]])
#' str(alignment)
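#'
#' # A minimal sketch of adjusting the scoring parameters. The values below are
#' # illustrative assumptions, not recommendations: a heavier mismatch and gap
#' # penalty tends to produce shorter, stricter alignments, and `edit_mark`
#' # changes the character used to display insertions and deletions.
#' strict <- align_local("The answer is blowin' in the wind.",
#' "As the Bob Dylan song says, the answer is blowing in the wind.",
#' match = 2L, mismatch = -2L, gap = -2L, edit_mark = "*")
#' strict$score
#' strict$a_edits
#' strict$b_edits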
#'
#' @export
align_local <- function(a, b, match = 2L, mismatch = -1L, gap = -1L,
edit_mark = "#", preserve_punctuation = FALSE,
progress = interactive()) {
assert_that(identical(class(a), class(b)))
UseMethod("align_local", a)
}
#' @export
align_local.TextReuseTextDocument <- function(a, b, match = 2L, mismatch = -1L,
gap = -1L, edit_mark = "#",
preserve_punctuation = FALSE,
progress = interactive()) {
align_local(content(a), content(b), match = match, mismatch = mismatch,
gap = gap, edit_mark = edit_mark,
preserve_punctuation = preserve_punctuation,
progress = progress)
}
#' @export
align_local.default <- function(a, b, match = 2L, mismatch = -1L, gap = -1L,
edit_mark = "#", preserve_punctuation = FALSE,
progress = interactive()) {
assert_that(is.string(a),
is.string(b),
is_integer_like(match),
is_integer_like(mismatch),
is_integer_like(gap),
is.string(edit_mark),
is.flag(preserve_punctuation))
if (match <= 0 || mismatch > 0 || gap > 0 || !(str_length(edit_mark) == 1)) {
stop("The scoring parameters should have the following characteristics:\n",
" - `match` should be a positive integer\n",
" - `mismatch` should be a negative integer or zero\n",
" - `gap` should be a negative integer or zero\n",
" - `edit_mark` should be a single character\n")
}
# Keep everything as integers because IntegerMatrix saves memory
match <- as.integer(match)
mismatch <- as.integer(mismatch)
gap <- as.integer(gap)
# Prepare the character vectors. Tokenize to words to compare word by word.
# Use all lower case for the comparison, but use original capitalization in
# the output.
a_orig <- align_tokens(a, preserve_punctuation = preserve_punctuation)
b_orig <- align_tokens(b, preserve_punctuation = preserve_punctuation)
a <- normalize_alignment_tokens(a_orig)
b <- normalize_alignment_tokens(b_orig)
# Only show a progress bar for long computations
n_rows <- length(b) + 1
n_cols <- length(a) + 1
if (n_rows * n_cols < 1e7) progress <- FALSE
# Create the integer matrix
if (progress) {
message("Preparing a matrix with ",
prettyNum(n_rows * n_cols, big.mark = ","),
" elements.")
}
m <- matrix(0L, n_rows, n_cols)
# Calculate the matrix of possible paths
if (progress) message("Computing the optimal local alignment.")
m <- sw_matrix(m, a, b, match, mismatch, gap, progress)
# Find the starting place in the matrix
alignment_score <- max(m)
if (alignment_score == 0) {
alignment <- list(a_edits = "", b_edits = "", score = alignment_score)
class(alignment) <- c("textreuse_alignment", "list")
return(alignment)
}
max_match <- which(m == alignment_score, arr.ind = TRUE, useNames = FALSE)
if (nrow(max_match) > 1) {
warning("Multiple optimal local alignments found; selecting only one of them.",
call. = FALSE)
}
if (progress) message("Extracting the local alignment.")
# Create output vectors which are as long as conceivably necessary
a_out <- vector(mode = "character", length = max(max_match))
b_out <- vector(mode = "character", length = max(max_match))
a_out[] <- NA_character_
b_out[] <- NA_character_
# Initialize counters for the matrix and the output vector
row_i <- max_match[1, 1]
col_i <- max_match[1, 2]
out_i <- 1L
# Place our first known values in the output vectors
b_out[out_i] <- b_orig[row_i - 1]
a_out[out_i] <- a_orig[col_i - 1]
out_i <- out_i + 1L # Advance the out vector position
# Begin moving up, left, or diagonally within the matrix till we hit a zero
while (m[row_i - 1, col_i - 1] != 0) {
# Values of the current cell, the cells up, left, diagonal, and the max
up <- m[row_i - 1, col_i]
left <- m[row_i, col_i - 1]
diagn <- m[row_i - 1, col_i - 1]
max_cell <- max(up, left, diagn)
# Move in the direction of the maximum cell. If there are ties, choose up
# first, then left, then diagonal. Privilege up and left because they
# preserve edits.
#
# In each case add the current words to the out vectors. For moves up and
# left there will be an insertion/deletion, so add a symbol like ####
# that is the same number of characters as the word in the other vector.
#
# Note that the index of the matrix is offset by one from character vectors
# a and b, so we use the row and column indices - 1. The column corresponds
# to `a` and the rows correspond to `b`.
if (up == max_cell) {
row_i <- row_i - 1
bword <- b_orig[row_i - 1]
b_out[out_i] <- bword
a_out[out_i] <- mark_chars(bword, edit_mark)
} else if (left == max_cell) {
col_i <- col_i - 1
aword <- a_orig[col_i - 1]
b_out[out_i] <- mark_chars(aword, edit_mark)
a_out[out_i] <- aword
} else if (diagn == max_cell) {
row_i <- row_i - 1
col_i <- col_i - 1
bword <- b_orig[row_i - 1]
aword <- a_orig[col_i - 1]
# Diagonals are a special case, because instead of an insertion or a
# deletion we might have a substitution of words. If that is the case,
# then treat it like a double insertion and deletion.
if (a[col_i - 1] == b[row_i - 1]) {
b_out[out_i] <- bword
a_out[out_i] <- aword
} else {
b_out[out_i] <- bword
a_out[out_i] <- mark_chars(bword, edit_mark)
out_i <- out_i + 1
b_out[out_i] <- mark_chars(aword, edit_mark)
a_out[out_i] <- aword
}
}
# Move forward one position in the out vectors, no matter which direction
# we moved
out_i <- out_i + 1
}
# Clean up the outputs
b_out <- str_c(rev(b_out[!is.na(b_out)]), collapse = " ")
a_out <- str_c(rev(a_out[!is.na(a_out)]), collapse = " ")
# Create the alignment object
alignment <- list(a_edits = a_out, b_edits = b_out, score = alignment_score)
class(alignment) <- c("textreuse_alignment", "list")
alignment
}
align_tokens <- function(x, preserve_punctuation) {
if (!preserve_punctuation) return(tokenize_words(x, lowercase = FALSE))
tokens <- str_split(str_squish(x), "\\s+")[[1]]
tokens[tokens != ""]
}
normalize_alignment_tokens <- function(x) {
str_to_lower(str_replace_all(x, "[[:punct:]]", ""))
}
#' @export
print.textreuse_alignment <- function(x, ...) {
cat("TextReuse alignment\n")
cat("Alignment score:", x$score, "\n")
cat("Document A:\n")
cat(str_wrap(x$a_edits, width = 72))
cat("\n\nDocument B:\n")
cat(str_wrap(x$b_edits, width = 72))
cat("\n\n")
invisible(x)
}
================================================
FILE: R/conversion-functions.R
================================================
#' Convert candidates data frames to other formats
#'
#' These functions convert a \code{textreuse_candidates} object to dense or
#' sparse matrices.
#'
#' @param x An object of class \code{\link[=lsh_compare]{textreuse_candidates}}.
#' @param ... Additional arguments.
#'
#' @return A similarity matrix with row and column names containing document IDs.
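#'
#' @examples
#' # A minimal sketch, assuming the legal corpus bundled with the package and
#' # candidates built with pairwise_compare() and pairwise_candidates(); any
#' # other textreuse_candidates object should work the same way.
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir, progress = FALSE)
#' m <- pairwise_compare(corpus, jaccard_similarity, progress = FALSE)
#' candidates <- pairwise_candidates(m)
#'
#' # Dense matrix: symmetric, with 1 on the diagonal
#' dense <- as.matrix(candidates)
#' dense
#'
#' # Sparse equivalent, which avoids storing the zero cells
#' as_sparse_matrix(candidates)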
#'
#' @export
#' @method as.matrix textreuse_candidates
as.matrix.textreuse_candidates <- function(x, ...) {
docs <- candidate_doc_ids(x)
n <- length(docs)
m <- matrix(0, n, n)
rownames(m) <- docs
colnames(m) <- docs
diag(m) <- 1.0
for (r in seq_len(nrow(x))) {
a <- x$a[r]
b <- x$b[r]
score <- x$score[r]
m[a, b] <- score
m[b, a] <- score
}
m
}
#' @rdname as.matrix.textreuse_candidates
#' @export
as_sparse_matrix <- function(x) {
assert_that(is_candidates_df(x))
docs <- candidate_doc_ids(x)
n <- length(docs)
doc_ids <- stats::setNames(seq_along(docs), docs)
rows <- seq_len(n)
cols <- seq_len(n)
values <- rep(1.0, n)
if (nrow(x) > 0) {
a <- unname(doc_ids[x$a])
b <- unname(doc_ids[x$b])
rows <- c(rows, a, b)
cols <- c(cols, b, a)
values <- c(values, x$score, x$score)
}
Matrix::sparseMatrix(i = rows, j = cols, x = values, dims = c(n, n),
use.last.ij = TRUE,
dimnames = list(docs, docs))
}
candidate_doc_ids <- function(x) {
all_doc_ids <- attr(x, "all-doc-ids")
if (is.null(all_doc_ids)) {
all_doc_ids <- c(x$a, x$b)
}
sort(unique(all_doc_ids))
}
================================================
FILE: R/filenames.R
================================================
#' Filenames from paths
#'
#' This function takes a character vector of paths and returns just the file
#' name, by default without the extension. A \code{\link{TextReuseCorpus}} uses
#' the paths to the files in the corpus as the names of the list. This function
#' is intended to turn those paths into more manageable identifiers.
#'
#' @param paths A character vector of paths.
#' @param extension Should the file extension be preserved?
#' @seealso \code{\link{basename}}
#' @examples
#' paths <- c("corpus/one.txt", "corpus/two.md", "corpus/three.text")
#' filenames(paths)
#' filenames(paths, extension = TRUE)
#' @export
filenames <- function(paths, extension = FALSE) {
assert_that(is.character(paths))
f <- basename(paths)
if (extension)
return(f)
else
str_replace(f, "\\.[:alpha:]{1,}$", "")
}
================================================
FILE: R/lsh.R
================================================
#'Locality sensitive hashing for minhash
#'
#'Locality sensitive hashing (LSH) discovers potential matches among a corpus of
#'documents quickly, so that only likely pairs can be compared.
#'
#'@details Locality sensitive hashing is a technique for detecting document
#' similarity that does not require pairwise comparisons. When comparing pairs
#' of documents, the number of pairs grows rapidly, so that only the smallest
#' corpora can be compared pairwise in a reasonable amount of computation time.
#' Locality sensitive hashing, on the other hand, takes a document which has
#' been tokenized and hashed using a minhash algorithm. (See
#' \code{\link{minhash_generator}}.) Each set of minhash signatures is then
#' broken into bands comprised of a certain number of rows. (For example, 200
#' minhash signatures might be broken down into 20 bands each containing 10
#' rows.) Each band is then hashed to a bucket. Documents with identical rows
#' in a band will be hashed to the same bucket. The likelihood that a pair of
#' documents will be marked as a potential match increases with the number of
#' bands and decreases with the number of rows in each band.
#'
#' This function returns a data frame with the additional class
#' \code{lsh_buckets}. The LSH technique only requires that the signatures for
#' each document be calculated once. So it is possible, as long as one uses the
#' same minhash function and the same number of bands, to combine the outputs
#' from this function at different times. The output can thus be treated as a
#' kind of cache of LSH signatures.
#'
#' To extract pairs of documents from the output of this function, see
#' \code{\link{lsh_candidates}}.
#'
#'@param x A \code{\link{TextReuseCorpus}} or
#' \code{\link{TextReuseTextDocument}}.
#'@param bands The number of bands to use for locality sensitive hashing. The
#' number of hashes in the documents in the corpus must be evenly divisible by
#' the number of bands. See \code{\link{lsh_threshold}} and
#' \code{\link{lsh_probability}} for guidance in selecting the number of bands
#' and hashes.
#'@param progress Display a progress bar while comparing documents.
#'
#'@return A data frame (with the additional class \code{lsh_buckets}),
#' containing a column with the document IDs and a column with their LSH
#' signatures, or buckets.
#'
#'@references Jure Leskovec, Anand Rajaraman, and Jeff Ullman,
#' \emph{Mining of Massive Datasets} (Cambridge University Press, 2011), ch.
#' 3. See also Matthew Casperson,
#' "\href{http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html}{Minhash
#' for Dummies}" (November 14, 2013).
#'
#'@seealso \code{\link{minhash_generator}}, \code{\link{lsh_add}},
#' \code{\link{lsh_candidates}}, \code{\link{lsh_query}},
#' \code{\link{lsh_probability}},
#' \code{\link{lsh_threshold}}
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 235)
#' corpus <- TextReuseCorpus(dir = dir,
#' tokenizer = tokenize_ngrams, n = 5,
#' minhash_func = minhash)
#' buckets <- lsh(corpus, bands = 50)
#' buckets
#'@export
lsh <- function(x, bands, progress = interactive()) {
UseMethod("lsh", x)
}
#' Add documents to an LSH cache
#'
#' This function adds buckets for one or more new documents to an existing
#' \code{lsh_buckets} object. Use the same \code{bands} value and minhash
#' function that were used to create the original buckets.
#'
#' @param buckets An \code{lsh_buckets} object created by \code{\link{lsh}}.
#' @param x A \code{\link{TextReuseCorpus}} or
#' \code{\link{TextReuseTextDocument}} with minhashes.
#' @inheritParams lsh
#' @return An updated \code{lsh_buckets} object.
#' @seealso \code{\link{lsh}}, \code{\link{lsh_query}},
#' \code{\link{lsh_candidates}}
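#' @examples
#' # A minimal sketch, assuming the legal corpus bundled with the package: build
#' # a cache from two documents, then add a third one later. The same minhash
#' # function and the same number of bands are used for both steps.
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 235)
#' corpus <- TextReuseCorpus(dir = dir,
#' tokenizer = tokenize_ngrams, n = 5,
#' minhash_func = minhash, progress = FALSE)
#' buckets <- lsh(corpus[c("ca1851-match", "ca1851-nomatch")], bands = 50,
#' progress = FALSE)
#' buckets <- lsh_add(buckets, corpus[["ny1850-match"]], bands = 50,
#' progress = FALSE)
#' lsh_query(buckets, "ny1850-match")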
#' @export
lsh_add <- function(buckets, x, bands, progress = interactive()) {
assert_that(is_lsh_buckets(buckets))
new_buckets <- lsh(x, bands = bands, progress = progress)
new_doc_ids <- unique(new_buckets$doc)
buckets <- buckets %>%
dplyr::filter(!.data$doc %in% new_doc_ids) %>%
dplyr::bind_rows(new_buckets) %>%
dplyr::arrange(.data$doc)
class(buckets) <- c("lsh_buckets", setdiff(class(buckets), "lsh_buckets"))
buckets
}
#' @export
lsh.TextReuseCorpus <- function(x, bands, progress = interactive()) {
assert_that(is.count(bands),
has_minhashes_corpus(x))
h <- length(minhashes(x[[1]])) # number of hashes
d <- length(x) # number of documents
r <- h / bands # number of rows
assert_that(check_banding(h, bands))
# To assign rows in data frame to bands
b_assign <- tibble::tibble(band =
rep(vapply(1:bands, function(i) rep(i, r), integer(r)), d)
)
all_minhashes <- minhashes(x)
col_names <- names(all_minhashes)
buckets <- all_minhashes %>%
tibble::as_tibble() %>%
tidyr::gather("doc", "hash", col_names) %>%
dplyr::mutate(doc = as.character(.data$doc)) %>%
dplyr::bind_cols(b_assign) %>%
dplyr::group_by(.data$doc, .data$band)
rm(b_assign)
if (progress) {
message("Calculating LSH buckets")
pb <- txtProgressBar(min = 0, max = d * bands, style = 3)
}
# include the band in the signature hash to avoid false matches
buckets <- buckets %>%
dplyr::summarize(buckets = digest_progress(list(hash, unique(band)),
pb, progress))
if (progress) close(pb)
buckets <- buckets %>%
dplyr::select(-"band") %>%
dplyr::ungroup()
class(buckets) <- c("lsh_buckets", class(buckets))
buckets
}
# A wrapper around digest to be able to use the progress bar
digest_progress <- function(x, pb, progress) {
bucket <- digest::digest(x)
if (progress) setTxtProgressBar(pb, getTxtProgressBar(pb) + 1)
bucket
}
#' @export
lsh.TextReuseTextDocument <- function(x, bands, progress = interactive()) {
assert_that(is.count(bands),
has_minhashes(x))
all_minhashes <- minhashes(x)
h <- length(all_minhashes) # number of hashes
r <- h / bands # number of rows
assert_that(check_banding(h, bands))
# To assign rows in data frame to bands
b_assign <- tibble::tibble(band =
rep(vapply(1:bands, function(i) rep(i, r), integer(r)), 1)
)
buckets <- tibble::tibble(doc = x$meta$id, hash = all_minhashes) %>%
dplyr::bind_cols(b_assign) %>%
dplyr::group_by(.data$doc, .data$band) %>%
dplyr::summarize(buckets = digest::digest(list(hash, unique(band)))) %>%
dplyr::select(-"band") %>%
dplyr::ungroup()
class(buckets) <- c("lsh_buckets", class(buckets))
buckets
}
================================================
FILE: R/lsh_candidates.R
================================================
#' Candidate pairs from LSH comparisons
#'
#' Given a data frame of LSH buckets returned from \code{\link{lsh}}, this
#' function returns the potential candidates.
#'
#' @param buckets A data frame returned from \code{\link{lsh}}.
#'
#' @return A data frame of candidate pairs.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 234)
#' corpus <- TextReuseCorpus(dir = dir,
#' tokenizer = tokenize_ngrams, n = 5,
#' minhash_func = minhash)
#' buckets <- lsh(corpus, bands = 50)
#' lsh_candidates(buckets)
#'
#' @export
lsh_candidates <- function(buckets) {
assert_that(is_lsh_buckets(buckets))
candidates <- buckets %>%
dplyr::left_join(buckets, by = "buckets") %>%
dplyr::filter(.data$doc.x != .data$doc.y) %>%
dplyr::distinct(.data$doc.x, .data$doc.y) %>%
dplyr::arrange(.data$doc.x, .data$doc.y) %>%
dplyr::mutate(dn = pmin(.data$doc.x, .data$doc.y),
up = pmax(.data$doc.x, .data$doc.y)) %>%
dplyr::distinct(.data$up, .data$dn) %>%
dplyr::select(a = "dn", b = "up") %>%
dplyr::arrange(.data$a, .data$b) %>%
dplyr::mutate(score = NA_real_)
class(candidates) <- c("textreuse_candidates", class(candidates))
candidates
}
================================================
FILE: R/lsh_compare.R
================================================
#' Compare candidates identified by LSH
#'
#' The \code{\link{lsh_candidates}} function only identifies potential matches;
#' it cannot estimate the actual similarity of the documents. This function
#' takes a data frame returned by \code{\link{lsh_candidates}} and applies a
#' comparison function to each candidate pair of documents in the corpus,
#' thereby calculating the document similarity score. Note that since your
#' corpus will have minhash signatures rather than hashes for the tokens
#' themselves, you will probably wish to use \code{\link{tokenize}} to calculate
#' new hashes. This can be done for just the potentially similar documents. See
#' the package vignettes for details.
#'
#' @param candidates A data frame returned by \code{\link{lsh_candidates}}.
#' @param corpus The same \code{\link{TextReuseCorpus}} corpus which was used to generate the candidates.
#' @param f A comparison function such as \code{\link{jaccard_similarity}}.
#' @param progress Display a progress bar while comparing documents. Progress
#' bars are disabled when using parallel processing.
#' @return A data frame with values calculated for \code{score}.
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 234)
#' corpus <- TextReuseCorpus(dir = dir,
#' tokenizer = tokenize_ngrams, n = 5,
#' minhash_func = minhash)
#' buckets <- lsh(corpus, bands = 50)
#' candidates <- lsh_candidates(buckets)
#' lsh_compare(candidates, corpus, jaccard_similarity)
#' @export
lsh_compare <- function(candidates, corpus, f, progress = interactive()) {
assert_that(is_candidates_df(candidates),
is.function(f),
is.TextReuseCorpus(corpus))
rows_to_score <- which(is.na(candidates$score))
num_rows <- length(rows_to_score)
use_parallel <- using_parallel()
if (num_rows == 0) {
attr(candidates, "all-doc-ids") <- names(corpus)
return(candidates)
}
if (progress) {
message("Making ", prettyNum(num_rows, big.mark = ","),
" comparisons.")
if (!use_parallel) {
pb <- txtProgressBar(min = 0, max = num_rows, style = 3)
}
}
apply_fun <- get_apply_function()
scores <- apply_fun(seq_along(rows_to_score), function(j) {
i <- rows_to_score[j]
a <- candidates$a[i]
b <- candidates$b[i]
score <- f(corpus[[a]], corpus[[b]])
if (progress && !use_parallel) setTxtProgressBar(pb, j)
score
})
candidates$score[rows_to_score] <- unlist(scores, use.names = FALSE)
if (progress && !use_parallel) close(pb)
attr(candidates, "all-doc-ids") <- names(corpus)
candidates
}
================================================
FILE: R/lsh_probability.R
================================================
#' Probability that a candidate pair will be detected with LSH
#'
#' Functions to help choose the correct parameters for the \code{\link{lsh}} and
#' \code{\link{minhash_generator}} functions. Use \code{lsh_threshold} to
#' determine the minimum Jaccard similarity for two documents for them to likely
#' be considered a match. Use \code{lsh_probability} to determine the
#' probability that a pair of documents with a known Jaccard similarity will be
#' detected.
#'
#' @param h The number of minhash signatures.
#' @param b The number of LSH bands.
#' @param s The Jaccard similarity.
#' @details Locality sensitive hashing returns a list of possible matches for
#' similar documents. How likely is it that a pair of documents will be detected
#' as a possible match? If \code{h} is the number of minhash signatures,
#' \code{b} is the number of bands in the LSH function (implying then that the
#' number of rows \code{r = h / b}), and \code{s} is the actual Jaccard
#' similarity of the two documents, then the probability \code{p} that the two
#' documents will be marked as a candidate pair is given by this equation.
#'
#' \deqn{p = 1 - (1 - s^{r})^{b}}
#'
#' According to \href{http://infolab.stanford.edu/~ullman/mmds/book.pdf}{MMDS},
#' that equation approximates an S-curve. This implies that there is a threshold
#' (\code{t}) for \code{s} approximated by this equation.
#'
#' \deqn{t = \left(\frac{1}{b}\right)^{\frac{1}{r}}}
#'
#' @references Jure Leskovec, Anand Rajaraman, and Jeff Ullman,
#' \emph{Mining of Massive Datasets} (Cambridge University Press, 2011), ch. 3.
#' @examples
#' # Threshold for default values
#' lsh_threshold(h = 200, b = 40)
#'
#' # Probability for varying values of s
#' lsh_probability(h = 200, b = 40, s = .25)
#' lsh_probability(h = 200, b = 40, s = .50)
#' lsh_probability(h = 200, b = 40, s = .75)
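#'
#' # A minimal sketch of the S-curve for h = 200 and b = 40 (so r = 5). Because
#' # lsh_probability() checks that `s` is a single number, the similarities are
#' # evaluated one at a time. The probability should rise steeply around the
#' # value returned by lsh_threshold().
#' s <- c(0.25, 0.50, 0.75)
#' p <- vapply(s, function(si) lsh_probability(h = 200, b = 40, s = si),
#' numeric(1))
#' data.frame(similarity = s, probability = p)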
#' @export
lsh_probability <- function(h, b, s) {
assert_that(is.count(h),
is.count(b),
check_banding(h, b),
is.number(s))
1 - (1 - s ^ (h / b)) ^ b
}
#' @rdname lsh_probability
#' @export
lsh_threshold <- function(h, b) {
assert_that(is.count(h),
is.count(b),
check_banding(h, b))
(1 / b) ^ (1 / (h / b))
}
================================================
FILE: R/lsh_query.R
================================================
#' Query an LSH cache for matches to a single document
#'
#' This function retrieves the matches for a single document from an \code{lsh_buckets} object created by \code{\link{lsh}}. See \code{\link{lsh_candidates}} to retrieve all pairs of matches.
#'
#' @param buckets An \code{lsh_buckets} object created by \code{\link{lsh}}.
#' @param id The document ID to find matches for.
#'
#' @return A \code{textreuse_candidates} data frame with matches to the document specified.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 235)
#' corpus <- TextReuseCorpus(dir = dir,
#' tokenizer = tokenize_ngrams, n = 5,
#' minhash_func = minhash)
#' buckets <- lsh(corpus, bands = 50)
#' lsh_query(buckets, "ny1850-match")
#'
#' @seealso \code{\link{lsh}}, \code{\link{lsh_candidates}}
#' @export
lsh_query <- function(buckets, id) {
assert_that(is_lsh_buckets(buckets),
is.string(id))
signatures <- buckets %>%
dplyr::filter(.data$doc == id) %>%
`$`("buckets")
docs <- buckets %>%
dplyr::filter(.data$buckets %in% signatures) %>%
`$`("doc")
res <- tibble::tibble(a = id, b = docs, score = NA_real_) %>%
dplyr::filter(.data$a != .data$b) %>%
dplyr::distinct(.data$a, .data$b, .keep_all = TRUE)
class(res) <- c("textreuse_candidates", class(res))
res
}
================================================
FILE: R/lsh_subset.R
================================================
#' List of all candidates in a corpus
#'
#' @param candidates A data frame of candidate pairs from
#' \code{\link{lsh_candidates}}.
#' @return A character vector of document IDs from the candidate pairs, to be
#' used to subset the \code{\link{TextReuseCorpus}}.
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash <- minhash_generator(200, seed = 234)
#' corpus <- TextReuseCorpus(dir = dir,
#' tokenizer = tokenize_ngrams, n = 5,
#' minhash_func = minhash)
#' buckets <- lsh(corpus, bands = 50)
#' candidates <- lsh_candidates(buckets)
#' lsh_subset(candidates)
#' corpus[lsh_subset(candidates)]
#' @export
lsh_subset <- function(candidates) {
assert_that(is_candidates_df(candidates))
sort(unique(c(candidates$a, candidates$b)))
}
================================================
FILE: R/minhash.R
================================================
#' Generate a minhash function
#'
#' A minhash value is calculated by hashing the strings in a character vector to
#' integers and then selecting the minimum value. Repeated minhash values are
#' generated by using different hash functions: these different hash functions
#' are created by performing a bitwise \code{XOR} operation
#' (\code{\link{bitwXor}}) with a vector of random integers. Since it is vital
#' that the same random integers be used for each document, this function
#' generates another function which will always use the same integers. The
#' returned function is intended to be passed to the \code{hash_func} parameter
#' of \code{\link{TextReuseTextDocument}}.
#'
#' @param n The number of minhashes that the returned function should generate.
#' @param seed An optional parameter to set the seed used in generating the random
#' numbers to ensure that the same minhash function is used on repeated
#' applications.
#' @return A function which will take a character vector and return \code{n}
#' minhashes.
#' @references Jure Leskovec, Anand Rajaraman, and Jeff Ullman,
#' \emph{Mining of Massive Datasets} (Cambridge University Press, 2011), ch.
#' 3. See also Matthew Casperson,
#' "\href{http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html}{Minhash
#' for Dummies}" (November 14, 2013).
#' @seealso \code{\link{lsh}}
#' @examples
#' set.seed(253)
#' minhash <- minhash_generator(10)
#'
#' # Example with a TextReuseTextDocument
#' file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
#' doc <- TextReuseTextDocument(file = file, hash_func = minhash,
#' keep_tokens = TRUE)
#' hashes(doc)
#'
#' # Example with a character vector
#' is.character(tokens(doc))
#' minhash(tokens(doc))
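#'
#' # A minimal sketch of why minhashes are useful. The two token vectors and
#' # the seed below are made up for illustration. The share of positions where
#' # two signatures agree serves as a rough estimate of the Jaccard similarity
#' # of the underlying token sets; treat it as an approximation only.
#' tokens_a <- c("the", "quick", "brown", "fox", "jumps")
#' tokens_b <- c("the", "quick", "red", "fox", "jumps")
#' mh <- minhash_generator(200, seed = 12)
#' sig_a <- mh(tokens_a)
#' sig_b <- mh(tokens_b)
#' mean(sig_a == sig_b)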
#' @export
minhash_generator <- function(n = 200, seed = NULL) {
assert_that(is.count(n))
if (!is.null(seed)) set.seed(seed)
r <- random_ints(n)
f <- function(x) {
assert_that(is.character(x))
h <- hash_string(x)
vapply(r, function(i) { min(bitwXor(h, i)) },
integer(1), USE.NAMES = FALSE)
}
f
}
# Generate random integers for minhashing
#
# It is crucial that you use the same random integers for every document in the
# corpus. The random integers generated by this function are used internally
# by \code{\link{minhash_generator}}.
# @param n The number of random integers to generate.
# @return A vector of integers
# @seealso \code{\link{minhash_generator}}
# @examples
# random_ints(3)
random_ints <- function(n) {
as.integer(stats::runif(n, -2147483648, 2147483647))
}
================================================
FILE: R/pairwise_candidates.R
================================================
#' Candidate pairs from pairwise comparisons
#'
#' Converts a comparison matrix generated by \code{\link{pairwise_compare}} into a
#' data frame of candidates for matches.
#'
#' @param m A matrix from \code{\link{pairwise_compare}}.
#' @param directional Should be set to the same value as in
#' \code{\link{pairwise_compare}}.
#' @return A data frame containing all the non-\code{NA} values from \code{m}.
#' Columns \code{a} and \code{b} are the IDs from the original corpus as
#' passed to the comparison function. Column \code{score} is the score
#' returned by the comparison function.
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir)
#'
#' m1 <- pairwise_compare(corpus, ratio_of_matches, directional = TRUE)
#' pairwise_candidates(m1, directional = TRUE)
#'
#' m2 <- pairwise_compare(corpus, jaccard_similarity)
#' pairwise_candidates(m2)
#' @export
pairwise_candidates <- function(m, directional = FALSE) {
assert_that(is.matrix(m))
matches <- which(!is.na(m))
indexes <- arrayInd(matches, dim(m))
score <- m[matches]
a <- rownames(m)[indexes[ , 1]]
b <- colnames(m)[indexes[ , 2]]
df <- data.frame(a = a, b = b, score = score, stringsAsFactors = FALSE)
if (!directional) df <- sort_df_by_rows(df)
df <- sort_df_by_columns(df)
class(df) <- c("textreuse_candidates", "tbl_df", "tbl", "data.frame")
df
}
================================================
FILE: R/pairwise_compare.R
================================================
#' Pairwise comparisons among documents in a corpus
#'
#' Given a \code{\link{TextReuseCorpus}} containing documents of class
#' \code{\link{TextReuseTextDocument}}, this function applies a comparison
#' function to every pairing of documents, and returns a matrix with the
#' comparison scores.
#'
#' @param corpus A \code{\link{TextReuseCorpus}}.
#' @param f The comparison function to apply to each pair of documents.
#' @param ... Additional arguments passed to \code{f}.
#' @param directional Some comparison functions are commutative, so that
#' \code{f(a, b) == f(b, a)} (e.g., \code{\link{jaccard_similarity}}). Other
#' functions are directional, so that \code{f(a, b)} may differ from
#' \code{f(b, a)} (e.g., \code{\link{ratio_of_matches}}, which measures how
#' much \code{b} borrows from \code{a}). If \code{directional} is \code{FALSE},
#' then only the minimum number of comparisons will be made, i.e., the upper
#' triangle of the matrix. If \code{directional} is \code{TRUE}, then both
#' directional comparisons will be measured. In no case, however, will
#' documents be compared to themselves, i.e., the diagonal of the matrix.
#' @param progress Display a progress bar while comparing documents.
#'
#' @return A square matrix with dimensions equal to the length of the corpus,
#' and row and column names set by the names of the documents in the corpus. A
#' value of \code{NA} in the matrix indicates that a comparison was not made.
#' For directional comparisons, the comparison reported is
#' \code{f(row, column)}.
#'
#' @seealso See these document comparison functions,
#' \code{\link{jaccard_similarity}}, \code{\link{ratio_of_matches}}.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir)
#' names(corpus) <- filenames(names(corpus))
#'
#' # A non-directional comparison
#' pairwise_compare(corpus, jaccard_similarity)
#'
#' # A directional comparison
#' pairwise_compare(corpus, ratio_of_matches, directional = TRUE)
#' @export
pairwise_compare <- function(corpus, f, ..., directional = FALSE,
progress = interactive()) {
assert_that(is.TextReuseCorpus(corpus),
is.function(f))
len <- length(corpus)
ids <- names(corpus)
m <- matrix(0, len, len, dimnames = list(ids, ids))
if (!directional)
m[lower.tri(m, diag = TRUE)] <- NA
else
diag(m) <- NA
if (progress) {
num_pairs <- sum(!is.na(m))
message("Making ", prettyNum(num_pairs, big.mark = ","), " comparisons.")
pb <- txtProgressBar(min = 0, max = num_pairs, style = 3)
}
for (i in seq_along(m)) {
if (is.na(m[i])) next
indexes <- arrayInd(i, dim(m))
# Forward the additional arguments in `...` to the comparison function.
m[indexes] <- f(corpus[[indexes[1]]], corpus[[indexes[2]]], ...)
if (progress) setTxtProgressBar(pb, getTxtProgressBar(pb) + 1)
}
if (progress) close(pb)
m
}
================================================
FILE: R/parallel.R
================================================
# Check if the option `mc.cores` has been set. If it has, return `mclapply`
# instead of `lapply`. But in no circumstances use `mclapply` on Windows.
using_parallel <- function() {
cores_set <- !is.null(getOption("mc.cores"))
windows <- .Platform$OS.type == "windows"
cores_set && !windows
}
get_apply_function <- function() {
if (using_parallel())
return(parallel::mclapply)
else
return(lapply)
}
================================================
FILE: R/rehash.R
================================================
#' Recompute the hashes for a document or corpus
#'
#' Given a \code{\link{TextReuseTextDocument}} or a
#' \code{\link{TextReuseCorpus}}, this function recomputes either the hashes or
#' the minhashes with the function specified. This implies that you have
#' retained the tokens with the \code{keep_tokens = TRUE} parameter.
#'
#' @param x A \code{\link{TextReuseTextDocument}} or
#' \code{\link{TextReuseCorpus}}.
#' @param func A function to either hash the tokens or to generate the minhash
#' signature. See \code{\link{hash_string}}, \code{\link{minhash_generator}}.
#' @param type Recompute the \code{hashes} or \code{minhashes}?
#'
#' @return The modified \code{\link{TextReuseTextDocument}} or
#' \code{\link{TextReuseCorpus}}.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' minhash1 <- minhash_generator(seed = 1)
#' corpus <- TextReuseCorpus(dir = dir, minhash_func = minhash1, keep_tokens = TRUE)
#' head(minhashes(corpus[[1]]))
#' minhash2 <- minhash_generator(seed = 2)
#' corpus <- rehash(corpus, minhash2, type = "minhashes")
#' head(minhashes(corpus[[2]]))
#'
#' @export
rehash <- function(x, func, type = c("hashes", "minhashes")) {
UseMethod("rehash", x)
}
#' @export
rehash.TextReuseTextDocument <- function(x, func,
type = c("hashes", "minhashes")) {
assert_that(has_tokens(x),
is.function(func))
type <- match.arg(type)
if (type == "hashes") {
x$hashes <- func(x$tokens)
x$meta$hash_func <- as.character(substitute(func))
} else if (type == "minhashes") {
x$minhashes <- func(x$tokens)
x$meta$minhash_func <- as.character(substitute(func))
}
x
}
#' @export
rehash.TextReuseCorpus <- function(x, func, type = c("hashes", "minhashes")) {
assert_that(is.function(func))
type <- match.arg(type)
apply_func <- get_apply_function()
x$documents <- apply_func(x$documents, rehash, func, type)
if (type == "hashes")
x$meta$hash_func <- as.character(substitute(func))
else if (type == "minhashes")
x$meta$minhash_func <- as.character(substitute(func))
x
}
================================================
FILE: R/similarity.R
================================================
#' Measure similarity/dissimilarity in documents
#'
#' A set of functions which take two sets or bags of words and measure their
#' similarity or dissimilarity.
#'
#' @details The functions \code{jaccard_similarity} and
#' \code{jaccard_dissimilarity} provide the Jaccard measures of similarity or
#' dissimilarity for two sets. The coefficients will be numbers between
#' \code{0} and \code{1}. For the similarity coefficient, the higher the
#' number the more similar the two sets are. When applied to two documents of
#' class \code{\link{TextReuseTextDocument}}, the hashes in those documents
#' are compared. But this function can be passed objects of any class accepted
#' by the set functions in base R. So it is possible, for instance, to pass
#' this function two character vectors comprised of word, line, sentence, or
#' paragraph tokens, or those character vectors hashed as integers.
#'
#' The Jaccard similarity coefficient is defined as follows:
#'
#' \deqn{J(A, B) = \frac{ | A \cap B | }{ | A \cup B | }}{ length(intersect(a,
#' b)) / length(union(a, b))}
#'
#' The Jaccard dissimilarity is simply
#'
#' \deqn{1 - J(A, B)}
#'
#' The function \code{jaccard_bag_similarity} treats \code{a} and \code{b} as
#' bags rather than sets, so that the result is a fraction where the numerator
#' is the sum of each matching element counted the minimum number of times it
#' appears in each bag, and the denominator is the sum of the lengths of both
#' bags. The maximum value for the Jaccard bag similarity is \code{0.5}.
#'
#' The function \code{ratio_of_matches} finds the ratio between the number of
#' items in \code{b} that are also in \code{a} and the total number of items
#' in \code{b}. Note that this similarity measure is directional: it measures
#' how much \code{b} borrows from \code{a}, but says nothing about how much
#' \code{a} borrows from \code{b}.
#'
#' The function \code{count_matches} returns the numerator used by
#' \code{ratio_of_matches}: the number of items in \code{b} also found in
#' \code{a}. The function \code{matching_tokens} returns those matching items
#' from \code{b}, preserving their order and duplicates.
#'
#' @param a The first set (or bag) to be compared. The origin bag for
#' directional comparisons.
#' @param b The second set (or bag) to be compared. The destination bag for
#' directional comparisons.
#'
#' @examples
#' jaccard_similarity(1:6, 3:10)
#' jaccard_dissimilarity(1:6, 3:10)
#'
#' a <- c("a", "a", "a", "b")
#' b <- c("a", "a", "b", "b", "c")
#' jaccard_similarity(a, b)
#' jaccard_bag_similarity(a, b)
#' ratio_of_matches(a, b)
#' ratio_of_matches(b, a)
#' count_matches(a, b)
#' matching_tokens(a, b)
#'
#' ny <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
#' ca_match <- system.file("extdata/legal/ca1851-match.txt", package = "textreuse")
#' ca_nomatch <- system.file("extdata/legal/ca1851-nomatch.txt", package = "textreuse")
#'
#' ny <- TextReuseTextDocument(file = ny,
#' meta = list(id = "ny"))
#' ca_match <- TextReuseTextDocument(file = ca_match,
#' meta = list(id = "ca_match"))
#' ca_nomatch <- TextReuseTextDocument(file = ca_nomatch,
#' meta = list(id = "ca_nomatch"))
#'
#' # These two should have higher similarity scores
#' jaccard_similarity(ny, ca_match)
#' ratio_of_matches(ny, ca_match)
#'
#' # These two should have lower similarity scores
#' jaccard_similarity(ny, ca_nomatch)
#' ratio_of_matches(ny, ca_nomatch)
#'
#' @references Jure Leskovec, Anand Rajaraman, and Jeff Ullman,
#' \emph{Mining of Massive Datasets} (Cambridge University Press, 2011).
#' @name similarity-functions
NULL
#' @rdname similarity-functions
#' @export
jaccard_similarity <- function(a, b) UseMethod("jaccard_similarity")
#' @export
jaccard_similarity.default <- function(a, b) {
assert_that(all(class(a) == class(b)))
length(intersect(a, b)) / length(union(a, b))
}
#' @export
jaccard_similarity.TextReuseTextDocument <- function(a, b) {
assert_that(all(class(a) == class(b)))
jaccard_similarity(a$hashes, b$hashes)
}
#' @rdname similarity-functions
#' @export
jaccard_dissimilarity <- function(a, b) UseMethod("jaccard_dissimilarity")
#' @export
jaccard_dissimilarity.default <- function(a, b) {
1 - jaccard_similarity(a, b)
}
#' @rdname similarity-functions
#' @export
jaccard_bag_similarity <- function(a, b) UseMethod("jaccard_bag_similarity")
#' @export
jaccard_bag_similarity.default <- function(a, b) {
matches <- intersect(a, b)
counts <- vapply(matches, function(x) min(sum(x == a), sum(x == b)),
integer(1), USE.NAMES = FALSE)
denominator <- length(a) + length(b)
sum(counts) / denominator
}
#' @export
jaccard_bag_similarity.TextReuseTextDocument <- function(a, b) {
assert_that(all(class(a) == class(b)))
jaccard_bag_similarity(a$hashes, b$hashes)
}
#' @export
#' @rdname similarity-functions
ratio_of_matches <- function(a, b) UseMethod("ratio_of_matches")
#' @export
ratio_of_matches.default <- function(a, b) {
assert_that(all(class(a) == class(b)))
sum(b %in% a) / length(b)
}
#' @export
ratio_of_matches.TextReuseTextDocument <- function(a, b) {
assert_that(all(class(a) == class(b)))
ratio_of_matches(a$hashes, b$hashes)
}
#' @export
#' @rdname similarity-functions
count_matches <- function(a, b) UseMethod("count_matches")
#' @export
count_matches.default <- function(a, b) {
length(matching_tokens(a, b))
}
#' @export
count_matches.TextReuseTextDocument <- function(a, b) {
assert_that(all(class(a) == class(b)))
count_matches(a$hashes, b$hashes)
}
#' @export
#' @rdname similarity-functions
matching_tokens <- function(a, b) UseMethod("matching_tokens")
#' @export
matching_tokens.default <- function(a, b) {
assert_that(all(class(a) == class(b)))
b[b %in% a]
}
#' @export
matching_tokens.TextReuseTextDocument <- function(a, b) {
assert_that(all(class(a) == class(b)),
has_tokens(a),
has_tokens(b))
matching_tokens(a$tokens, b$tokens)
}
================================================
FILE: R/textreuse-package.r
================================================
#' @details
#' The best place to begin with this package is the introductory vignette.
#'
#' \code{vignette("textreuse-introduction", package = "textreuse")}
#'
#' After reading that vignette, the "pairwise", "minhash", and "alignment"
#' vignettes introduce specific paths for working with the package.
#'
#' \code{vignette("textreuse-pairwise", package = "textreuse")}
#'
#' \code{vignette("textreuse-minhash", package = "textreuse")}
#'
#' \code{vignette("textreuse-alignment", package = "textreuse")}
#'
#' Another good place to begin with the package is the documentation for loading
#' documents (\code{\link{TextReuseTextDocument}} and
#' \code{\link{TextReuseCorpus}}), for \link{tokenizers},
#' \link[=similarity-functions]{similarity functions}, and
#' \link[=lsh]{locality-sensitive hashing}.
#'
#' @references The sample data provided in the \code{extdata/ats} directory
#' contains nineteenth-century American Tract Society publications gathered
#' from the \href{https://archive.org/}{Internet Archive}.
#'
#' The sample data provided in the \code{extdata/legal} directory are taken
#' from the following nineteenth-century codes of civil procedure from
#' California and New York.
#'
#' \emph{Final Report of the Commissioners on Practice and Pleadings}, in 2
#' \emph{Documents of the Assembly of New York}, 73rd Sess., No. 16, (1850):
#' 243-250, sections 597-613.
#' \href{http://books.google.com/books?id=9HEbAQAAIAAJ&pg=PA243#v=onepage&q&f=false}{Google
#' Books}.
#'
#' \emph{An Act To Regulate Proceedings in Civil Cases}, 1851 \emph{California
#' Laws} 51, 51-53 sections 4-17; 101, sections 313-316.
#' \href{http://books.google.com/books?id=4PHEAAAAIAAJ&pg=PA51#v=onepage&q&f=false}{Google
#' Books}.
#'
#' @useDynLib textreuse, .registration = TRUE
#' @importFrom Rcpp sourceCpp
#' @import RcppProgress
#' @import stringr
#' @import assertthat
#' @importFrom utils getTxtProgressBar setTxtProgressBar txtProgressBar
"_PACKAGE"
if (getRversion() >= "2.15.1") {
utils::globalVariables(c("doc.x", "doc.y", "up", "dn", "a", "b", ".data", "band", "hash"))
}
================================================
FILE: R/token_index.R
================================================
#' Build an index of tokens and documents
#'
#' Build an inverted index from tokens to the documents that contain them. This
#' is useful for finding document pairs that share one or more n-grams without
#' comparing every document pair. The corpus must be created with
#' \code{keep_tokens = TRUE}.
#'
#' @param corpus A \code{\link{TextReuseCorpus}} with retained tokens.
#' @param min_doc_count Minimum number of documents a token must appear in to
#' be retained. Increase this to remove rare tokens.
#' @param max_doc_count Maximum number of documents a token may appear in to be
#' retained. Decrease this to remove very common tokens.
#' @return A \code{textreuse_token_index} data frame with columns \code{token},
#' \code{docs}, and \code{n_docs}.
#' @export
token_index <- function(corpus, min_doc_count = 2, max_doc_count = Inf) {
assert_that(is.TextReuseCorpus(corpus),
is.count(min_doc_count),
is.number(max_doc_count),
all(vapply(tokens(corpus), Negate(is.null), logical(1))))
entries <- lapply(names(corpus), function(doc_id) {
tibble::tibble(token = unique(tokens(corpus[[doc_id]])), doc = doc_id)
})
index <- dplyr::bind_rows(entries) %>%
dplyr::group_by(.data$token) %>%
dplyr::summarize(docs = list(sort(.data$doc)),
n_docs = dplyr::n(),
.groups = "drop") %>%
dplyr::filter(.data$n_docs >= min_doc_count,
.data$n_docs <= max_doc_count) %>%
dplyr::arrange(.data$token)
class(index) <- c("textreuse_token_index", class(index))
index
}
#' Extract candidate document pairs from a token index
#'
#' @param index A \code{textreuse_token_index} object returned by
#' \code{\link{token_index}}.
#' @return A \code{textreuse_candidates} data frame.
#' @export
token_index_candidates <- function(index) {
assert_that(inherits(index, "textreuse_token_index"))
pair_matrices <- lapply(index$docs, function(doc_ids) {
if (length(doc_ids) < 2) return(NULL)
t(utils::combn(doc_ids, 2))
})
pair_matrices <- Filter(Negate(is.null), pair_matrices)
if (length(pair_matrices) == 0) {
candidates <- tibble::tibble(a = character(), b = character(),
score = numeric())
} else {
candidates <- do.call(rbind, pair_matrices) %>%
as.data.frame(stringsAsFactors = FALSE) %>%
tibble::as_tibble() %>%
stats::setNames(c("a", "b")) %>%
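# Order each pair through temporary columns: overwriting `a` first would
# make the later pmax() call see the already-modified value of `a`.
dplyr::mutate(first = pmin(.data$a, .data$b),
second = pmax(.data$a, .data$b)) %>%
dplyr::transmute(a = .data$first, b = .data$second) %>%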
dplyr::distinct(.data$a, .data$b) %>%
dplyr::arrange(.data$a, .data$b) %>%
dplyr::mutate(score = NA_real_)
}
class(candidates) <- c("textreuse_candidates", class(candidates))
candidates
}
================================================
FILE: R/tokenize.R
================================================
#' Recompute the tokens for a document or corpus
#'
#' Given a \code{\link{TextReuseTextDocument}} or a
#' \code{\link{TextReuseCorpus}}, this function recomputes the tokens and hashes
#' with the functions specified. Optionally, it can also recompute the minhash signatures.
#'
#' @param x A \code{\link{TextReuseTextDocument}} or
#' \code{\link{TextReuseCorpus}}.
#' @param tokenizer A function to split the text into tokens. See
#' \code{\link{tokenizers}}.
#' @param ... Arguments passed on to the \code{tokenizer}.
#' @param hash_func A function to hash the tokens. See
#' \code{\link{hash_string}}.
#' @param minhash_func A function to create minhash signatures. See
#' \code{\link{minhash_generator}}.
#' @param keep_tokens Should the tokens be saved in the document that is
#' returned or discarded?
#' @param keep_text Should the text be saved in the document that is returned or
#' discarded?
#'
#' @return The modified \code{\link{TextReuseTextDocument}} or
#' \code{\link{TextReuseCorpus}}.
#'
#' @examples
#' dir <- system.file("extdata/legal", package = "textreuse")
#' corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
#' corpus <- tokenize(corpus, tokenize_ngrams)
#' head(tokens(corpus[[1]]))
#' @export
tokenize <- function(x, tokenizer, ..., hash_func = hash_string,
minhash_func = NULL, keep_tokens = TRUE,
keep_text = TRUE) {
UseMethod("tokenize", x)
}
#' @export
tokenize.TextReuseTextDocument <- function(x, tokenizer, ...,
hash_func = hash_string,
minhash_func = NULL,
keep_tokens = TRUE,
keep_text = TRUE) {
assert_that(has_content(x),
is.function(tokenizer),
is.function(hash_func))
x$tokens <- tokenizer(x$content, ...)
x$hashes <- hash_func(x$tokens)
x$meta$tokenizer <- as.character(substitute(tokenizer))
x$meta$hash_func <- as.character(substitute(hash_func))
if (!is.null(minhash_func)) {
x$minhashes <- minhash_func(x$tokens)
x$meta$minhash_func <- as.character(substitute(minhash_func))
} else {
# If tokens are redone, minhashes are invalid, so delete them if they are
# not also recomputed.
x$minhashes <- NULL
x$meta$minhash_func <- NULL
}
# Discard tokens and text only after the hashes and minhashes are computed.
if (!keep_tokens) x$tokens <- NULL
if (!keep_text) x$content <- NULL
x
}
#' @export
tokenize.TextReuseCorpus <- function(x, tokenizer, ..., hash_func = hash_string,
minhash_func = NULL, keep_tokens = TRUE,
keep_text = TRUE) {
apply_func <- get_apply_function()
x$documents <- apply_func(x$documents, tokenize, tokenizer, ...,
hash_func = hash_func, minhash_func = minhash_func,
keep_tokens = keep_tokens, keep_text = keep_text)
x$meta$tokenizer <- as.character(substitute(tokenizer))
x$meta$hash_func <- as.character(substitute(hash_func))
if (!is.null(minhash_func)) {
x$meta$minhash_func <- as.character(substitute(minhash_func))
} else {
x$meta$minhash_func <- NULL
}
x
}
================================================
FILE: R/tokenizers.R
================================================
#' Split texts into tokens
#'
#' These functions each turn a text into tokens. The \code{tokenize_ngrams}
#' function returns shingled n-grams.
#'
#' @name tokenizers
#' @param string A character vector of length 1 to be tokenized.
#' @param lowercase Should the tokens be made lower case?
#' @param n For n-gram tokenizers, the number of words in each n-gram.
#' @param k For the skip n-gram tokenizer, the maximum skip distance between
#' words. The function will compute all skip n-grams between \code{0} and
#' \code{k}.
#' @details These functions will strip all punctuation.
#' @return A character vector containing the tokens.
#' @examples
#' dylan <- "How many roads must a man walk down? The answer is blowin' in the wind."
#' tokenize_words(dylan)
#' tokenize_sentences(dylan)
#' tokenize_ngrams(dylan, n = 2)
#' tokenize_skip_ngrams(dylan, n = 3, k = 2)
NULL
#' @export
#' @rdname tokenizers
tokenize_words <- function(string, lowercase = TRUE) {
assert_that(assertthat::is.string(string))
out <- str_split(string, boundary("word"))[[1]]
if (lowercase) str_to_lower(out) else out
}
#' @export
#' @rdname tokenizers
tokenize_sentences <- function(string, lowercase = TRUE) {
assert_that(assertthat::is.string(string))
out <- str_split(string, boundary("sentence", skip_word_none = FALSE))[[1]]
out <- str_replace_all(out, "[[:punct:]]", " ")
out <- str_replace_all(out, "\\s+", " ")
out <- str_trim(out)
if (lowercase) str_to_lower(out) else out
}
#' @export
#' @rdname tokenizers
tokenize_ngrams <- function(string, lowercase = TRUE, n = 3) {
assert_that(is.count(n),
assertthat::is.string(string))
words <- tokenize_words(string, lowercase = lowercase)
# A text with exactly n words still yields one n-gram.
assert_that(n <= length(words))
shingle_ngrams(words, n = n)
}
#' @export
#' @rdname tokenizers
tokenize_skip_ngrams <- function(string, lowercase = TRUE, n = 3, k = 1) {
assert_that(is.count(n),
is.count(k) | k == 0,
assertthat::is.string(string))
words <- tokenize_words(string, lowercase = lowercase)
assert_that(n + n * k - k <= length(words))
skip_ngrams(words, n = n, k = k)
}
================================================
FILE: R/utils.R
================================================
# Take results of readLines and turn it into a character vector of length 1
as_string <- function(x) {
x %>%
str_c(collapse = "\n") %>%
NLP::as.String()
}
# Pretty print the metadata for a document
pretty_print_metadata <- function(doc) {
lapply(names(doc$meta), function(x) cat(x, ":", doc$meta[[x]], "\n"))
}
# Check whether the number of minhashes is evenly divisible by number of bands
check_banding <- function(h, b) {
h %% b == 0
}
assertthat::on_failure(check_banding) <- function(call, env) {
"The number of hashes must be evenly divisible by the number of bands."
}
# Sequences for subsetting by bands in minhash
band_seq <- function(l, b) {
assert_that(check_banding(l, b))
r <- l / b
starts <- seq.int(from = 1, to = l, by = r)
lapply(starts, function(n) seq.int(n, n + r - 1, 1))
}
# Test that meta exists and that it has an ID value
has_id <- function(meta) {
!is.null(meta$id)
}
assertthat::on_failure(has_id) <- function(call, env) {
paste("When creating a document from a string instead of a file, the `id`",
"field in the metadata list must be specified.")
}
# People might row_bind() two of these data frames, so we can't rely just on
# the class.
is_lsh_buckets <- function(x) {
identical(names(x), c("doc", "buckets")) & inherits(x, "data.frame")
}
assertthat::on_failure(is_lsh_buckets) <- function(call, env) {
"Object is not a data frame of LSH buckets."
}
# People might run a candidates data frame through dplyr so that it loses its
# class.
is_candidates_df <- function(x) {
class_check <- inherits(x, "textreuse_candidates")
col_check <- all(c("a", "b", "score") %in% names(x)) & inherits(x, "data.frame")
class_check | col_check
}
assertthat::on_failure(is_candidates_df) <- function(call, env) {
"Object is not a candidates data frame."
}
is_integer_like <- function(x) {
is.integer(x) | (is.scalar(x) & (x == as.integer(x)))
}
assertthat::on_failure(is_integer_like) <- function(call, env) {
paste0(deparse(call$x), " is not a whole number.")
}
sort_meta <- function(meta) {
meta[order(names(meta))]
}
sort_df_by_rows <- function(df) {
assert_that(all(c("a", "b") %in% colnames(df)),
is.data.frame(df))
for (i in seq_len(nrow(df))) {
ordered <- sort(c(df[i, "a"], df[i, "b"]))
df[i, "a"] <- ordered[1]
df[i, "b"] <- ordered[2]
}
df
}
sort_df_by_columns <- function(df) {
assert_that(all(c("a", "b") %in% colnames(df)),
is.data.frame(df))
df <- df[with(df, order(a, b)), ]
# rownames(df) <- NULL
df
}
# Given a word, create a string with the same number of marker characters
mark_chars <- function(word, char) {
str_c(rep(char, str_length(word)), collapse = "")
}
================================================
FILE: R/wordcount.R
================================================
#' Count words
#'
#' This function counts words in a text, for example, a character vector, a
#' \code{\link{TextReuseTextDocument}}, some other object that inherits from
#' \code{\link[NLP]{TextDocument}}, or all the documents in a
#' \code{\link{TextReuseCorpus}}.
#'
#' @param x The object containing a text.
#' @export
#' @return An integer vector for the word count.
wordcount <- function(x) UseMethod("wordcount", x)
#' @export
wordcount.default <- function(x) {
assert_that(is.string(x))
str_count(x, boundary("word"))
}
#' @export
wordcount.TextDocument <- function(x) wordcount(x$content)
#' @export
wordcount.TextReuseCorpus <- function(x) {
vapply(x$documents, wordcount, integer(1))
}
================================================
FILE: README.Rmd
================================================
---
output: md_document
title: Detect Text Reuse and Document Similarity
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE, warning = FALSE, message = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
warning = FALSE,
fig.path = "README-"
)
suppressWarnings(suppressPackageStartupMessages(library(dplyr)))
```
# textreuse
## Overview
This [R](https://www.r-project.org/) package provides a set of functions for measuring similarity among documents and detecting passages which have been reused. It implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language. It is broadly useful for tasks such as detecting duplicate documents in a corpus prior to text analysis or identifying borrowed passages between texts. The classes provided by this package follow the model of other natural language processing packages for R, especially the [NLP](https://cran.r-project.org/package=NLP) and [tm](https://cran.r-project.org/package=tm) packages. (However, this package has no dependency on Java, which should make it easier to install.)
### Citation
If you use this package for scholarly research, I would appreciate a citation.
```{r}
citation("textreuse")
```
## Installation
To install this package from CRAN:
```{r eval=FALSE}
install.packages("textreuse")
```
To install the development version from GitHub, use [devtools](https://github.com/r-lib/devtools).
```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("ropensci/textreuse", build_vignettes = TRUE)
```
## Examples
There are three main approaches that one may take when using this package: pairwise comparisons, minhashing/locality sensitive hashing, and extracting matching passages through text alignment.
See the [introductory vignette](https://docs.ropensci.org/textreuse/articles/textreuse-introduction.html) for a description of the classes provided by this package.
```{r eval = FALSE}
vignette("textreuse-introduction", package = "textreuse")
```
### Pairwise comparisons
In this example we will load a tiny corpus of three documents. These documents are drawn from Kellen Funk's [research](https://kellenfunk.org/field-code/) into the propagation of legal codes of civil procedure in the nineteenth-century United States.
```{r}
library(textreuse)
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, meta = list(title = "Civil procedure"),
tokenizer = tokenize_ngrams, n = 7)
```
We have loaded the three documents into a corpus, which involves tokenizing the text and hashing the tokens. We can inspect the corpus as a whole or the individual documents that make it up.
```{r}
corpus
names(corpus)
corpus[["ca1851-match"]]
```
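To make the tokenizing step concrete, here is a minimal base R sketch of shingled n-grams. It is not the package's own tokenizer: `shingle_sketch()` is a hypothetical helper named only for this illustration, and its punctuation handling is a simplification.

```r
# Hypothetical helper: split a string into lower-case words, then build
# overlapping ("shingled") n-grams from consecutive words.
shingle_sketch <- function(text, n = 3) {
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  stopifnot(n <= length(words))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

shingles <- shingle_sketch("How many roads must a man walk down?", n = 3)
shingles
```

Each shingle overlaps its neighbor by `n - 1` words, so a text of eight words yields six 3-grams.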
Now we can compare each of the documents to one another. The `pairwise_compare()` function applies a comparison function (in this case, `jaccard_similarity()`) to every pair of documents. The result is a matrix of scores. As we would expect, some documents are similar and others are not.
```{r}
comparisons <- pairwise_compare(corpus, jaccard_similarity)
comparisons
```
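As a rough illustration of what the comparison function computes, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below uses only base R set functions on made-up shingles; `jaccard_sketch()` is a hypothetical name, not part of the package.

```r
# Jaccard similarity of two token sets: shared items / distinct items.
jaccard_sketch <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Illustrative 3-gram shingles from two short texts (invented data).
a <- c("the quick brown", "quick brown fox", "brown fox jumps")
b <- c("quick brown fox", "brown fox jumps", "fox jumps over")

# Two shingles are shared and four are distinct overall.
score <- jaccard_sketch(a, b)
score
```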
We can convert that matrix to a data frame of pairs and scores if we prefer.
```{r}
pairwise_candidates(comparisons)
```
See the [pairwise vignette](https://docs.ropensci.org/textreuse/articles/textreuse-pairwise.html) for a fuller description.
```{r eval=FALSE}
vignette("textreuse-pairwise", package = "textreuse")
```
### Minhashing and locality sensitive hashing
Pairwise comparisons can be very time-consuming because the number of comparisons grows quadratically with the size of the corpus. (A corpus with 10 documents would require at least 45 comparisons; a corpus with 100 documents would require 4,950 comparisons; a corpus with 1,000 documents would require 499,500 comparisons.) That's why this package implements the minhash and locality sensitive hashing algorithms, which can detect candidate pairs much faster than pairwise comparisons in corpora of any significant size.
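The counts in parentheses are the number of unordered document pairs, n(n - 1)/2. A quick check in base R (variable names here are only illustrative):

```r
# Number of unordered pairs among n documents: choose(n, 2) = n * (n - 1) / 2.
n_docs <- c(10, 100, 1000)
n_pairs <- choose(n_docs, 2)
data.frame(documents = n_docs, comparisons = n_pairs)
```

A directional comparison visits both orderings of each pair, so it doubles these counts.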
For this example we will load a small corpus of eight documents published by the American Tract Society. We will also create a minhash function, which represents an entire document (regardless of length) by a fixed number of integer hashes. When we create the corpus, the documents will each have a minhash signature.
```{r}
dir <- system.file("extdata/ats", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
ats <- TextReuseCorpus(dir = dir,
tokenizer = tokenize_ngrams, n = 5,
minhash_func = minhash)
```
Now we can calculate potential matches, extract the candidates, and apply a comparison function to just those candidates.
```{r}
buckets <- lsh(ats, bands = 50, progress = FALSE)
candidates <- lsh_candidates(buckets)
scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
scores
```
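The number of bands determines how similar two documents must be before they are likely to be flagged as candidates. The sketch below follows the standard banding analysis in *Mining of Massive Datasets* rather than code taken from this package, and both function names are hypothetical. It assumes `h` minhashes split into `b` bands of `r = h / b` rows each: a pair with Jaccard similarity `s` then becomes a candidate with probability `1 - (1 - s^r)^b`, and the similarity at which that curve rises most steeply is roughly `(1 / b)^(1 / r)`.

```r
# Probability that a pair with Jaccard similarity s shares at least one bucket.
lsh_prob_sketch <- function(h, b, s) {
  stopifnot(h %% b == 0)
  r <- h / b
  1 - (1 - s^r)^b
}

# Approximate similarity threshold for a given number of hashes and bands.
lsh_threshold_sketch <- function(h, b) {
  stopifnot(h %% b == 0)
  r <- h / b
  (1 / b)^(1 / r)
}

threshold <- lsh_threshold_sketch(h = 200, b = 50)
probs <- lsh_prob_sketch(h = 200, b = 50, s = c(0.2, 0.4, 0.6, 0.8))
threshold
probs
```

With 200 minhashes and 50 bands, each band covers four hashes. Using fewer bands raises the threshold; using more bands lowers it. The package documents its own helper under `lsh_probability`.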
For details, see the [minhash vignette](https://docs.ropensci.org/textreuse/articles/textreuse-minhash.html).
```{r eval=FALSE}
vignette("textreuse-minhash", package = "textreuse")
```
### Text alignment
We can also extract the optimal alignment between two documents with a version of the [Smith-Waterman](https://en.wikipedia.org/wiki/Smith-Waterman_algorithm) algorithm, used for protein sequence alignment, adapted for natural language. The longest matching substring according to scoring values will be extracted, and variations in the alignment will be marked.
```{r}
a <- "'How do I know', she asked, 'if this is a good match?'"
b <- "'This is a match', he replied."
align_local(a, b)
```
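For readers unfamiliar with Smith-Waterman, the following base R sketch fills the scoring matrix over words instead of protein residues and returns the best local alignment score. It is not the package's implementation: `sw_score_sketch()` is a hypothetical helper, the scoring values (match `2`, mismatch `-1`, gap `-1`) are assumptions chosen for illustration, and it omits the traceback step that recovers the aligned passages.

```r
# Word-level Smith-Waterman: each cell holds the best score of a local
# alignment ending at that pair of words, never dropping below zero.
sw_score_sketch <- function(a, b, match = 2L, mismatch = -1L, gap = -1L) {
  m <- matrix(0L, nrow = length(a) + 1, ncol = length(b) + 1)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      step <- if (a[i] == b[j]) match else mismatch
      m[i + 1, j + 1] <- max(0L,
                             m[i, j] + step,      # align a[i] with b[j]
                             m[i, j + 1] + gap,   # skip a word in a
                             m[i + 1, j] + gap)   # skip a word in b
    }
  }
  max(m)
}

a <- c("this", "is", "a", "good", "match")
b <- c("this", "is", "a", "match")
best <- sw_score_sketch(a, b)
best
```

Here four words match (8 points) and one gap skips over "good" (-1 point), so the best local score is 7.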
For details, see the [text alignment vignette](https://docs.ropensci.org/textreuse/articles/textreuse-alignment.html).
```{r eval=FALSE}
vignette("textreuse-alignment", package = "textreuse")
```
### Parallel processing
Loading the corpus and creating tokens benefit from using multiple cores, if available. (This works only on non-Windows machines.) To use multiple cores, set `options("mc.cores" = 4L)`, where the number is how many cores you wish to use.
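A minimal sketch of setting the option for a session and of the kind of check that decides between `lapply()` and `parallel::mclapply()`. The variable names are illustrative, and the selection rule is paraphrased from the description above rather than being an exported API.

```r
# Request four worker processes, remembering the previous setting.
old <- options(mc.cores = 4L)

# Fall back to lapply() when the option is unset or when running on Windows.
use_parallel <- !is.null(getOption("mc.cores")) &&
  .Platform$OS.type != "windows"
apply_fun <- if (use_parallel) parallel::mclapply else lapply

# Restore the previous setting when finished.
options(old)
```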
### Contributing and acknowledgments
Please note that this project is released with a [Contributor Code of Conduct](https://github.com/ropensci/textreuse/blob/master/CONDUCT.md). By participating in this project you agree to abide by its terms.
Thanks to [Noam Ross](https://www.noamross.net/) for his thorough [peer review](https://github.com/ropensci/software-review/issues/20) of this package for [rOpenSci](https://ropensci.org/).
------------------------------------------------------------------------
================================================
FILE: README.md
================================================
<!-- README.md is generated from README.Rmd. Please edit that file -->
# textreuse
## Overview
This [R](https://www.r-project.org/) package provides a set of functions
for measuring similarity among documents and detecting passages which
have been reused. It implements shingled n-gram, skip n-gram, and other
tokenizers; similarity/dissimilarity functions; pairwise comparisons;
minhash and locality sensitive hashing algorithms; and a version of the
Smith-Waterman local alignment algorithm suitable for natural language.
It is useful, for example, for detecting duplicate documents in a corpus
prior to text analysis, or for identifying borrowed passages between
texts. The classes provided by this package follow the model of
other natural language processing packages for R, especially the
[NLP](https://cran.r-project.org/package=NLP) and
[tm](https://cran.r-project.org/package=tm) packages. (However, this
package has no dependency on Java, which should make it easier to
install.)
### Citation
If you use this package for scholarly research, I would appreciate a
citation.

    citation("textreuse")
    #> To cite package 'textreuse' in publications use:
    #>
    #>   Mullen L, Li Y (2026). _textreuse: Detect Text Reuse and Document
    #>   Similarity_. R package version 1.0.1,
    #>   https://github.com/ropensci/textreuse,
    #>   <https://docs.ropensci.org/textreuse/>.
    #>
    #> A BibTeX entry for LaTeX users is
    #>
    #>   @Manual{,
    #>     title = {textreuse: Detect Text Reuse and Document Similarity},
    #>     author = {Lincoln Mullen and Yaoxiang Li},
    #>     year = {2026},
    #>     note = {R package version 1.0.1,
    #> https://github.com/ropensci/textreuse},
    #>     url = {https://docs.ropensci.org/textreuse/},
    #>   }

## Installation
To install this package from CRAN:

    install.packages("textreuse")

To install the development version from GitHub, use
[devtools](https://github.com/r-lib/devtools).

    # install.packages("devtools")
    devtools::install_github("ropensci/textreuse", build_vignettes = TRUE)

## Examples
There are three main approaches that one may take when using this
package: pairwise comparisons, minhashing/locality sensitive hashing,
and extracting matching passages through text alignment.
See the [introductory
vignette](https://docs.ropensci.org/textreuse/articles/textreuse-introduction.html)
for a description of the classes provided by this package.

    vignette("textreuse-introduction", package = "textreuse")

### Pairwise comparisons
In this example we will load a tiny corpus of three documents. These
documents are drawn from Kellen Funk’s
[research](https://kellenfunk.org/field-code/) into the propagation of
legal codes of civil procedure in the nineteenth-century United States.

    library(textreuse)
    dir <- system.file("extdata/legal", package = "textreuse")
    corpus <- TextReuseCorpus(dir = dir, meta = list(title = "Civil procedure"),
                              tokenizer = tokenize_ngrams, n = 7)

We have loaded the three documents into a corpus, which involves
tokenizing the text and hashing the tokens. We can inspect the corpus as
a whole or the individual documents that make it up.

    corpus
    #> TextReuseCorpus
    #> Number of documents: 3
    #> hash_func : hash_string
    #> title : Civil procedure
    #> tokenizer : tokenize_ngrams

    names(corpus)
    #> [1] "ca1851-match"   "ca1851-nomatch" "ny1850-match"

    corpus[["ca1851-match"]]
    #> TextReuseTextDocument
    #> file : C:/Users/Bach/AppData/Local/R/win-library/4.4/textreuse/extdata/legal/ca1851-match.txt
    #> hash_func : hash_string
    #> id : ca1851-match
    #> minhash_func :
    #> tokenizer : tokenize_ngrams
    #> content : § 4. Every action shall be prosecuted in the name of the real party
    #> in interest, except as otherwise provided in this Act.
    #>
    #> § 5. In the case of an assignment of a thing in action, the action by
    #> the as

Now we can compare each of the documents to one another. The
`pairwise_compare()` function applies a comparison function (in this
case, `jaccard_similarity()`) to every pair of documents. The result is
a matrix of scores. As we would expect, some documents are similar and
others are not.

    comparisons <- pairwise_compare(corpus, jaccard_similarity)
    comparisons
    #>                ca1851-match ca1851-nomatch ny1850-match
    #> ca1851-match             NA              0    0.3842549
    #> ca1851-nomatch           NA             NA    0.0000000
    #> ny1850-match             NA             NA           NA

We can convert that matrix to a data frame of pairs and scores if we
prefer.

    pairwise_candidates(comparisons)
    #> # A tibble: 3 × 3
    #>   a              b              score
    #> * <chr>          <chr>          <dbl>
    #> 1 ca1851-match   ca1851-nomatch 0
    #> 2 ca1851-match   ny1850-match   0.384
    #> 3 ca1851-nomatch ny1850-match   0

See the [pairwise
vignette](https://docs.ropensci.org/textreuse/articles/textreuse-pairwise.html)
for a fuller description.

    vignette("textreuse-pairwise", package = "textreuse")

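
As a rough illustration of the score used above, the Jaccard similarity
of two token sets is the size of their intersection divided by the size
of their union. The sketch below uses base R only. `ngrams_sketch()` and
`jaccard_sketch()` are hypothetical helpers written for this example;
they are not the package's `tokenize_ngrams()` or `jaccard_similarity()`,
whose tokenization details may differ.

```r
# Hypothetical helper: lowercase, split on non-alphanumerics, and build
# shingled word n-grams.
ngrams_sketch <- function(x, n) {
  w <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  w <- w[nzchar(w)]
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1L),
         function(i) paste(w[i:(i + n - 1L)], collapse = " "),
         character(1))
}

# Hypothetical helper: |A intersect B| / |A union B| on unique tokens.
# Two empty sets give NaN, since the union is empty.
jaccard_sketch <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

x <- ngrams_sketch("the quick brown fox jumps", 2)
y <- ngrams_sketch("the quick brown dog jumps", 2)
jaccard_sketch(x, y)
```

Here the two sentences share two of six distinct bigrams, so the score
is one third.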
### Minhashing and locality sensitive hashing
Pairwise comparisons can be very time-consuming because the number of
comparisons grows quadratically with the size of the corpus. (A corpus
with 10 documents requires 45 comparisons; a corpus with 100 documents
requires 4,950; a corpus with 1,000 documents requires 499,500.) That’s
why this package implements the minhash and locality sensitive hashing
algorithms, which can detect candidate pairs much faster than pairwise
comparisons in corpora of any significant size.
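
The counts quoted above are the number of unordered pairs of documents,
n(n - 1)/2. A quick check in base R, with `n_docs` as an illustrative
variable name:

```r
# Number of unordered document pairs for several corpus sizes.
n_docs <- c(10, 100, 1000)
n_pairs <- choose(n_docs, 2)
n_pairs
```

Multiplying the corpus size by ten multiplies the number of pairs by
roughly one hundred.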
For this example we will load a small corpus of eight documents
published by the American Tract Society. We will also create a minhash
function, which represents an entire document (regardless of length) by
a fixed number of integer hashes. When we create the corpus, the
documents will each have a minhash signature.

    dir <- system.file("extdata/ats", package = "textreuse")
    minhash <- minhash_generator(200, seed = 235)
    ats <- TextReuseCorpus(dir = dir,
                           tokenizer = tokenize_ngrams, n = 5,
                           minhash_func = minhash)

Now we can calculate potential matches, extract the candidates, and
apply a comparison function to just those candidates.

    buckets <- lsh(ats, bands = 50, progress = FALSE)
    candidates <- lsh_candidates(buckets)
    scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
    scores
    #> # A tibble: 1 × 3
    #>   a              b                      score
    #>   <chr>          <chr>                  <dbl>
    #> 1 remember00palm remembermeorholy00palm 0.701

For details, see the [minhash
vignette](https://docs.ropensci.org/textreuse/articles/textreuse-minhash.html).

    vignette("textreuse-minhash", package = "textreuse")

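
The choice of 200 minhashes and 50 bands determines which pairs are
likely to become candidates. The sketch below assumes the standard
banding analysis for locality sensitive hashing: with `h` hashes split
into `b` bands of `r = h / b` rows, a pair with Jaccard similarity `s`
shares at least one band with probability `1 - (1 - s^r)^b`, and the
approximate similarity threshold is `(1 / b)^(1 / r)`. The helper names
are hypothetical and are written for this illustration; the package
documents its own helpers in `?lsh_probability`.

```r
# Hypothetical helper: probability that a pair with similarity s
# becomes a candidate, given h minhashes split into b bands.
lsh_prob_sketch <- function(h, b, s) {
  r <- h / b                 # rows (hashes) per band
  1 - (1 - s^r)^b
}

# Hypothetical helper: approximate similarity at which pairs start to
# be detected, (1 / b)^(1 / r) with r = h / b.
lsh_threshold_sketch <- function(h, b) {
  (1 / b)^(b / h)
}

lsh_threshold_sketch(h = 200, b = 50)
lsh_prob_sketch(h = 200, b = 50, s = c(0.1, 0.4, 0.7))
```

Using fewer bands raises the threshold, so fewer but more similar pairs
are returned; using more bands lowers it.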
### Text alignment
We can also extract the optimal alignment between two documents with a
version of the
[Smith-Waterman](https://en.wikipedia.org/wiki/Smith-Waterman_algorithm)
algorithm, used for protein sequence alignment, adapted for natural
language. The longest matching substring according to scoring values
will be extracted, and variations in the alignment will be marked.

    a <- "'How do I know', she asked, 'if this is a good match?'"
    b <- "'This is a match', he replied."
    align_local(a, b)
    #> TextReuse alignment
    #> Alignment score: 7
    #> Document A:
    #> this is a good match
    #>
    #> Document B:
    #> This is a #### match

For details, see the [text alignment
vignette](https://docs.ropensci.org/textreuse/articles/textreuse-alignment.html).

    vignette("textreuse-alignment", package = "textreuse")

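
To make the scoring concrete, here is a minimal sketch of the
Smith-Waterman recurrence applied to words rather than characters. It
computes only the best local score, with no traceback, and it assumes
scoring values of 2 for a match and -1 for a mismatch or a gap.
`sw_score_sketch()` and its simple tokenizer are hypothetical; this is
not the package's implementation of `align_local()`.

```r
# Hypothetical helper: best word-level local alignment score.
sw_score_sketch <- function(a, b, match = 2L, mismatch = -1L, gap = -1L) {
  # Lowercase and split on anything that is not a letter or digit.
  tokens <- function(x) {
    w <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
    w[nzchar(w)]
  }
  wa <- tokens(a)
  wb <- tokens(b)
  # H[i + 1, j + 1] is the best score of an alignment ending at
  # wa[i] and wb[j]; the first row and column stay at zero.
  H <- matrix(0L, nrow = length(wa) + 1L, ncol = length(wb) + 1L)
  for (i in seq_along(wa)) {
    for (j in seq_along(wb)) {
      step <- if (wa[i] == wb[j]) match else mismatch
      H[i + 1L, j + 1L] <- max(0L,
                               H[i, j] + step,       # align the two words
                               H[i, j + 1L] + gap,   # skip a word in a
                               H[i + 1L, j] + gap)   # skip a word in b
    }
  }
  max(H)
}

a <- "'How do I know', she asked, 'if this is a good match?'"
b <- "'This is a match', he replied."
sw_score_sketch(a, b)
```

Under those assumed values the example pair scores 7: four matching
words contribute 8, and skipping "good" costs 1.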
### Parallel processing
Loading the corpus and creating tokens benefit from using multiple
cores, if available. (This works only on non-Windows machines.) To use
multiple cores, set `options("mc.cores" = 4L)`, where the number is how
many cores you wish to use.
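
Rather than hard-coding the number of cores, one option is to derive it
from the machine. This is a sketch using the base `parallel` package;
leaving one core free is a convention assumed here, not a requirement of
this package.

```r
# Use all but one of the detected cores, and never fewer than one.
# detectCores() can return NA, which na.rm = TRUE guards against.
n_cores <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)
options(mc.cores = n_cores)
getOption("mc.cores")
```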
### Contributing and acknowledgments
Please note that this project is released with a [Contributor Code of
Conduct](https://github.com/ropensci/textreuse/blob/master/CONDUCT.md).
By participating in this project you agree to abide by its terms.
Thanks to [Noam Ross](https://www.noamross.net/) for his thorough [peer
review](https://github.com/ropensci/software-review/issues/20) of this
package for [rOpenSci](https://ropensci.org/).
------------------------------------------------------------------------
================================================
FILE: _pkgdown.yml
================================================
url: https://docs.ropensci.org/textreuse/
template:
  bootstrap: 5
  bootswatch: united
authors:
  Yaoxiang Li:
    href: "https://github.com/yaoxiangli"
================================================
FILE: appveyor.yml
================================================
# DO NOT CHANGE the "init" and "install" sections below
# Download script file from GitHub
init:
  ps: |
    $ErrorActionPreference = "Stop"
    Invoke-WebRequest http://raw.github.com/krlmlr/r-appveyor/master/scripts/appveyor-tool.ps1 -OutFile "..\appveyor-tool.ps1"
    Import-Module '..\appveyor-tool.ps1'
install:
  ps: Bootstrap
# Adapt as necessary starting from here
build: off
build_script:
  - travis-tool.sh install_deps
test_script:
  - travis-tool.sh run_tests
on_failure:
  - travis-tool.sh dump_logs
artifacts:
  - path: '*.Rcheck\**\*.log'
    name: Logs
  - path: '*.Rcheck\**\*.out'
    name: Logs
  - path: '*.Rcheck\**\*.fail'
    name: Logs
  - path: '*.Rcheck\**\*.Rout'
    name: Logs
  - path: '\*_*.tar.gz'
    name: Bits
  - path: '\*_*.zip'
    name: Bits
================================================
FILE: cran-comments.md
================================================
This is a new release with bug fixes, documentation refreshes, and helper
functions added after a long maintenance interval.
This resubmission fixes a moved URL in README.md:
https://github.com/hadley/devtools was replaced with
https://github.com/r-lib/devtools.
This is also a maintainer change release. The maintainer has changed from
Lincoln Mullen <lincoln@lincolnmullen.com> to Yaoxiang Li
<liyaoxiang@outlook.com>. The previous maintainer, Lincoln Mullen, has
confirmed by email that he supports the maintainer transition. I can provide
the email thread if requested.
## Test environments
* local Windows 11 install: R 4.4.2
## R CMD check results
There were no ERRORs or WARNINGs.
Local checks were run with:
`R CMD check --no-manual textreuse_1.0.1.tar.gz`
`R CMD check --as-cran --no-manual textreuse_1.0.1.tar.gz`
The `--as-cran` check reported three NOTEs:
* This release changes the maintainer from Lincoln Mullen to Yaoxiang Li.
* The local Windows check was unable to verify the current time.
* The local Windows check reported that README.md or NEWS.md could not be
checked without pandoc. README.md was regenerated locally with rmarkdown,
and the pkgdown site was built locally with RStudio Pandoc before release.
There were no invalid URL NOTEs.
## Downstream dependencies
There are no known downstream dependency issues.
================================================
FILE: inst/extdata/ats/calltounconv00baxt.txt
================================================
Glass.
Book,____._._
(ttmmtixc
mm¥m %m(m>m
v'OJj
A
CALL
TO
THE TTNCONVEETS9.
BY REV. RICHARD BAXTER.
AN INTRODUCTORY ESSAY,
BY RCV. THOMAS CHALMERS, D. D.
PDBUSHED BY THE
AMERICAN TRACT SOCIETY,
150 NASSAU-STREET, KEWYORK.
D. Fanshaw, Printer.
DR. CHALMERS'
|f INTRODUCTORY ESSAY,
ABRIDGED.
The " Call to the Unconverted " by Richard Bax •
ter, is characterized by all that solemn earnestness,
and urgency of appeal, for winch the writings of this
much-admired author are so peculiarly distinguished.
He seems to look upon mankind solely with the eyes
of the Spirit, and exclusively to recognize them in
their spiritual relations, and in the great and essential
elements of their immortal being. Their future des-
tiny is the all-important concern which fills and en-
grosses his mind, and he regards nothing of any mag-
nitude but what has a distinct bearing on their spiri-
tual and eternal condition. His business, therefore, is
always with the conscience, to which he makes the
most forcible appeals, ami which he plies with all
those arguments which are fitted to awaken the sinner
to a deep sense of the necessity and importance of im-
mediate repentance. He endeavors to move him by
the most touching of all representations, the tender-
ness of a beseeching God waiting to be gracious, and
not willing that any should perish ; and while he em-
ploys every form of entreaty, which tenderness and
compassion can suggest, to allure the sinner to "turn
and live," he does not shrink from forcing on his con-
victions those considerations which are fitted to alarm
his fears, the terrors of the Lord, and the wrath, not
merely of an offended Lawgiver, but of a God of love,
whose threatemngs he disregards, whose grace he des*
4 INTRODUCTION.
pises, and whose mercy he rejects. And aware of the
deceitfulness of sin in hardening the heart, and in be-
traying the sinner into a neglect of his spiritual inte-
rests, he divests him of every refuge, and strips him of
every plea for postponing his preparation for eternity.
He forcibly exposes the delusion of convenient seasons,
and the awful infatuation and hazard of delay, and
Knowing the magnitude of the stake at issue, he urges
the sinner to immediate repentance, as if the fearful
and almost absolute alternative were "Now or Never."
And to secure the commencement of such an important
work against all the dangers to which procrastination
might expose it, he endeavors to arrest the sinner in
his career of guilt and unconcern, and resolutely to fix
his determination on " turning to God this day with-
out delay."
There are two very prevalent delusions on this sub-
ject, which we should like to expose ; the one regards
the nature, and the other the season of repentance;
both of which are pregnant with mischief to the minds
of men. With regard to the first, much mischief has
arisen from mistakes respecting the meaning of tho
tenn repentance. The word repentance occurs with
two different meanings in the New Testament ; and
it is to be regretted, that two different words could not
have been devised to express these. This is charge-
able upon the poverty of our language; for it is to be
observed, that in the original Greek the distinction in
the meanings is pointed out by a distinction in tho
words. The employment of one term to denote two
different things has the effect of confounding and mis-
leading the understanding ; and it is much to be
wished, that every ambiguity of this kind were clear-
ed away from that most interesting point in the pro-
INTRODUCTION. 5
cess of a human soul, ax which it turns from sin unto
righteousness, and from the power of Satan unto God.
When in common language, a man says, " I repent
of such an action," he is understood to say, " I am sorry
for having done it." The feeling is familiar to all of
us. How often does the man of dissipation prove this
sense of the word repentance, when he awakes in the
monung, and, oppressed by the languor of ins ex-
hausted faculties, looks back with remorse on the fol-
lies and profligacies of the night that is past? How
often does the man of unguarded conversation prove it,
when he thinks of the friends whose feelings he has
wounded by some hasty utterance which he cannot
recall 1 How often is it proved by the man of business,
when he reflects on the rash engagement which ties
him down to a losing speculation? All these people
would be perfectly understood wiien they say, " We
repent of these doings." The word repentance so
applied is about equivalent to the word regret. There
are several passages in the New Testament where
this is the undoubted sense of the word repentance.
In Matt. 27: 3. the wretched Judas repented himself
of his treachery ; and surely, wlien we think of the
awful denunciation uttered by our Savior against the
man who should betray him, that it were better for
him if he had not been born, we shall never confound
the repentance which Judas experienced with that
repentance winch is unto salvation.
Now here lies the danger to practical Christianity.
In the above-cited passage, to repent is just to regret,
or to be sorry for ; and tins we conceive .o be by fai
the most prevailing sense of the term in the English
language. But there are other places where the same
term is employed to denote that which is urged upon
6 INTRODUCTION.
us as a duty — that which is preached for the remis-
sion of sins — that which is so indispensable to sinners,
as to call forth the declaration from our Savior, thai
unless we have it, we shall all likewise perish. Now,
though repentance, in all these cases, is expressed by
the same term in our translation as the repentance ot
mere regret, it is expressed by a different term in the
original record of our faith. This surely might lead
us to suspect a difference of meaning, and should cau-
tion us against taking up with that, as sufficient for
the business of our salvation, which is short of saving
and scriptural repentance. There may be an alterna-
tion of wilful sin, and of deep-felt sorrow, up to th©
very end of our history — there may be a presumptu-
ous sin committed every day, and a sorrow regularly
succeeding it. Sorrow may imbitter every act of sin —
sorrow may darken every interval of sinful indul-
gence — and sorrow may give an unutterable anguish
to the pains and the prospects of a deathbed. Couple
all this with the circumstance that sorrow passes, in
the common currency of our language, for repentance ,
and that repentance is made, by our Bible, to lie at
the turning point from a state of condemnation to a
state of acceptance with God; and it is difficult not to
conceive thai much danger may have arisen from this,
leading to indistinct views of the nature of repentance,
and to slender and superficial conceptions of the migh-
ty change which is implied in it.
We are far from saying that the eye of Christiana
is not open to this danger — and that the vigilant care
of Christian authors has not been employed in avert-
ing it. Where will we get a better definition of re-
pentance unto life than in our Shorter Catechism? by
which the sinner is represented not merely as grieving.
INTRODUCTION. 7
but, along with his grief and hatred of sin, aa turning
from it unto God with full purpose of, and endeavor
after new obedience. But the mischief is, that the
word repent has a common meaning, different from
the theological ; that wherever it is used, this common
meaning is apt to intrude itself, and exert a kind of
habitual imposition upon the understanding — that the
influence of the single word carries it over the influ-
ence of the lengthened explanation — and thus it is
that, for a steady progress in the obedience of the
gospel, many persevere, to the end of their days, in a
wretched course of sinning and cf sorrowing, without
fruit and without amendment.
To save the practically mischievous effect arising
from the application of one term to two different things,
one distinct and appropriate tenn has been suggested
for the saving repentance of the New Testament.
The term repentance itself has been restricted to the
repentance of mere sorrow, and is made equivalent to
regret ; and for the other, able translators have
adopted the word reformation. The one is expressive
of sorrow for our past conduct ; the other is expressive
of our renouncing it. It denotes an actual turning
from the habits of life that we are sorry for. Give us,
say they, a change from bad deeds to good deeds,
from bad habits to good habits, from a life of wicked-
ness to a life of conformity to the requirements of
heaven, and you give us reformation.
Now there is often nothing more unprofitable than
a dispute about words ; but if a word has got into com-
mon use, a common and generally understood mean-
ing is attached to it ; and if this meaning does not
just come up to the thing which we want to express
by it, the application of that word to that thing has
3 INTRODUCTION-
the same misleading effects as in the case already
alluded to. Now, we have much the same kind of
exception to allege against the term reformation, that
we have alleged against the term repentance. The
term repentance is inadequate — and why? because,
in the common use of it, it is equivalent to regret, and
regret is short of the saving change that is spoken of
in the New Testament. On the very same principle,
we count the term reformation to be inadequate. We
think that, in common language, a man would receive
the appellation of a reformed man upon the mere
change of his outward habits, without any reference
to the change of mind and of principle which gave
rise to it. Let the drunkard give up his excesses —
let the backbiter give up his evil speakings — let the
extortioner give up his unfair charges — and we would
apply to one and all of them, upon the mere change
of their external doings, the character of reformed
men. Now, it is evident that the drunkard may give
up his drunkenness, because checked by a serious im-
pression of the injury he has been doing to his health
and his circumstances. The backbiter may give up
his evil speaking, on being made to perceive that the
hateful practice has brought upon him the contempt
and alienation of his neighbors. The extortioner may
give up his unfair charges, upon taking it into calcu-
lation that his business is likely to suffer by the deser-
tion of his customers. Now, it is evident, that though
in each of these cases there has been what the world
would call reformation, there has not been scriptural
repentance. The deficiency of the former term con-
gists in its having been employed to denote a mere
change in the deeds or in the habits of the outward
man • and if employed as equivalent to repentance, it
INTRODUCTION. y
may delude us into the idea that the change by which
we are made meet for a happy eternity is a far more
slender and superficial thing than it really is. It is
of little importance to be told that the translator means
it only in the sense of a reformed conduct, proceeding
from the influence of a new and a right principle
within. The common meaning of the word will, as
in the former instance, be ever and anon intruding
itself, and get the better of all the formal cautions, and
all the qualifying clauses of our Bible commentators.
But, will not the original word itself throw some
light upon this important question? The repentance
which is enjoined as a duty — the repentance which
is unto salvation — the repentance which sinners un-
dergo when they pass to a state of acceptance with
God from a state of enmity against him — these are
all one and the same thing, and are expressed by one
and the same word in the original language of the
New Testament. It is different from the word which
expresses the repentance of sorrow ; and if translated
according to the parts of which it is composed, it sig-
nifies neither more nor less than a change of mind.
This of itself is sufficient to prove the inadequacy ot
the term reformation^a term which is often applied
to a man upon the mere change of his conduct, with-
out ever adverting to the state of his mind, or to the
kind of change in motive and in principle which it
has undergone. It is true, that there can be no change
in the conduct without some change in the inward
principle. A reformed drunkard, before careless about
health or fortune, may be so far changed as to become
impressed with these considerations; but this change
is evidently short of that which the Bible calls repent-
ance toward God. It is a change that may, and has
10 INTRpDUCTION.
taken place in many a mind, when there was no
effectual sense of the God who is above us, and of the
eternity which is before us. It is a change, brought
about by the prospect and the calculation of worldly
advantages ; and, in the enjoyment of these advan-
tages it hath its sole reward. But it is not done unto
God, and God will not accept of it as done unto him.
Reformation may signify nothing more than the mere
surface-dressing of those decencies, and proprieties,
and accomplishments, and civil and prudential duties,
which, however fitted to secure a man's acceptance
in society, may, one and all of them, consist with a
heart alienated from God, and having every principle
and affection of the inner man away from him. True,
it is such a change as the man will reap benefit from,
as his friends will rejoice in, as the world will call
reformation ; but it is not such a change as will make
him meet for heaven; nor is it, in its import, what our
Savior speaks of, when he says, " I tell you nay, ex-
cept ye repent, ye shall all likewise perish."
There is no single word in the English language
which occurs to us as fully equal to the faithful ren-
dering of the term in the original. Renewedness oj
mind, however awkward a phrase this may be, is
perhaps the most nearly expressive of it. Certain it
is, that it harmonizes with those other passages of the
Bible where the process is described by which saving
repentance ie brought about. We read of being
transformed by the renewing of our minds, of the re-
newing of the Holy Ghost, of being renewed in the
spirit of our minds. Scriptural repentance, therefore,
is that deep and radical change whereby a said turns
from the idcls of sin and of self unto God, and de-
votes eve?y movement of the inner and Vie outer man
INTRODUCTION. 11
to ihe. captivity of his obedicwe. This is the change
which, whether it be expressed by one word or not in
the English language, we would have you well to
understand ; and reformation or change in the out-
ward conduct, instead of being saving and scriptural
repentance, is what, in the language of John the
Baptist, we would call a fruit meet for it. But if
miscliief is likely to arise, from the want of an ade-
quate word in our language, to that repentance which
is unto salvation, there is one effectual preservative
against it — a firm and consistent exhibition of the
whole counsel and revelation of God. A man who is
well read in his New Testament, and reads it with
docility, will dismiss all his meagre conceptions of
repentance when he comes to the following state-
ments: — "Except a man be born again he cannot
see the kingdom of God." " Except ye be converted,
and become as little children, ye shall not enter into
the kingdom of heaven." " If any man have not the
Spirit of Christ he is none of his." " The carnal
mind is enmity against God ; and if ye live after the
flesh ye shall die; but if ye, through the Spirit, do
mortify the deeds of the body, ye shall live." " Be not
then conformed to this world, but be ye transformed
by the renewing of your minds." Such are the terms
employed to describe the process by which the soul
of man is renewed unto repentance ; and, with your
hearts familiarized to the mighty import of these
terms, you will carry with you an effectual guarantee
against those false and flimsy impressions, which are
so current in the world, about the preparation of a
sinner for eternity. *****
We should like, moreover, to reduce every man to
the feeling of repentance now or the alternative of
12 INTRODUCTION.
repentance never. We should like to flash it upmi
your convictions, that, by putting the call away from
you now, you put your eternity away from you. We
should like tc expose the whole amount of that accure
ed infatuation which lies in delay. We should like to
arouse every soul out of its lethargies, and give noquar*
ter to the plea of a little more sleep, and a little more
slumber. We should like you to feel as if the whole of
your future destiny hinged on the very first movement
to which you turned yourselves. The work of repent-
ance must have a beginning; and we should like you
to know that, if not begun to-day, the chance will be
less of its being begun to-morrow. And if the greater
chance has failed, what hope can we build upon the
smaller?— and a chance to that is always getting
smaller. Each day, as it revolves over the sinner's
head, finds him a harder, and a more obstinate, ana
a more helplessly enslaved sinner, than belbre. It
was this consideration which gave Richard Baxter
such earnestness and such urgency in his " Call." He
knew that the barrier in the way of the sinner's return
was strengthened by every act of resistance to the call
which urges it. That the refusal of this moment
hardened the man against the next attack of a Gos-
pel argument that is brought to bear upon him. That
-.f he attempted you now, and he failed, when he came
back upon yoa he would find himself working on a
more obstinate and uncomplying subject than ever.
And therefore it is that he ever feels as if the present
were his only opportunity. That he is now upon his
vantage ground, and he gives every energy of his
soul to the great point of making the most of it. He
will put up with none of your evasions. He will
consent to none of your postponements. He will pay
INTRODUCTION. 13
respect to none of your more convenient seasons. He
tells you, that the matter with which he is charged
lias all the urgency of a matter in hand. He speaka
to you with as much earnestness as if he knew that
you were going to step into eternity in half an hour.
He delivers his message with as much solemnity as if
he knew that tins was your last meeting on earth,
and that you were never to see each other till you
stood together at the judgment-seat. He knew that
some mighty change must take place in you ere you
be fit for entering into the presence of God ; and that
the time in which, on every plea of duty and of inte
rest, you should bestir yourselves to secure this, is the
present time. This is the distinct point he assigns to
himself; and the whole drift of his argument is to
urge an instantaneous choice of the better part, by
telling you how you multiply every day the obstacles
to your future repentance, if you begin not the work
of repentance now.
Before bringing our Essay to a close we shall make
some observations on the mistakes concerning repent-
ance, which we have endeavored to expose, and ad-
duce some arguments for urging on the consciences of
our -readers tke necessity and importance of imme-
diate repentance.
1. The work of repentance is a work which must
be done ere we die ; for, unless we repent, we shall all
likewise perish. Now, the easier this work is in our
conception, we shall think it the less necessary to enter
upon it immediately. We shall leok upon it as a
work that may be done at any time, and therefore put
it off a little longer, and a little longer. We shall,
perhaps, look forward to that retirement from the
world and its temptations which we figure old age to
Sax. Call, g
14 1NTR0DDCTI0N.
bring along with it, and falling in with the too com
mon idea, that, the evening of life is the appropriate
season of preparation for another world, we shall
think that the author is bearing too closely and too
urgently upon us, when, in the language of the Bible,
he speaks of " to-day," while it is called to-day, and
will let us off with no other repentance than repent-
ance "now," seeing that now only is the accepted
time, and now only the day of salvation, which he
has a warrant to proclaim to us. This dilatory way
of it is very much favored by the mistaken and very
defective view of repentance which we have attempt-
ed to expose. We have some how or other got into
the delusion that repentance is nothing but sorrow;
and were we called to fix upon the scene where this
sorrow is likely to be felt in the degree that it is deep-
est and most overwhelming, we would point to the
chamber of the dying man. It is awful to think that,
generally speaking, this repentance of mere sorrow is
the only repentance of a death-bed. Yes ! we shall
meet with sensibility deep enough and painful enough
there — with regret in all its bitterness — with terror
mustering up its images of despair, and dwelling
upon them in all the gloom of an affrighted imagina-
tion ; and this is mistaken, not merely for the drapery
of repentance, but for the very substance of it. We
look forward, and we count upon this — that the sins
of a life are to be expunged by the sighing and sor
rowing of the last days of it. We should give up this
wretchedly superficial notion of repentance, a nd cease,
from this moment, to be led astray by it. The mind
may sorrow over its corruptions at the very time that
it is under the power of them. A man may weep
most bitterly over the perversities of his moral consti-
tution; but to change that constitution, under the
workings of the Holy Spirit, is a different affair.
Now, this is the mighty work of repentance. He who
has undergone it is no longer the servant of sin. He
dies unto sin, he lives unto God. A sense of the au-
thority of God is ever present with him, to wield the
ascendancy of a great master-principle over all his
movements — to call forth every purpose, and to carry
it forward, through all the opposition of sin and of
Satan, into accomplishment. This is the grand revo-
lution in the state of the mind which repentance
brings along with it. To grieve because this work is
not done, is a very different thing from the doing of it.
A deathbed is the very best scene for acting the first,
but it is the very worst for acting the second. The re-
pentance of Judas has often been acted there. We
ought to think of the work in all its magnitude, and
not to put it off to that awful period when the soul is
crowded with other things, and has to maintain its
weary struggle with the pains, and the distresses,
and the shiverings, and the breathless agonies of a
deathbed.
2. There are two views that may be taken of the
way in which repentance is brought about, and which-
ever of them is adopted, delay carries along with it
the saddest infatuation. It may be looked upon as
a step taken by man as a voluntary agent, and we
would ask you, upon your experience of the powers
and the performances of humanity, if a deathbed is
the time for taking such a step? Is this a time for a
voluntary being exercising a vigorous control over his
own movements? When racked with pain, and borne
down by the pressure of a sore and overwhelming
calamity ? Surely the greater the work of repentance
is, the more ease, the more time, the more freedom
from suffering, is necessary for carrying it on; and,
therefore, addressing you as voluntary beings, as
beings who will and who do, we call upon you to seek
God early that you may find him — to haste, and make
no delay in keeping his commandments.
The other view is, that repentance is not a self-
originating work in man, but the work of the Holy
Spirit in him as the subject of its influences. This
view is not opposite to the former. It is true that man
wills and does at every step in the business of his sal-
vation; and it is as true that God works in him so to
will and to do. Take this last view of it then. Look
on repentance as the work of God's Spirit in the soul
of man, and we are furnished with a more impressive
argument than ever, and set on higher vantage for
urging you to stir yourselves, and set about it im-
mediately. What is it that you propose ? To keep
by your present habits, and your present indulgences,
and build yourselves up all the while in the confidence
that the Spirit will interpose with his mighty power
of conversion upon you, at the very point of time that
you have fixed upon as convenient and agreeable?
And how do you conciliate the Spirit's answer to your
call then? Why, by doing all you can to grieve, and
to quench, and to provoke him to abandon you now.
Do you feel a motion toward repentance at this mo-
ment? If you keep it alive, and act upon it, good and
well. But if you smother and suppress this motion,
you resist the Spirit — you stifle his movements within
you ; it is what the impenitent do day after day, and
year after year — and is this the way for securing the
influences of the Spirit at the time that you would
like them best? When you are done with the world,
and are looking forward to eternity because you can-
not help it? God says, "My Spirit shall not always
strive with man." A good and a free Spirit he un-
doubtedly is, and, as a proof of it, he is now saying,
"Let whosoever will, come and take of the water of
life freely." He says so now, but we do not promise
that he will say so with effect upon your deathbeds,
if you refuse him now. You look forward then for a
powerful work of conversion being done upon you, and
yet you employ yourselves all your life long in raising
and multiplying obstacles against it. You count upon
a miracle of grace before you die, and the way you
take to make yourselves sure of it, is to grieve and
offend him while you live, who alone can perform the
miracle. O what cruel deceits will sin land us in !
and how artfully it pleads for a " little more sleep, and
a little more slumber; a little more folding of the
hands to sleep." We should hold out no longer, nor
make such an abuse of the forbearance of God : we
shall treasure up wrath against the day of wrath if
we do so. The genuine effect of his goodness is to
lead us to repentance ; let not its effect upon us be to
harden and encourage ourselves in the ways of sin.
We should cry now for the clean heart and the right
spirit; and such is the exceeding freeness of the Spirit
of God, that we shall be listened to. If we put off the
cry till then, the same God may laugh at our calam-
ity, and mock when our fear cometh.
3. Our next argument for immediate repentance is,
that we cannot bring forward, at any future period of
your history, any considerations of a more prevailing
or more powerfully moving influence than those we
may bring forward at this moment. We can tell you
now of the terrors of the Lord, we can tell you now
of the solemn mandates which have issued from his
throne — and the authority of which is upon one and
all of you. We can tell you now, that, though, in
this dead and darkened world, sin appears but a
very trivial affair — for every body sins, and it is
shielded from execration by the universal countenance
of an entire species lying in wickedness — yet it holds
true of God, what is so emphatically said of him, that
he cannot be mocked, nor will he endure it that you
should riot in the impunity of your wilful resistance
to him and to his warnings. We can tell you now,
that he is a God of vengeance ; and though, for a
season, he is keeping back all the thunder of it from a
world that he would reclaim unto himself, yet, if you
put all his expostulations away from you, and will not
be reclaimed, these thunders will be let loose upon
you, and they will fall on your guilty heads, armed
with tenfold energy, because you have not only defied
his threats, but turned your back on his offers of re-
conciliation. These are the arguments by which we
would try to open our way to your consciences, and to
awaken up your fears, and to put the inspiring activity
of hope into your bosoms, by laying before you those
invitations which are addressed to the sinner, through
the peace-speaking blood of Jesus, and, in the name
of a beseeching God, to win your acceptance of them.
At no future period can we address arguments more
powerful and more affecting than these. If these ar-
guments do not prevail upon you, we know of none
others by which a victory over the stubborn and un-
complying will can be accomplished, or by which we
can ever hope to beat in that sullen front of resistance
wherewith you now so impregnably withstand us.
We feel that, if any stout-hearted sinner shall rise
from the perusal of this "Call to the Unconverted"
with an unawakened conscience, and give himself up
to wilful disobedience — we feel as if, in reference to
him, we had made our last discharge, and it fell
powerless as water spilt on the ground, that cannot be
gathered up again. Therefore it is that we speak to
you now as if this was our last hold of you. We feel
as if on your present purpose hung all the prepara-
tions of your future life, and all the rewards or all the
horrors of your coming eternity. We will not let you
off with any other repentance than repentance now ;
and if this be refused now, we cannot, with our eyes
open to the consideration we have now urged, that
the instrument we can make to bear upon you here-
after is not more powerful than we are wielding now,
coupled with another consideration which we shall
insist upon, that the subject on which the instrument
worketh, even the heart of man, gathers, by every
act of resistance, a more uncomplying obstinacy than
before ; we cannot, with these two thoughts in our
mind, look forward to your future history, without
seeing spread over the whole path of it the iron of a
harder impenitency — the sullen gloom of a deeper
and more determined alienation.
4. Another argument, therefore, for immediate re-
pentance is, that the mind which resists a present call
or a present reproof, undergoes a progressive harden-
ing toward all those considerations which arm the
call of repentance with all its energy. It is not enough
to say, that the instrument by which repentance is
brought about, is not more powerful to-morrow than
it is to-day ; it lends a most tremendous weight to the
argument, to say further, that the subject on which
this instrument is putting forth its efficiency, will op-
pose a firmer resistance to-morrow than it does to-day.
It is this which gives a significancy so powerful to the
call of "To-day while it is to-day, harden not your
hearts ;" and to the admonition of " Knowest thou not,
O man, that the goodness of God leadeth thee to re-
pentance; but after, thy hardness and impenitent
heart treasurest up wrath against the day of wrath
and revelation of the righteous judgments of God?"
It is not said, either in the one or in the other of these
passages, that, by the present refusal, you cut your-
self off from a future invitation. The invitation may
be sounded in your hearing to the last half hour of
your earthly existence, engraved in all those charac-
ters of free and gratuitous kindness which mark the
beneficent religion of the New Testament. But the
present refusal hardens you against the power and
tenderness of the future invitation. This is the fact
in human nature to which these passages seem to
point, and it is the fact through which the argument
for immediate repentance receives such powerful aid
from the wisdom of experience. It is this which forms
the most impressive proof of the necessity of plying
the young with all the weight and all the tenderness
of earnest admonition, that the now susceptible mind
might not turn into a substance harder and more un-
complying than the rock which is broken in pieces
by the powerful application of the hammer of the
word of God.
The metal of the human soul, so to speak, is like
some material substances. If the force you lay upon
it do not break it, or dissolve it, it will beat it into
hardness. If the moral argument by which it is plied
now, do not so soften the mind as to carry and to over-
power its purposes, then, on another day, the argu-
ment may be put forth in terms as impressive — but it
falls on a harder mind, and, therefore, with a more
slender efficiency. If the threat, that ye who persist
in sin shall have to dwell with the devouring fire, and
to lie down amid everlasting burnings, do not alarm
you out of your iniquities from this very moment, then
the same threat may be again cast out, and the same
appalling circumstances of terror be thrown around it,
but it is all discharged on a soul hardened by its inure-
ment to the thunder of denunciations already uttered,
and the urgency of menacing threatenings already
poured forth without fruit and without efficacy. If
the voice of a beseeching God do not win upon you
now, and charm you out of your rebellion against him,
by the persuasive energy of kindness, then let that
voice be lifted in your hearing on some future day,
and though armed with all the power of tenderness
it ever had, how shall it find its entrance into a heart
sheathed by the operation of habit, that universal law,
in more impenetrable obstinacy? If, with the earliest
dawn of your understanding, you have been offered
the hire of the morning laborer and have refused it,
then the parable does not say that you are the person
who at the third, or sixth, or ninth, or eleventh hour,
will get the offer repeated to you. It is true, that the
offer is unto all and upon all who are within reach of
the hearing of it. But there is all the difference in
the world between the impression of a new offer, and
of an offer that has already been often heard and as
often rejected — an offer which comes upon you with
all the familiarity of a well-known sound that you
have already learned how to dispose of, and how to
shut your every feeling against the power of its gra-
cious invitations — an offer which, if discarded from
your hearts at the present moment, may come back
upon you, but which will have to maintain a more
unequal contest than before, with an impenitency ever
strengthening, and ever gathering new hardness from
each successive act of resistance. And thus it is that
the point for which we are contending is not to carry
you at some future period of your lives, but to carry
you at this moment. It is to work in you the instan-
taneous purpose of a firm and a vigorously sustained
repentance ; it is to put into you all the freshness of
an immediate resolution, and to stir you up to all the
readiness of an immediate accomplishment — it is to
give direction to the very first footstep you are now
to take, and lead you to take it as the commencement
of that holy career in which all old things are done
away, and all things become new — it is to press it
upon you, that the state of the alternative, at this mo-
ment, is "now or never" — it is to prove how fearful
the odds are against you, if now you suffer the call of
repentance to light upon your consciences, and still
keep by your determined posture of careless, and
thoughtless, and thankless unconcern about God. You
have resisted to-day, and by that resistance you have
acquired a firmer metal of resistance against the
power of every future warning that may be brought
to bear upon you. You have stood your ground
against the urgency of the most earnest admonitions,
and against the dreadfulness of the most terrifying
menaces. On that ground you have fixed yourself
more immovably than before ; and though on some
future day the same spiritual thunder be made to play
around you, it will not shake you out of the obstinacy
of your determined rebellion.
It is the universal law of habit, that the feelings are
always getting more faintly and feebly impressed by
every repetition of the cause which excited them, and
that the mind is always getting stronger in its active
resistance to the impulse of these feelings, by every
new deed of resistance which it performs ; and thus it
is, that if you refuse us now, we have no other pros-
pect before us than that your course is every day
getting more desperate and more irrecoverable, your
souls are getting more hardened, the Spirit is getting
more provoked to abandon those who have so long
persisted in their opposition to his movements. God,
who says that his Spirit shall not always strive with
man, is getting more offended. The tyranny of habit
is getting every day a firmer ascendancy over you;
Satan is getting you more helplessly involved among
his wiles and his entanglements; the world, with all
the inveteracy of those desires which are opposite to
the will of the Father, is more and more lording it
over your every affection. And what, we would ask,
what is the scene in which you are now purposing to
contest it, with all this mighty force of opposition you
are now so busy in raising up against you ? What is
the field of combat to which you are now looking
forward, as the place where you are to accomplish a
victory over all those formidable enemies whom you
are at present arming with such a weight of hostility,
as, we say, within a single hairbreadth of certainty,
you will find to be irresistible? O the bigness of such
a misleading infatuation! The proposed scene in
which this battle for eternity is to be fought, and this
victory for the crown of glory is to be won, is a death-
bed. It is when the last messenger stands by the
couch of the dying man, and shakes at him the ter-
rors of his grisly countenance, that the poor child of
infatuation thinks he is to struggle and prevail against
all his enemies; against the unrelenting tyranny of
habit — against the obstinacy of his own heart, which
he is now doing so much to harden — against the
Spirit of God who perhaps long ere now has pro-
nounced the doom upon him, " He will take his own
way, and walk in his own counsel ; I shall cease from
striving, and let him alone " — against Satan, to whom
every day of his life he has given some fresh advan-
tage over him, and who will not be willing to lose
the victim on whom he has practised so many wiles,
and plied with success so many delusions. And such
are the enemies whom you, who wretchedly calculate
on the repentance of the eleventh hour, are every day
mustering up in greater force and formidableness
against you ; and how can we think of letting you
go with any other repentance than the repentance of
the precious moment that is now passing over you,
when we look forward to the horrors of that impressive
scene on which you propose to win the prize of im-
mortality, and to contest it singlehanded and alone,
with all the weight of opposition which you have
accumulated against yourselves — a deathbed — a lan-
guid, breathless, tossing, and agitated deathbed; that
scene of feebleness, when the poor man cannot help
himself to a single mouthful — when he must have
attendants to sit around him, and watch his every
wish, and interpret his every signal, and turn him to
every posture where he may find a moment's ease,
and wipe away the cold sweat that is running over
him — and ply him with cordials for thirst, and sick-
ness, and insufferable languor. And this is the time,
when occupied with such feelings, and beset with
such agonies as these, you propose to crowd within
the compass of a few wretched days the work of
winding up the concerns of a neglected eternity!
5. But it may be said, "If repentance be what you
represent it, a thing of such mighty import, and such
impracticable performance, as a change of mind, in
what rational way can it be made the subject of a
precept or injunction? you would not call upon the
Ethiopian to change his skin — you would not call
upon the leopard to change his spots; and yet you call
upon us to change our minds. You say, " Repent ;"
and that too in the face of the undeniable doctrine, that
man is without strength for the achievement of so
mighty an enterprise. Can you tell us any plain and
practicable thing that you would have us to perform,
and that we may perform, to help on this business?"
This is the very question with which the hearers of
John the Baptist came back upon him, after he had
told them in general terms to repent, and to bring forth
fruits meet for repentance. He may not have resolved
the difficulty, but he pointed the expectation of his
countrymen to a greater than he for the solution of it.
Now that Teacher has already come, and we live
under the full and the finished splendor of his revela-
tion. O that the greatness and difficulty of the work
of repentance had the effect of shutting you up into
the faith of Christ ! Repentance is not a paltry, super-
ficial reformation. It reaches deep into the inner man,
but not too deep for the searching influences of that
Spirit which is at his giving, and which worketh
mightily in the hearts of believers. You should go
then under a sense of your difficulty to Him. Seek
to be rooted in the Savior, that you may be nourished
out of his fulness, and strengthened by his might.
The simple cry for a clean heart, and a right spirit,
which is raised from the mouth of a believer, brings
down an answer from on high which explains all the
difficulty and overcomes it. And if what we have
said of the extent and magnitude of repentance, should
have the effect to give a deeper feeling than before of
the wants under which you labor ; and shall dispose
you to seek after a closer and more habitual union
with Him who alone can supply them, then will our
call to repent have indeed fulfilled upon you the ap-
pointed end of a preparation for the Savior. But re-
collect now is your time, and now is your opportunity,
for entering on the road of preparation that leads to
heaven. We charge you to enter this road at this
moment, as you value your deliverance from hell, and
your possession of that blissful place where you shall
be for ever with the Lord — we charge you not to
parry and to delay this matter, no not for a single
hour — we call on you by all that is great in eternity —
by all that is terrifying in its horrors — by all that is
alluring in its rewards — by all that is binding in the
authority of God — by all that is condemning in the
severity of his violated law, and by all that can aggra-
vate this condemnation in the insulting contempt of
his rejected gospel ; — we call on you by one and all
of these considerations, not to hesitate, but to flee —
not to purpose a return for to-morrow, but to make
an actual return this very day — to put a decisive end
to every plan of wickedness on which you may have
entered — to cease your hands from all that is forbid-
den — to turn them to all that is required — to betake
yourselves to the appointed Mediator, and receive
through him, by the prayer of faith, such constant
supplies of the washing of regeneration and renewing
of the Holy Ghost, that, from this moment, you may
be carried forward from one degree of grace unto
another, and from a life devoted to God here, to the
elevation of a triumphant, and the joys of a blissful
eternity hereafter. T. C.
St. Andrew's, October, 1825.
CONTENTS.
The Text opened.
Doctrine I. — It is the unchangeable law of God, that
wicked men must turn or die — Proved.
God will not be so unmerciful as to damn us —
Answered.
The Use.
Who are wicked men, and what conversion is; and
how we may know whether we are wicked or con-
verted.
Applied.
Doct. II. — It is the promise of God that the wicked
shall live, if they will but turn; unfeignedly and
thoroughly turn — Proved.
Doct. III. — God taketh pleasure in men's conversion
and salvation, but not in their death or damnation.
He had rather they would turn and live, than go on
and die — Expounded — Proved.
Doct. IV. — The Lord hath confirmed it to us by his
oath, That he has no pleasure in the death of the
wicked, but rather that he turn and live; that he
may leave man no pretence to question the truth
of it.
Use. — Who is it, then, that takes pleasure in men's
sin and death? — Not God, nor ministers, nor any
good men.
Doct. V. — So earnest is God for the conversion of
sinners, that he doubleth his commands and exhor-
tations with vehemency, "Turn ye, Turn ye," —
Applied.
Some motives to obey God's call, and turn.
Doct. VI. — The Lord condescendeth to reason the
case with unconverted sinners, and ask them, Why
they will die?
A strange disputation; — 1. For the question. 2.
The disputants.
Wicked men will die or destroy themselves.
Use. — The sinner's case is certainly unreasonable.
Their seeming reasons confuted.
Question. — Why are men so unreasonable, and loath
to turn, and will destroy themselves? — Answered.
Doct. VII. — If after all this, men will not turn, it is
not God's fault that they are condemned, but their
own, even their own wilfulness. They die because
they will; that is, because they will not turn.
Use, 1. — How unfit the wicked are to charge God
with their damnation. It is not because God is
unmerciful, but because they are cruel and mer-
ciless to themselves.
Object. — We cannot convert ourselves, nor have
we Free-will — Answered.
Use, 2. — The subtlety of Satan, the deceitfulness of
sin, and the folly of sinners manifested.
Use, 3. — No wonder if the wicked would hinder the
conversion and salvation of others.
Use, 4. — Man is the greatest enemy to himself.
Man's destruction is of himself — Proved.
The heinous aggravations of self-destroying.
The concluding exhortation.
Ten Directions for those who had rather turn than
die.
THE GREAT SUCCESS WHICH ATTENDED THE
CALL WHEN FIRST PUBLISHED.
It may be proper to prefix an account of this book given
by Mr. Baxter himself, which was found in his study, after
his death, in his own words:
" I published a short treatise on conversion, entitled, A
Call to the Unconverted. The occasion of this was my
converse with Bishop Usher while I was at London; who,
approving my method and directions for Peace of Con-
science, was importunate with me to write directions
suited to the various states of Christians, and also against
particular sins. I reverenced the man, but disregarded
these persuasions, supposing I could do nothing but what
is done better already: but when he was dead, his words
went deeper to my mind, and I purposed to obey his coun-
sel; yet, so as that to the first sort of men, the ungodly,
I thought vehement persuasions meeter than directions
only, and so for such I published this little book, which
God hath blessed with unexpected success, beyond all the
rest that I have written, except The Saint's Rest. In a
little more than a year there were about twenty thousand
of them printed by my own consent, and about ten thou-
sand since, beside many thousands by stolen impressions,
which poor men stole for lucre's sake. Through God's
mercy I have information of almost whole households
converted by this small book which I set so light by; and,
as if all this in England, Scotland, and Ireland, were not
mercy enough to me, God, since I was silenced, hath sent
it over in his message to many beyond the seas ; for when
Mr. Elliot had printed all the Bible in the Indian language,
he next translated this my Call to the Unconverted, as he
wrote to us here. And yet God would make some farther
use of it ; for Mr. Stoop, the pastor of the French Church
in London, being driven hence by the displeasure of his
superiors, was pleased to translate it into French. I hope
it will not be unprofitable there; nor in Germany, where
also it has been printed."
It may be proper further to mention Dr. Bates' account
of the author, and of this useful treatise. In his sermon
at Mr. Baxter's funeral, he thus says: "His books of
practical divinity have been effectual for more conver-
sions of sinners to God than any printed in our time : and
while the church remains on earth, will be of continual
efficacy to recover lost souls. There is a vigorous pulse
in them, that keeps the reader awake and attentive. His
Call to the Unconverted, how small in bulk, but how
powerful in virtue ! Truth speaks in it with that authority
and efficacy, that it makes the reader to lay his hand upon
his heart, and find that he has a soul and a conscience,
though he lived before as if he had none. He told some
friends, that six brothers were converted by reading that
Call; and that every week he received letters of some
converted by his books. This he spake with most hum-
ble thankfulness, that God was pleased to use him as an
instrument for the salvation of souls."
A CALL
TO THE UNCONVERTED.
EZEKIEL, XXXIII. 11.
Say unto them, As I live, saith the Lord God, I have
no pleasure in the death of the wicked; but that
the wicked turn from his way and live: turn ye,
turn ye from your evil ways; for why will ye die,
O house of Israel?
It hath been the astonishing wonder of many a
man as well as me, to read in the Holy Scriptures how
few will be saved, and that the greatest part even of
those that are called, will be everlastingly shut out of
the kingdom of heaven, and be tormented with the
devils in eternal fire. Infidels believe not this when
they read it, and therefore they must feel it ; those
that do believe it are forced to cry out with Paul,
(Rom. 11. 33,) " O the depth of the riches both of the
wisdom and knowledge of God ! How unsearchable
are his judgments, and his ways past finding out !"
But nature itself doth teach us all to lay the blame
of evil works upon the doers ; and therefore when we
see any heinous thing done, a principle of justice doth
provoke us to inquire after him that did it, that the
evil of the work may return the evil of shame upon
the author. If we saw a man killed and cut in pieces
by the way, we would presently ask, Oh ! who did
this cruel deed? If the town was wilfully set on fire,
you would ask, what wicked wretch did this? So
when we read that many souls will be miserable in
hell for ever, we must needs think with ourselves, how
comes this to pass? and whose fault is it? Who is it
that is so cruel as to be the cause of such a thing as
this? and we can meet with few that will own the
guilt. It is indeed confessed by all, that Satan is the
cause; but that doth not resolve the doubt, because
he is not the principal cause. He doth not force men
to sin, but tempts them to it, and leaves it to their
own wills whether they will do it or not. He doth not
carry men to an alehouse and force open their mouths
and pour in the drink ; nor doth he hold them that
they cannot go to God's service ; nor doth he force
their hearts from holy thoughts. It lieth therefore
between God himself and the sinner ; one of them
must needs be the principal cause of all this misery,
whichever it is, for there is no other to lay it upon;
and God disclaimeth it ; he will not take it upon him ;
and the wicked disclaim it usually, and they will not
take it upon them, and this is the controversy that is
here managing in my text.
The Lord complaineth of the people ; and the peo-
ple think it is the fault of God. The same controversy
is handled, chap. 18. 25: they plainly say, " that the
way of the Lord is not equal." So here they say,
verse 19, " If our transgressions and our sins be upon
us, and we pine away in them, how shall we then
live?" As if they should say, if we must die, and be
miserable, how can we help it ? as if it were not their
fault, but God's. But God, in my text, doth clear
himself of it, and telleth them how they may help it
if they will, and persuadeth them to use the means,
and if they will not be persuaded, he lets them know
that it is the fault of themselves ; and if this will not
satisfy them, he will not forbear to punish them. It is
he that will be the Judge, and he will judge them
according to their ways ; they are no judge of him
or of themselves, as wanting authority, and wisdom,
and impartiality ; nor is it the cavilling and quarrelling
with God that shall serve their turn, or save them
from the execution of justice, at which they murmur.
The words of this verse contain, 1. God's purgation
or clearing himself from the blame of their destruction.
This he doth not by disowning his law, that the
wicked shall die, nor by disowning his judgments and
execution according to that law, or giving them any
hope that the law shall not be executed ; but by pro-
fessing that it is not their death that he takes pleasure
in, but their returning rather, that they may live ; and
this he confirmeth to them by his oath. 2. An ex-
press exhortation to the wicked to return; wherein
God doth not only command, but persuade and con-
descend also to reason the case with them; Why will
they die ? The direct end of this exhortation is, that
they may turn and live. The secondary or reserved
ends, upon supposition that this is not attained, are
these two : First, To convince them by the means
which he used, that it is not the fault of God if they
be miserable. Secondly, To convince them from
their manifest wilfulness in rejecting all his commands
and persuasions, that it is the fault of themselves, and
they die, even because they will die.
The substance of the text doth lie in these observa-
tions following : —
Doctrine 1. It is the unchangeable law of God, that
wicked men must turn or die.
Doctrine 2. It is the promise of God, that the wicked
shall live, if they will but turn.
Doctrine 3. God takes pleasure in men's conversion
and salvation, but not in their death or damnation: he
had rather they would return and live, than go on
and die.
Doctrine 4. This is a most certain truth, which
because God would not have men to question, he hath
confirmed it to them solemnly by his oath.
Doctrine 5. The Lord doth redouble his commands
and persuasions to the wicked to turn.
Doctrine 6. The Lord condescendeth to reason the
case with them ; and asketh the wicked why they
will die?
Doctrine 7. If after all this the wicked will not turn,
it is not the fault of God that they perish, but of them-
selves; their own wilfulness is the cause of their
own damnation ; they therefore die because they
will die.
Having laid the text open in these propositions, I
shall next speak somewhat of each of them in order,
though briefly.
DOCTRINE I.
It is the unchangeable law of God, that wicked
men must turn, or die.
If you will believe God, believe this : there is but
one of these two ways for every wicked man, either
conversion or damnation. I know the wicked will
hardly be persuaded either of the truth or equity of
this. No wonder if the guilty quarrel with the law.
Few men are apt to believe that which they would
not have to be true, and fewer would have that to be
true which they apprehended to be against them. But
it is not quarrelling with the law, or with the judge,
that will save the malefactor. Believing and regard-
ing the law, might have prevented his death ; but
denying and accusing it will but hasten it. If it were
not so, a hundred would bring their reason against the
law, for one that would bring his reason to the law,
and men would rather choose to give their reasons
why they should not be punished, than to hear the
commands and reasons of their governors which re-
quire them to obey. The law was not made for you to
judge, but that you might be ruled and judged by it.
But if there be any so blind as to venture to ques-
tion either the truth or the justice of this law of God,
I shall briefly give you that evidence of both which,
methinks, should satisfy a reasonable man.
And first, if you doubt whether this be the word of
God, or not, besides a hundred other texts, you may
be satisfied by these few:— Matt. 18: 3. "Verily I
say unto you, except ye be converted and become as
little children, ye cannot enter into the kingdom of
God." John 3:3. " Verily, verily, I say unto you,
except a man be born again he cannot see the kingdom
of God." 2 Cor. 5: 17. "If any man be in Christ,
he is a new creature ; old things are passed away ;
behold, all things are become new." Col. 3: 9, 10.
"Ye have put off the old man with his deeds, and
have put on the new man, which is renewed in knowledge
after the image of him that created him." Heb.
12: 14. " Without holiness no man shall see the
Lord." Rom. 8: 8, 9. "So then they that are in the
flesh cannot please God. Now if any man have not the
spirit of Christ, he is none of his." Gal. 6: 15. "For
in Christ Jesus neither circumcision availeth any
thing, nor uncircumcision, but a new creature." 1 Pet.
1: 3. "According to his abundant grace he hath begotten
us to a lively hope." Ver. 23. "Being born
again, not of corruptible seed, but of incorruptible, by
the word of God, which liveth and abideth for ever."
1 Pet. 2: 1, 2. "Wherefore laying aside all malice,
and all guile, and hypocrisies, and envies, and evil
speaking, as new born babes, desire the sincere milk
of the word, that ye may grow thereby." Psalm 9:
17. "The wicked shall be turned into hell, and all the
nations that forget God." Psalm 11: 4. "And the
Lord loveth the righteous, but the wicked his soul
hateth."
As I need not stay to open these texts which are
so plain, so I think I need not add any more of that
multitude which speak the like. If thou be a man
that dost believe the Word of God, here is already
enough to satisfy thee that the wicked must be con-
verted or condemned. You are already brought so
far, that you must either confess that this is true, or
say plainly, you will not believe the word of God.
And if once you be come to that pass, there is but
small hopes of you : look to yourself as well as you
can, for it is like you will not be long out of hell. You
would be ready to fly in the face of him that should
give you the lie ; and yet dare you give the lie to
God? But if you tell God plainly you will not believe
him, blame him not if he never warn you more, or if
he forsake you, and give you up as hopeless ; for to
what purpose should he warn you, if you will not be-
lieve him ? Should he send an angel from heaven to
you, it seems you would not believe. For an angel
can speak but the word of God ; and if an angel should
bring you any other gospel, you are not to receive it
but to hold him accursed. Gal. 1 : 8. And surely there
is no angel to be believed before the Son of God, who
came from the Father to bring us this doctrine. If He
be not to be believed, then all the angels in heaven
are not to be believed. And if you stand on these
terms with God, I shall leave you till he deal with you
in a more convincing way. God hath a voice that
will make you hear. Though he entreat you to hear
the voice of his gospel, he will make you hear the
voice of his condemning sentence, without entreaty.
We cannot make you believe against your wills ; but
God will make you feel against your wills.
But let us hear what reason you have why you will
not believe this word of God, which tells us that the
wicked must be converted, or condemned. I know
your reason ; it is because that you judge it unlikely
that God should be so unmerciful : you think it cruelty
to damn men everlastingly for so small a thing as a
sinful life. And this leads us to the second thing,
which is to justify the equity of God in his laws and
judgments.
And first, I think you will not deny that it is most
suitable to an immortal soul to be ruled by laws that
promise an immortal reward, and threaten an endless
punishment. Otherwise the law should not be suited
to the nature of the subject, who will not be fully
ruled by any lower means than the hopes or fears of
everlasting things : as it is in cases of temporal punishment,
if a law were now made that the most heinous
crimes shall be punished with a hundred years'
captivity, this might be of some efficacy, as being
equal to our lives. But, if there had been no other
penalties before the flood, when men lived eight or
nine hundred years, it would not have been sufficient,
because men would know that they might have so
many hundred years impunity afterward. So it is
in our present case.
2. I suppose that you will confess, that the promise
of an endless and inconceivable glory is not so unsuitable
to the wisdom of God, or the case of man : and
why then should you not think so of the threatening
of an endless and unspeakable misery?
3. When you find it in the word of God that so it
is, and so it will be, do ye think yourselves fit to con-
tradict this word ? Will you call your Maker to the
bar, and examine his word upon the accusation of
falsehood? Will you sit upon him and judge him by
the law of your conceits ? Are you wiser, and better,
and more righteous than he? Must the God of heaven
come to school to you to learn wisdom? Must Infinite
Wisdom learn of folly, and Infinite Goodness be cor-
rected by a sinner that cannot keep himself an hour
clean? Must the Almighty stand at the bar of a
worm? O horrid arrogancy of senseless dust! shall
ever mole, or clod, or dunghill, accuse the sun of dark-
ness, and undertake to illuminate the world ? Where
were you when the Almighty made the laws, that
he did not call you to his counsel ? Surely he made
them before you were born, without desiring your
advice ; and you came into the world too late to re-
verse them, if you could have done so great a work.
You should have stepped out of your nothingness and
have contradicted Christ when he was on earth, or
Moses before him, or have saved Adam and his sinful
progeny from the threatened death, that so there
might have been no need of Christ. And what if
God withdraw his patience and sustaining power, and
let you drop into hell while you are quarrelling with
his word, will you then believe that there is a hell ?
4. If sin be such an evil that it requireth the death
of Christ for its expiation, no wonder if it deserve our
everlasting misery.
5. And if the sin of the devils deserved an endless
torment, why not also the sin of man ?
6. And methinks you should perceive that it is not
possible for the best of men, much less for the wicked,
to be competent judges of the desert of sin. Alas ! we
are both blind and partial. You can never know fully
the desert of sin, till you fully know the evil of sin;
and you can never fully know the evil of sin, till you
fully know, 1. The excellency of the soul which it
deformeth. 2. And the excellency of holiness which
it obliterates. 3. The reason and excellency of the
law which it violates. 4. The excellency of the
glory which it despises. 5. The excellency and of-
fice of reason which it treadeth down. 6. No, nor till
you know the infinite excellency, almightiness and
holiness of that God against whom it is committed.
When you fully know all these, you shall fully know
the desert of sin besides. You know that the offender
is too partial to judge the law, or the proceeding of
his judge. We judge by feeling which blinds our
reason. We see, in common worldly things, that most
men think the cause is right which is their own, and
that all is wrong that is done against them ; and let
the most wise or just impartial friends persuade them
to the contrary, and it is all in vain. There are few
children but think the father is unmerciful, or dealeth
hardly with them if he whip them. There is scarce
the vilest wretch but thinketh the church doth wrong
him if they excommunicate him : or scarce a thief or
murderer that is hanged, but would accuse the law
and judge of cruelty, if that would serve their turn.
7. Can you think that an unholy soul is fit for
heaven? Alas, they cannot love God here, nor do him
any service which he can accept. They are contrary
to God, they loathe that which he most loveth, and
love that which he abhorreth. They are incapable
of that imperfect communion with Him which his
saints here partake of. How then can they live in
that perfect love of him, and full delight and communion
with him, which is the blessedness of heaven?
You do not accuse yourselves of unmercifulness, if
you make not your enemy your bosom counsellor ; or
if you take not your swine to bed and board with you :
no, nor if you take away his life though he never sin-
ned ; and yet you will blame the absolute Lord, the
most wise and gracious Sovereign of the world, if he
condemn the unconverted to perpetual misery.
Use. — I beseech you now, all that love your souls,
that, instead of quarrelling with God and with his
word, you will presently receive it, and use it for your
good. All you that are yet unconverted, take this as the
undoubted truth of God : — You must, ere long, be con-
verted or condemned ; there is no other way but to
turn, or die. When God, that cannot lie, hath told
you this; when you hear it from the Maker and
Judge of the world, it is time for him that hath ears,
to hear. By this time you may see what you have
to trust to. You are but dead and damned men, ex-
cept you will be converted. Should I tell you other-
wise, I should deceive you with a lie. Should I hide
this from you, I should undo you, and be guilty of your
blood, as the verses before my text assure me. — Verse
8. " When I say to the wicked man, O wicked man,
thou shalt surely die ; if thou dost not speak to warn
the wicked from his way, that wicked man shall die in
his iniquity; but his blood will I require at thine
hand." You see then, though this be a rough and
unwelcome doctrine, it is such as we must preach, and
you must hear. It is easier to hear of hell than feel
it. If your necessities did not require it, we would
not gall your tender ears with truths that seem so
harsh and grievous. Hell would not be so full, if peo-
ple were but willing to know their case, and to hear
and think of it. The reason why so few escape it, is
because they strive not to enter in at the strait gate of
conversion, and go the narrow way of holiness, while
they have time : and they strive not, because they are
not awakened to a lively feeling of the danger they
are in ; and they are not awakened because they are
loth to hear or think of it : and that is partly through
foolish tenderness and carnal self-love, and partly be-
cause they do not well believe the word that threat-
ened it. If you will not thoroughly believe this truth,
methinks the weight of it should force you to remember
it, and it should follow you, and give you no rest
till you are converted. If you had but once heard
this word by the voice of an angel, "Thou must be
converted, or condemned : turn, or die :" would it not
stick in your mind, and haunt you night and day? so
that in your sinning you would remember it, as if the
voice were still in your ears, " Turn, or die !" O hap-
py were your soul if it might thus work with you and
never be forgotten, or let you alone till it have driven
home your heart to God. But if you will cast it out
by forgetfulness or unbelief, how can it work to your
conversion and salvation? But take this with you to
your sorrow, though you may put this out of your
mind, you cannot put it out of the Bible, but there
it will stand as a sealed truth, which you shall expe-
rimentally know for ever, that there is no other way
but, "turn, or die."
What is the matter then that the hearts of sinners
are not pierced with such a weighty truth ? A
man would think now, that every unconverted soul
that hears these words should be pricked to the heart,
and think with himself, ' This is my own case,' and
never be quiet till he found himself converted. Believe
it, this drowsy careless temper will not last long. Conversion
and condemnation are both of them awakening
things, and one of them will make you feel ere
long. I can foretell it as truly as if I saw it with my
eyes, that either grace or hell will shortly bring these
matters to the quick, and make you say, " What have
I done? what a foolish wicked course have I taken?"
The scornful and the stupid state of sinners will last
but a little while : as soon as they either turn or die,
the presumptuous dream will be at an end, and then
their wits and feeling will return.
But I foresee there are two things that are likely to
harden the unconverted, and make me lose all my
labor, except they can be taken out of the way ; and
that is the misunderstanding on those two words, the
wicked and turn. Some will think to themselves,
' It is true, the wicked must turn or die ; but what is
that to me, I am not wicked ; though I am a sinner,
all men are.' Others will think, ' It is true that we
must turn from our evil ways, but I am turned long
ago ; I hope this is not now to do.' And thus while
wicked men think they are not wicked, but are al-
ready converted, we lose all our labor in persuading
them to turn. I shall therefore, before I go any further,
tell you here who are meant by the wicked,
and who they are that must turn or die; and also
what is meant by turning, and who they are that are
truly converted. And this I have purposely reserved
for this place, preferring the method that fits my end.
And here you may observe, that in the sense of the
text, a wicked man and a converted man are contra-
ries. No man is a wicked man that is converted ; and
no man is a converted man that is wicked ; so that to
be a wicked man and to be an unconverted man, is
all one ; and therefore in opening one, we shall open
both.
Before I can tell you what either wickedness or conversion
is, I must go to the bottom, and fetch up the
matter from the beginning.
It pleased the great Creator of the world to make
three sorts of living creatures. Angels he made pure
spirits without flesh, and therefore he made them only
for heaven, and not to dwell on earth. Brutes were
made flesh, without immortal souls, and therefore
they were made only for earth, and not for heaven.
Man is of a middle nature, between both, as partak-
ing of both flesh and spirit, and therefore he was made
both for heaven and earth. But as his flesh is made
to be but a servant to his spirit, so is he made for earth
but as his passage or way to heaven, and not that this
should be his home or happiness. The blessed state
that man was made for, was to behold the glorious
majesty of the Lord, and to praise him among his
Holy Angels, and to love him, and to be filled with
his love for ever. And as this was the end that man
was made for, so God did give him means that were
fitted to the attaining of it. These means were prin-
cipally two : First, the right inclination and disposi-
tion of the mind of man. Secondly, The right order-
ing of his life and practice. For the first, God suited
the disposition of man unto his end, giving him such
knowledge of God as was fit for his present state, and
a heart disposed and inclined to God in holy love. But
yet he did not fix or confirm him in this condition, but,
having made him a free agent, he left him in the
hands of his own free will. For the second, God did
that which belonged to him ; that is, he gave him a
perfect law, required him to continue in the love of
God, and perfectly to obey him. By the wilful breach
of this law, man did not only forfeit his hopes of ever-
lasting life, but also turned his heart from God, and
fixed it on these lower fleshly things, and hereby blot-
ted out the spiritual image of God from his soul ; so
that man did both fall short of the glory of God, which
was his end, and put himself out of the way by which
he should have attained it, and this both as to the
frame of his heart, and of his life. The holy inclina-
tion and love of his soul to God, he lost, and instead
of it he contracted an inclination and love to the plea-
sing of his flesh, or carnal self, by earthly things ;
growing strange to God and acquainted with the
creature. And the course of this life was suited to
the bent and inclination of his heart ; he lived to his
carnal self, and not to God ; he sought the creature,
for the pleasing of his flesh, instead of seeking to please
the Lord. With this nature or corrupt inclination,
we are all now born into the world ; " for who can
bring a clean thing out of an unclean ?" Job, 14 : 4.
As a lion hath a fierce and cruel nature before he doth
devour; and an adder hath a venomous nature before
she sting, so in our infancy we have those sinful na-
tures or inclinations, before we think, or speak, or do
amiss. And hence springeth all the sin of our lives;
and not only so, but when God hath, of his mercy, pro-
vided us a remedy, even the Lord Jesus Christ, to be
the Savior of our souls, and bring us back to God
again, we naturally love our present state, and are
loth to be brought out of it, and therefore are set
against the means of our recovery: and though cus-
tom hath taught us to thank Christ for his good-will,
yet carnal self persuades us to refuse his remedies, and
to desire to be excused when we are commanded to
take the medicines which he offers, and are called to
forsake all and follow him to God and glory.
I pray you read over this leaf again, and mark it ;
for in these few words you have a true description of
our natural state, and consequently of wicked man ;
for every man that is in the state of corrupted nature
is a wicked man, and in a state of death.
By this also you are prepared to understand what
it is to be converted : to which end you must further
know, that the mercy of God, not willing that man
should perish in his sin, provided a remedy, by caus-
ing his Son to take our nature, and being, in one per-
son, God and man, to become a mediator between
God and man ; and by dying for our sins on the cross,
to ransom us from the curse of God and the power of
the devil. And having thus redeemed us, the Father
hath delivered us into his hands as his own. Here-
upon the Father and the Mediator do make a new
law and covenant for man, not like the first, which
gave life to none but the perfectly obedient, and con-
demned man for every sin ; but Christ hath made a
law of grace, or a promise of pardon and everlasting
life to all that, by true repentance, and by faith in
Christ, are converted unto God ; like an act of oblivion,
which is made by a prince to a company of rebels, on
condition they will lay down their arms and come in
and be loyal subjects for the time to come.
But, because the Lord knoweth that the heart of
man is grown so wicked, that, for all this, men will
not accept of the remedy if they be left to themselves,
therefore the Holy Ghost hath undertaken it as his
office to inspire the Apostles, and seal the Scriptures
by miracles and wonders, and to illuminate and con-
vert the souls of the elect.
So by this much you see, that as there are three
persons in the Trinity, the Father, the Son, and the
Holy Ghost, so each of these persons have their several
works, which are eminently ascribed to them.
The Father's works were, to create us, to rule us,
as his rational creatures, by the law of nature, and
judge us thereby; and in mercy to provide us a Re-
deemer when we were lost ; and to send his Son, and
accept his ransom.
The works of the Son for us were these : to ransom
and redeem us by his suffering and righteousness; to
give out the promise or law of grace, and rule and
judge the world as their Redeemer, on terms of grace :
and to make intercession for us, that the benefits of his
death may be communicated ; and to send the Holy
Ghost, which the Father also doth by the Son.
The works of the Holy Ghost, for us, are these : to
indite the Holy Scriptures, by inspiring and guiding
the Apostles, and sealing the word, by his miraculous
gifts and works, and the illuminating and exciting the
ordinary ministers of the gospel, and so enabling them
and helping them to publish that word; and by the
same word illuminating and converting the souls of
men. So that as you could not have been reasonable
creatures, if the Father had not created you, nor have
had any access to God, if the Son had not died, so
neither can you have a part in Christ, or be saved,
except the Holy Ghost do sanctify you.
So that by this time you may see the several causes
of this work. The Father sendeth the Son : the Son
redeemeth us and maketh the promise of grace : the
Holy Ghost inditeth and sealeth this Gospel: the
Apostles are the secretaries of the Spirit to write it:
the preachers of the Gospel to proclaim it, and per-
suade men to open it : and the Holy Ghost doth make
their preaching effectual, by opening the hearts of
men to entertain it. And all this to repair the image
of God upon the soul, and to set the heart upon God
again, and take it off the creature and carnal self to
which it is revolted, and so to turn the current of the
life into a heavenly course, which before was earthly ;
and through this, embracing Christ by faith, who is
the Physician of the soul.
By what I have said, you may see what it is to be
wicked, and what it is to be converted ; which, I think,
will yet be plainer to you, if I describe them as consisting
of their several parts. And for the first, a wicked
man may be known by these three things :
First, He is one who placeth his chief affections on
earth, and loveth the creature more than God, and
his fleshly prosperity above the heavenly felicity. He
savoreth the things of the flesh, but neither discerneth
nor savoreth the things of the Spirit; though he
will say, that heaven is better than earth, yet he doth
not really so esteem it to himself. If he might be sure
of earth, he would let go heaven, and had rather stay
here than be removed thither. A life of perfect holiness
in the sight of God, and in his love and praises
for ever in heaven, doth not find such liking with his
heart as a life of health, and wealth, and honor here
upon earth. And though he falsely profess that he
loves God above all, yet indeed he never felt the power
of divine love within him, but his mind is more set on
worldly or fleshly pleasures than on God. In a word,
whoever loves earth above heaven, and fleshly pros-
perity more than God, is a wicked unconverted man.
On the other hand, a converted man is illuminated
to discern the loveliness of God, and so far believeth
the glory that is to be had with God, that his heart
is taken up with it and set more upon it than any
thing in this world. He had rather see the face of
God, and live in his everlasting love and praises, than
have all the wealth or pleasures of the world. He
seeth that all things else are vanity, and nothing but
God can fill the soul ; and therefore let the world go
which way it will, he layeth up his treasures and
hopes in heaven, and for that he is resolved to let go
all. As the fire doth mount upward, and the needle
that is touched with the loadstone still turns to the
north, so the converted soul is inclined unto God. No-
thing else can satisfy him : nor can he find any con-
tent and rest but in his love. In a word, all that are
converted do esteem and love God better than all the
world, and the heavenly felicity is dearer to them
than their fleshly prosperity. The proof of what I
have said you may find in these places of Scriptures:
Phil. 3: 18, 21. Matt. 6 : 19, 20, 21. Col. 3 : 1, 4.
Rom. 8 : 5, 9, 18, 23. Psalm 73 : 25, 26.
Secondly, A wicked man is one that makes it the
principal business of his life to prosper in the world,
and attain his fleshly ends. And though he may read,
and hear, and do much in the outward duties of religion,
and forbear disgraceful sins, yet this is all but
by-the-by, and he never makes it the principal busi-
ness of his life to please God, and attain everlast-
ing glory, and puts off God with the leavings of the
world, and gives him no more service than the flesh
can spare, for he will not part with all for heaven.
On the contrary, a converted man is one that makes
it the principal care and business of his life to please
God, and to be saved, and takes all the blessings of
this life but as accommodations in his journey toward
another life, and useth the creature in subordination
to God ; he loves a holy life, and longs to be more
holy ; he hath no sin but what he hateth, and longeth,
and prayeth, and striveth to be rid of. The drift and
bent of his life is for God, and if he sin, it is contrary
to the very bent of his heart and life ; and therefore he
riseth again and lamenteth it, and dares not wilfully
live in any known sin. There is nothing in this world
so dear to him but he can give it up to God, and forsake
it for him and the hopes of glory. All this you
may see in Col. 3 : 1, 5. Matt. 6 : 20, 33. Luke, 18 :
22, 23, 29. Luke, 14 : 18, 24, 26, 27. Rom. 8 : 13.
Gal. 5 : 24. Luke 12 : 21, &c.
Thirdly, The soul of a wicked man did never truly
discern and relish the mystery of redemption, nor
thankfully entertain an offered Savior, nor is he taken
up with the love of the Redeemer, nor willing to be
ruled by him as the Physician of his soul, that he may
be saved from the guilt and power of his sins, and re-
covered to God ; but his heart is insensible of this un-
speakable benefit, and is quite against the healing
means by which he should be recovered. Though he
may be willing to be outwardly religious, yet he never
resigns up his soul to Christ, and to the motions and
conduct of his word and Spirit.
On the contrary, the converted soul having felt
himself undone by sin, and perceiving that he hath
lost his peace with God and hopes of heaven, and is in
danger of everlasting misery, doth thankfully entertain
the tidings of redemption, and believing in the
Lord Jesus as his only Savior, resigns himself up to
him for wisdom, righteousness, sanctification, and re-
demption. He takes Christ as the life of his soul, and
lives by him, and uses him as a salve for every sore,
admiring the wisdom and love of God in this wonder-
ful work of man's redemption. In a word, Christ doth
even dwell in his heart by faith, and the life that he
now liveth, is by the faith of the Son of God, that
loved him, and gave himself for him ; yea, it is not so
much he that liveth, as Christ in him. For these,
see John, 1 : 11, 12; and 3 : 19, 20. Rom. 8 : 9. Phil.
3 : 7, 10. Gal. 2 : 20. John, 15 : 2, 3, 4. 1 Cor. 1 : 20.
2:2.
You see now, in plain terms from the Word of God,
who are the wicked and who are the converted. Igno-
rant people think, that if a man be no swearer, nor
curser, nor railer, nor drunkard, nor fornicator, nor ex-
tortioner, nor wrong any body in his dealings, and if
he come to church and say his prayers, he cannot be
a wicked man. Or if a man that hath been guilty
of drunkenness, swearing, or gaming, or the like vices,
do but forbear them for the time to come, they think
that this is a converted man. Others think if a man
that hath been an enemy, and scorner at godliness,
do but approve it, and be hated for it by the wicked,
as the godly are, that this must needs be a converted
man. And some are so foolish as to think that they
are converted by taking up some new opinion, and
falling into some dividing party. And some think,
if they have but been affrighted by the fears of hell,
and had convictions of conscience, and thereupon
have purposed and promised amendment, and take up
a life of civil behavior and outward religion, that this
Doct. 1. THE UNCONVERTED. 51
must needs be true conversion. And these are the
poor deluded souls that are like to lose the benefit of
all our persuasions ; and when they hear that the
wicked must turn or die, they think that this is not
spoken to them, for they are not wicked, but are turned
already. And therefore it is that Christ told some of
the rulers of the Jews who were greater and more
civil than the common people, that " publicans and
harlots go into the kingdom of Christ before them."
Matt. 21 : 31. Not that a harlot, or gross sinner can
be saved without conversion ; but because it was easier
to make these gross sinners perceive their sin and mi-
sery, and the necessity of a change, than the more
civil sort, who delude themselves by thinking that
they are converted already, when they are not.
O sirs, conversion is another kind of work than most
are aware of. It is not a small matter to bring an
earthly mind to heaven, and to show man the amiable
excellence of God, till he be taken up in such love to
him that can never be quenched ; to break the heart
for sin, and make him fly for refuge to Christ, and
thankfully embrace him as the life of his soul ; to have
the very drift and bent of the heart and life changed ;
so that a man renounceth that which he took for his
felicity, and placeth his felicity where he never did
before, and lives not to the same end, and drives not
on the same design in the world, as he formerly did.
In a word, he that is in Christ is a " new creature :
old things are passed away : behold, all things are
become new." 2 Cor. 5 : 17. He hath a new under-
standing, a new will and resolution, new sorrows, and
desires, and love, and delight; new thoughts, new
speeches, new company, (if possible,) and a new conversation.
Sin, that before was a jesting matter with
him, is now so odious and terrible to him that he flies
from it as from death. The world, that was so lovely
in his eyes, doth now appear but as vanity and vexa-
tion : God, that was before neglected, is now the only
happiness of his soul : before he was forgotten, and
every lust preferred before him, but now he is set next
the heart, and all things must give place to him ; the
heart is taken u
gitextract_vbmxaw27/
├── .Rbuildignore
├── .gitignore
├── .travis.yml
├── CONDUCT.md
├── DESCRIPTION
├── LICENSE
├── Makefile
├── NAMESPACE
├── NEWS.md
├── R/
│ ├── RcppExports.R
│ ├── TextReuseCorpus.R
│ ├── TextReuseTextDocument.R
│ ├── align_local.R
│ ├── conversion-functions.R
│ ├── filenames.R
│ ├── lsh.R
│ ├── lsh_candidates.R
│ ├── lsh_compare.R
│ ├── lsh_probability.R
│ ├── lsh_query.R
│ ├── lsh_subset.R
│ ├── minhash.R
│ ├── pairwise_candidates.R
│ ├── pairwise_compare.R
│ ├── parallel.R
│ ├── rehash.R
│ ├── similarity.R
│ ├── textreuse-package.r
│ ├── token_index.R
│ ├── tokenize.R
│ ├── tokenizers.R
│ ├── utils.R
│ └── wordcount.R
├── README.Rmd
├── README.md
├── _pkgdown.yml
├── appveyor.yml
├── cran-comments.md
├── inst/
│ └── extdata/
│ ├── ats/
│ │ ├── calltounconv00baxt.txt
│ │ ├── gospeltruth00whit.txt
│ │ ├── lifeofrevrichard00baxt.txt
│ │ ├── memoirjamesbrai00ricegoog.txt
│ │ ├── practicalthought00nev.txt
│ │ ├── remember00palm.txt
│ │ ├── remembermeorholy00palm.txt
│ │ └── thoughtsonpopery00nevi.txt
│ └── legal/
│ ├── ca1851-match.txt
│ ├── ca1851-nomatch.txt
│ └── ny1850-match.txt
├── man/
│ ├── TextReuseCorpus.Rd
│ ├── TextReuseTextDocument-accessors.Rd
│ ├── TextReuseTextDocument.Rd
│ ├── align_local.Rd
│ ├── as.matrix.textreuse_candidates.Rd
│ ├── filenames.Rd
│ ├── hash_string.Rd
│ ├── lsh.Rd
│ ├── lsh_add.Rd
│ ├── lsh_candidates.Rd
│ ├── lsh_compare.Rd
│ ├── lsh_probability.Rd
│ ├── lsh_query.Rd
│ ├── lsh_subset.Rd
│ ├── minhash_generator.Rd
│ ├── pairwise_candidates.Rd
│ ├── pairwise_compare.Rd
│ ├── reexports.Rd
│ ├── rehash.Rd
│ ├── similarity-functions.Rd
│ ├── textreuse-package.Rd
│ ├── token_index.Rd
│ ├── token_index_candidates.Rd
│ ├── tokenize.Rd
│ ├── tokenizers.Rd
│ └── wordcount.Rd
├── pkgdown/
│ └── extra.css
├── src/
│ ├── RcppExports.cpp
│ ├── hash_string.cpp
│ ├── shingle_ngrams.cpp
│ ├── skip_ngrams.cpp
│ └── sw_matrix.cpp
├── tests/
│ ├── testthat/
│ │ ├── newman.txt
│ │ ├── test-TextReuseCorpus.R
│ │ ├── test-TextReuseTextDocument.R
│ │ ├── test-alignment.R
│ │ ├── test-candidate_pairs.R
│ │ ├── test-filenames.R
│ │ ├── test-hashing.R
│ │ ├── test-jaccard.R
│ │ ├── test-lsh.R
│ │ ├── test-minhash.R
│ │ ├── test-pairwise_cf.R
│ │ ├── test-ratio_of_matches.R
│ │ ├── test-token_index.R
│ │ ├── test-tokenizers.R
│ │ ├── test-utils.R
│ │ └── test-wordcount.R
│ └── testthat.R
└── vignettes/
├── textreuse-alignment.Rmd
├── textreuse-introduction.Rmd
├── textreuse-minhash.Rmd
└── textreuse-pairwise.Rmd
SYMBOL INDEX (9 symbols across 5 files)
FILE: src/RcppExports.cpp
function RcppExport (line 15) | RcppExport SEXP _textreuse_hash_string(SEXP xSEXP) {
function RcppExport (line 26) | RcppExport SEXP _textreuse_shingle_ngrams(SEXP wordsSEXP, SEXP nSEXP) {
function RcppExport (line 38) | RcppExport SEXP _textreuse_skip_ngrams(SEXP wordsSEXP, SEXP nSEXP, SEXP ...
function RcppExport (line 51) | RcppExport SEXP _textreuse_sw_matrix(SEXP mSEXP, SEXP aSEXP, SEXP bSEXP,...
function RcppExport (line 75) | RcppExport void R_init_textreuse(DllInfo *dll) {
FILE: src/hash_string.cpp
function IntegerVector (line 13) | IntegerVector hash_string(std::vector < std::string > x) {
FILE: src/shingle_ngrams.cpp
function CharacterVector (line 6) | CharacterVector shingle_ngrams(CharacterVector words, int n) {
FILE: src/skip_ngrams.cpp
function CharacterVector (line 8) | CharacterVector skip_ngrams(CharacterVector words, int n, int k) {
FILE: src/sw_matrix.cpp
function IntegerMatrix (line 7) | IntegerMatrix sw_matrix(IntegerMatrix m, CharacterVector a, CharacterVec...
Condensed preview: 102 files, each showing path, character count, and a content snippet (3,293K chars in full).
[
{
"path": ".Rbuildignore",
"chars": 204,
"preview": "^.*\\.Rproj$\n^\\.Rproj\\.user$\n^\\.git$\n^\\.r-lib$\n^README\\.Rmd$\n^README-*\\.png$\n^data-raw$\n^\\.travis\\.yml$\nwordnet\n^appveyor"
},
{
"path": ".gitignore",
"chars": 82,
"preview": ".Rproj\n*.Rproj\n.Rproj.user\n.Rhistory\n.RData\n.Ruserdata\nsrc/*.o\nsrc/*.so\nsrc/*.dll\n"
},
{
"path": ".travis.yml",
"chars": 391,
"preview": "language: r\nr:\n - oldrel\n - release\n - devel\nsudo: false\ncache: packages\n\nafter_success:\n - Rscript -e 'covr::codeco"
},
{
"path": "CONDUCT.md",
"chars": 1387,
"preview": "# Contributor Code of Conduct\n\nAs contributors and maintainers of this project, we pledge to respect all people who \ncon"
},
{
"path": "DESCRIPTION",
"chars": 1394,
"preview": "Package: textreuse\nType: Package\nTitle: Detect Text Reuse and Document Similarity\nVersion: 1.0.1\nDate: 2026-05-06\nAuthor"
},
{
"path": "LICENSE",
"chars": 60,
"preview": "YEAR: 2026\nCOPYRIGHT HOLDER: Yaoxiang Li and Lincoln Mullen\n"
},
{
"path": "Makefile",
"chars": 216,
"preview": ".PHONY : docs deploy-docs\n\ndocs :\n\tRscript -e \"pkgdown::clean_site(); pkgdown::build_site(run_dont_run = TRUE)\"\n\ndeploy-"
},
{
"path": "NAMESPACE",
"chars": 3227,
"preview": "# Generated by roxygen2: do not edit by hand\n\nS3method(\"[\",TextReuseCorpus)\nS3method(\"[[\",TextReuseCorpus)\nS3method(\"con"
},
{
"path": "NEWS.md",
"chars": 3346,
"preview": "# textreuse 1.0.1\n\nThis release brings together several years of maintenance and feature work to\nmake textreuse easier t"
},
{
"path": "R/RcppExports.R",
"chars": 750,
"preview": "# Generated by using Rcpp::compileAttributes() -> do not edit by hand\n# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD"
},
{
"path": "R/TextReuseCorpus.R",
"chars": 8945,
"preview": "#' TextReuseCorpus\n#'\n#' This is the constructor function for a \\code{TextReuseCorpus}, modeled on the\n#' virtual S3 cla"
},
{
"path": "R/TextReuseTextDocument.R",
"chars": 11127,
"preview": "#' TextReuseTextDocument\n#'\n#' This is the constructor function for \\code{TextReuseTextDocument} objects.\n#' This class "
},
{
"path": "R/align_local.R",
"chars": 10565,
"preview": "#' Local alignment of natural language texts\n#'\n#' This function takes two texts, either as strings or as\n#' \\code{TextR"
},
{
"path": "R/conversion-functions.R",
"chars": 1557,
"preview": "#' Convert candidates data frames to other formats\n#'\n#' These functions convert a \\code{textreuse_candidates} object to"
},
{
"path": "R/filenames.R",
"chars": 820,
"preview": "#' Filenames from paths\n#'\n#' This function takes a character vector of paths and returns just the file\n#' name, by defa"
},
{
"path": "R/lsh.R",
"chars": 6690,
"preview": "#'Locality sensitive hashing for minhash\n#'\n#'Locality sensitive hashing (LSH) discovers potential matches among a corpu"
},
{
"path": "R/lsh_candidates.R",
"chars": 1294,
"preview": "#' Candidate pairs from LSH comparisons\n#'\n#' Given a data frame of LSH buckets returned from \\code{\\link{lsh}}, this\n#'"
},
{
"path": "R/lsh_compare.R",
"chars": 2656,
"preview": "#' Compare candidates identified by LSH\n#'\n#' The \\code{\\link{lsh_candidates}} only identifies potential matches, but\n#'"
},
{
"path": "R/lsh_probability.R",
"chars": 2222,
"preview": "#' Probability that a candidate pair will be detected with LSH\n#'\n#' Functions to help choose the correct parameters for"
},
{
"path": "R/lsh_query.R",
"chars": 1397,
"preview": "#' Query a LSH cache for matches to a single document\n#'\n#' This function retrieves the matches for a single document fr"
},
{
"path": "R/lsh_subset.R",
"chars": 828,
"preview": "#' List of all candidates in a corpus\n#'\n#' @param candidates A data frame of candidate pairs from\n#' \\code{\\link{lsh_"
},
{
"path": "R/minhash.R",
"chars": 2597,
"preview": "#' Generate a minhash function\n#'\n#' A minhash value is calculated by hashing the strings in a character vector to\n#' in"
},
{
"path": "R/pairwise_candidates.R",
"chars": 1406,
"preview": "#' Candidate pairs from pairwise comparisons\n#'\n#' Converts a comparison matrix generated by \\code{\\link{pairwise_compar"
},
{
"path": "R/pairwise_compare.R",
"chars": 2886,
"preview": "#' Pairwise comparisons among documents in a corpus\n#'\n#' Given a \\code{\\link{TextReuseCorpus}} containing documents of "
},
{
"path": "R/parallel.R",
"chars": 413,
"preview": "# Check if the option `mc.cores` has been set. If it has, return `mclapply`\n# instead of `lapply`. But in no circumstanc"
},
{
"path": "R/rehash.R",
"chars": 2119,
"preview": "#' Recompute the hashes for a document or corpus\n#'\n#' Given a \\code{\\link{TextReuseTextDocument}} or a\n#' \\code{\\link{T"
},
{
"path": "R/similarity.R",
"chars": 6157,
"preview": "#' Measure similarity/dissimilarity in documents\n#'\n#' A set of functions which take two sets or bag of words and measur"
},
{
"path": "R/textreuse-package.r",
"chars": 2099,
"preview": "#' @details\n#' The best place to begin with this package in the introductory vignette.\n#'\n#' \\code{vignette(\"textreuse-i"
},
{
"path": "R/token_index.R",
"chars": 2765,
"preview": "#' Build an index of tokens and documents\n#'\n#' Build an inverted index from tokens to the documents that contain them. "
},
{
"path": "R/tokenize.R",
"chars": 3216,
"preview": "#' Recompute the tokens for a document or corpus\n#'\n#' Given a \\code{\\link{TextReuseTextDocument}} or a\n#' \\code{\\link{T"
},
{
"path": "R/tokenizers.R",
"chars": 2132,
"preview": "#' Split texts into tokens\n#'\n#' These functions each turn a text into tokens. The \\code{tokenize_ngrams}\n#' functions r"
},
{
"path": "R/utils.R",
"chars": 2720,
"preview": "# Take results of readLines and turn it into a character vector of length 1\nas_string <- function(x) {\n x %>%\n str_c"
},
{
"path": "R/wordcount.R",
"chars": 708,
"preview": "#' Count words\n#'\n#' This function counts words in a text, for example, a character vector, a\n#' \\code{\\link{TextReuseTe"
},
{
"path": "README.Rmd",
"chars": 7396,
"preview": "---\noutput: md_document\ntitle: Detect Text Reuse and Document Similarity\n---\n\n<!-- README.md is generated from README.Rm"
},
{
"path": "README.md",
"chars": 9205,
"preview": "<!-- README.md is generated from README.Rmd. Please edit that file -->\n\n# textreuse\n\n[ -> do not edit by hand\n// Generator token: 10BE3573-1514-4C36-9D1C-5A225"
},
{
"path": "src/hash_string.cpp",
"chars": 654,
"preview": "#include <Rcpp.h>\n#include <boost/functional/hash.hpp>\nusing namespace Rcpp;\n\n//' Hash a string to an integer\n//' @param"
},
{
"path": "src/shingle_ngrams.cpp",
"chars": 618,
"preview": "#include <Rcpp.h>\nusing namespace Rcpp;\n\n// Create shingled n-grams\n// [[Rcpp::export]]\nCharacterVector shingle_ngrams(C"
},
{
"path": "src/skip_ngrams.cpp",
"chars": 1496,
"preview": "#include <Rcpp.h>\nusing namespace Rcpp;\n\n// Skip n-grams\n// @param n = number of words in an n-gram\n// @param k = max nu"
},
{
"path": "src/sw_matrix.cpp",
"chars": 908,
"preview": "#include <progress.hpp>\n#include <Rcpp.h>\nusing namespace Rcpp;\n\n// [[Rcpp::depends(RcppProgress)]]\n// [[Rcpp::export]]\n"
},
{
"path": "tests/testthat/newman.txt",
"chars": 2015,
"preview": "And now that I am about to trace, as far as I can, the course of that \ngreat revolution of mind, which led me to leave m"
},
{
"path": "tests/testthat/test-TextReuseCorpus.R",
"chars": 5539,
"preview": "context(\"TextReuseCorpus\")\n\nny <- system.file(\"extdata/legal/ny1850-match.txt\", package = \"textreuse\")\nca1 <- system.fil"
},
{
"path": "tests/testthat/test-TextReuseTextDocument.R",
"chars": 4411,
"preview": "context(\"TextReuseTextDocument\")\n\ndoc <- TextReuseTextDocument(file = \"newman.txt\", keep_tokens = TRUE)\ntest_meta <- lis"
},
{
"path": "tests/testthat/test-alignment.R",
"chars": 1635,
"preview": "context(\"Alignment\")\n\ntest_that(\"returns correct results with edits properly marked\", {\n a <- \"How can we tell if this "
},
{
"path": "tests/testthat/test-candidate_pairs.R",
"chars": 1075,
"preview": "context(\"Candidate pairs\")\ndir <- system.file(\"extdata/legal\", package = \"textreuse\")\ncorpus <- TextReuseCorpus(dir = di"
},
{
"path": "tests/testthat/test-filenames.R",
"chars": 338,
"preview": "context(\"Filenames\")\n\npaths <- c(\"corpus/one.txt\", \"deep/corpus/two.R\", \"~/home/three.markdown\",\n \"/corpus/fou"
},
{
"path": "tests/testthat/test-hashing.R",
"chars": 562,
"preview": "context(\"Hashing\")\n\nlines <- system.file(\"extdata/legal/ny1850-match.txt\", package = \"textreuse\") %>%\n readLines()\nngr"
},
{
"path": "tests/testthat/test-jaccard.R",
"chars": 1186,
"preview": "context(\"Jaccard coefficients\")\n\ntest_that(\"calculates the similarity coefficient correctly\", {\n expect_equal(jaccard_s"
},
{
"path": "tests/testthat/test-lsh.R",
"chars": 2887,
"preview": "context(\"LSH\")\n\ndir <- system.file(\"extdata/legal\", package = \"textreuse\")\nminhash <- minhash_generator(200, seed = 9228"
},
{
"path": "tests/testthat/test-minhash.R",
"chars": 1088,
"preview": "context(\"Minhash\")\n\nmhash <- minhash_generator()\nfile <- system.file(\"extdata/legal/ny1850-match.txt\", package = \"textre"
},
{
"path": "tests/testthat/test-pairwise_cf.R",
"chars": 756,
"preview": "context(\"Pairwise comparison\")\n\ndir <- system.file(\"extdata/legal\", package = \"textreuse\")\ncorpus <- TextReuseCorpus(dir"
},
{
"path": "tests/testthat/test-ratio_of_matches.R",
"chars": 1771,
"preview": "context(\"Ratio of matches\")\n\ntest_that(\"calculates the value correctly\", {\n expect_equal(ratio_of_matches(1:4, 3:5), 2/"
},
{
"path": "tests/testthat/test-token_index.R",
"chars": 1018,
"preview": "context(\"Token index\")\n\ntexts <- c(a = \"one two three four\",\n b = \"one two three five\",\n c = \"six se"
},
{
"path": "tests/testthat/test-tokenizers.R",
"chars": 3035,
"preview": "context(\"Tokenizers\")\n\nsentence <- \"This is a sentence which has a number of words in it; also some\n tricky "
},
{
"path": "tests/testthat/test-utils.R",
"chars": 955,
"preview": "context(\"Utils\")\n\ntest_that(\"as_string returns the correct type\", {\n s <- as_string(c(\"First\", \"Second\"))\n expect_is(s"
},
{
"path": "tests/testthat/test-wordcount.R",
"chars": 517,
"preview": "context(\"Word counts\")\n\ndir <- system.file(\"extdata/legal\", package = \"textreuse\")\ncorpus <- TextReuseCorpus(dir = dir)\n"
},
{
"path": "tests/testthat.R",
"chars": 62,
"preview": "library(testthat)\nlibrary(textreuse)\n\ntest_check(\"textreuse\")\n"
},
{
"path": "vignettes/textreuse-alignment.Rmd",
"chars": 2126,
"preview": "---\ntitle: \"Text Alignment\"\nauthor:\n - \"Lincoln Mullen\"\n - \"Yaoxiang Li\"\ndate: \"`r Sys.Date()`\"\noutput: rmarkdown::htm"
},
{
"path": "vignettes/textreuse-introduction.Rmd",
"chars": 8488,
"preview": "---\ntitle: \"Introduction to the textreuse package\"\nauthor:\n - \"Lincoln Mullen\"\n - \"Yaoxiang Li\"\ndate: \"`r Sys.Date()`\""
},
{
"path": "vignettes/textreuse-minhash.Rmd",
"chars": 6516,
"preview": "---\ntitle: \"Minhash and locality-sensitive hashing\"\nauthor:\n - \"Lincoln Mullen\"\n - \"Yaoxiang Li\"\ndate: \"`r Sys.Date()`"
},
{
"path": "vignettes/textreuse-pairwise.Rmd",
"chars": 2248,
"preview": "---\ntitle: \"Pairwise comparisons for document similarity\"\nauthor:\n - \"Lincoln Mullen\"\n - \"Yaoxiang Li\"\ndate: \"`r Sys.D"
}
]
About this extraction
This page contains the full source code of the ropensci/textreuse GitHub repository, extracted and formatted as plain text. The extraction includes 102 files (3.0 MB), approximately 801.6k tokens, and a symbol index with 9 extracted functions, classes, methods, constants, and types.