Repository: ropensci/tesseract
Branch: master
Commit: eb79775ec4fd
Files: 32
Total size: 60.3 KB

Directory structure:
gitextract_z8kqhkl4/

├── .Rbuildignore
├── .github/
│   ├── .gitignore
│   └── workflows/
│       └── R-CMD-check.yaml
├── .gitignore
├── DESCRIPTION
├── NAMESPACE
├── NEWS
├── R/
│   ├── RcppExports.R
│   ├── ocr.R
│   ├── onload.R
│   ├── tessdata.R
│   └── tesseract.R
├── README.md
├── cleanup
├── configure
├── configure.win
├── inst/
│   ├── AUTHORS
│   ├── COPYRIGHT
│   └── WORDLIST
├── man/
│   ├── ocr.Rd
│   ├── tessdata.Rd
│   └── tesseract.Rd
├── src/
│   ├── Makevars.in
│   ├── Makevars.win
│   ├── RcppExports.cpp
│   ├── tesseract.cpp
│   └── tesseract_types.h
├── tesseract.Rproj
├── tests/
│   └── spelling.R
├── tools/
│   ├── test.cpp
│   └── winlibs.R
└── vignettes/
    └── intro.Rmd

================================================
FILE CONTENTS
================================================

================================================
FILE: .Rbuildignore
================================================
^.*\.Rproj$
^\.Rproj\.user$
^src/Makevars$
^windows
\.pdf$
\.png$
\.webp$
\.jpeg$
\.o$
\.dll$
^\.travis\.yml$
^appveyor\.yml$
^README.md$
vignettes/.*\.png$
^configure.log$
^\.github$
^\.deps$


================================================
FILE: .github/.gitignore
================================================
*.html


================================================
FILE: .github/workflows/R-CMD-check.yaml
================================================
# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
on:
  push:
  pull_request:

name: R-CMD-check.yaml

permissions: read-all

jobs:
  R-CMD-check:
    runs-on: ${{ matrix.config.os }}

    name: ${{ matrix.config.os }} (${{ matrix.config.r }})

    strategy:
      fail-fast: false
      matrix:
        config:
          - {os: macos-15-intel,  r: 'release'}
          - {os: macos-latest,    r: 'next'}
          - {os: windows-latest , r: '4.1'}
          - {os: windows-latest , r: '4.2'}
          - {os: windows-latest , r: '4.3'}
          - {os: windows-latest , r: '4.4'}
          - {os: windows-latest , r: 'devel'}
          - {os: ubuntu-latest,   r: 'devel', http-user-agent: 'release'}
          - {os: ubuntu-latest,   r: 'release'}
          - {os: ubuntu-latest,   r: 'oldrel-4'}

    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
      R_KEEP_PKG_SOURCE: yes

    steps:
      - uses: actions/checkout@v4

      - uses: r-lib/actions/setup-pandoc@v2

      - uses: r-lib/actions/setup-r@v2
        with:
          r-version: ${{ matrix.config.r }}
          http-user-agent: ${{ matrix.config.http-user-agent }}
          use-public-rspm: true

      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rcmdcheck
          needs: check

      - uses: r-lib/actions/check-r-package@v2
        env:
          MAKEFLAGS: -j4


================================================
FILE: .gitignore
================================================
*.o
*.so
*.dll
*.a
*.txt
*.pdf
*.png
*.webp
*.jpeg
.Rproj.user
.Rhistory
inst/tessdata
windows
src/Makevars
configure.log


================================================
FILE: DESCRIPTION
================================================
Package: tesseract
Type: Package
Title: Open Source OCR Engine
Version: 5.2.5
Authors@R: person("Jeroen", "Ooms", role = c("aut", "cre"), email = "jeroenooms@gmail.com",
    comment = c(ORCID = "0000-0002-4035-0289"))
Description: Bindings to 'Tesseract': 
     a powerful optical character recognition (OCR) engine that supports over 100 languages.
     The engine is highly configurable in order to tune the detection algorithms and
     obtain the best possible results.
License: Apache License 2.0
URL: https://docs.ropensci.org/tesseract/
    https://ropensci.r-universe.dev/tesseract
BugReports: https://github.com/ropensci/tesseract/issues
SystemRequirements: Tesseract >= 3.03 (libtesseract-dev / tesseract-devel) and
    Leptonica (libleptonica-dev / leptonica-devel). On Debian you need to install
    the English training data separately (tesseract-ocr-eng)
Imports:
    Rcpp (>= 0.12.12),
    pdftools (>= 1.5),    
    curl,
    rappdirs,
    digest
LinkingTo: Rcpp
RoxygenNote: 7.3.3
Roxygen: list(markdown = TRUE)
Suggests:
    magick (>= 1.7),
    spelling,
    knitr,
    tibble,
    rmarkdown
Encoding: UTF-8
VignetteBuilder: knitr
Language: en-US


================================================
FILE: NAMESPACE
================================================
# Generated by roxygen2: do not edit by hand

S3method(print,tesseract)
export(ocr)
export(ocr_data)
export(tesseract)
export(tesseract_download)
export(tesseract_info)
export(tesseract_params)
importFrom(Rcpp,sourceCpp)
useDynLib(tesseract)


================================================
FILE: NEWS
================================================
5.2.5
  - Wrap examples in donttest for cran policies

5.2.4
  - Do not use CXX11 anymore in configure script (fixes R-4.6)

5.2.1
  - Fix shell script for cross compilation

5.2.0
  - Windows: update to tesseract 5.3.2

5.1.0
  - Win: update to tesseract 5.1.0.
  - Win: apply patch for freezes when running under UTF-8 in R-4.2.
    See: https://github.com/tesseract-ocr/tesseract/issues/3830

5.0.0
  - Win/Mac: update to libtesseract 5.0.1
  - Remove locale workaround on libtesseract 4.1+ (should only be needed for 4.0)
  - Remove cruft that was needed to support Solaris

4.2.0
  - Prepare for API changes in upcoming Tesseract 5 release
  - Change the default language="eng" in tesseract()

4.1.2
  - Fix for upstream master/main renames in language repos

4.1.1
  - Win/Mac: update to libtesseract 4.1.1

4.1
  - Fix memory leak in ocr_data()
  - Windows / MacOS: update to libtesseract 4.1.0. This re-enables
    the whitelist/blacklist options that were missing in Tesseract 4.0

4.0
  - Windows, MacOS: Upgrade to upstream Tesseract 4.0! Completely new OCR engine.
  - Tesseract 4 has a new training data format. On Windows / MacOS you need to
    re-download your language data with tesseract_download(). The package uses
    separate directories for storing Tesseract 3 vs 4 data so they shouldn't get
    mixed up (hopefully).
  - Drop hard-dependency on tibble (only load if available)

2.3
  - Fix problem with setlocale() not properly restoring locale.
  - Switch examples from dontrun{} to donttest{}, and '--run-donttest' on travis/appveyor

2.2
  - Fixes for breaking changes in Tesseract 4.0.0 beta.3
  - Set LC_ALL = C when initiating tesseract
  - Include <tesseract/*> to support Tesseract 4

2.1
  - Fixes for 4.0.0-beta.1: they switched to semver + other data branch
  - Set LC_CTYPE to "C" when loading training data (required for some asian languages)
  - Add back OSD training data on Windows

2.0
  - Set tesseract parameters at init so that all parameters types now actually work!
  - New function tesseract_params() lists all supported parameters and their defaults
  - Added 'config' argument to tesseract() which specifies a file with parameter values
  - Internally validate parameter names before init to prevent tesseract crashes
  - Rewrite the ocr_data() function in C++ to make it much faster
  - Tesseract 4 now gets data from the tessdata_fast repo as recommended upstream
  - Use default resolution of 300dpi when image does not contain resolution info

1.9
  - Tesseract 4 now downloads training data from the "tessdata_fast" repo
  - Add ocr_data() function that parses the hOCR output

1.8
  - Add support for HOCR output (#20)
  - Remove 'script' and 'orientation' attributes in output (doesn't work in Tesseract 4)

1.7 (internal)
  - Add support for upcoming Tesseract 4 (compiler fix + separate tessdata dir)
  - Configure script now explicitly tests for CXX11 (required by Tesseract 4)

1.6
  - Windows: update libtesseract to 3.05.01
  - tesseract_download now uses 3.04 tree (instead of 4.00) as suggested in readme
  - For static packages on Win/Mac, languages stored in: rappdirs::user_data_dir('tesseract')
  - Use 'png' instead of 'tiff' to read magick images
  - Compile with $(C_VISIBILITY) to hide internal symbols (requires Rcpp 0.12.12)
  - Use Rcpp symbol registration

1.4
  - Run engine finalizer on R exit (requires Rcpp 0.12.10)
  - Move autobrew script to separate repository
  - Add symbol registration

1.3
  - tesseract() gains an 'options' parameter for setting engine variables
  - New tesseract_download() function for installing training data on Win/Mac
  - Initiate default tesseract engine onAttach() to fail for missing training data
  - Add support for ocr() on magick images

1.2
  - Try to fix build for CRAN OS-X, again.

1.1
  - Try to fix build for CRAN OS-X build server
  - Show 'loaded' and 'available' languages in print.tesseract()

1.0
  - Initial CRAN release


================================================
FILE: R/RcppExports.R
================================================
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

tesseract_config <- function() {
    .Call('_tesseract_tesseract_config', PACKAGE = 'tesseract')
}

tesseract_engine_internal <- function(datapath, language, confpaths, opt_names, opt_values) {
    .Call('_tesseract_tesseract_engine_internal', PACKAGE = 'tesseract', datapath, language, confpaths, opt_names, opt_values)
}

tesseract_engine_set_variable <- function(ptr, name, value) {
    .Call('_tesseract_tesseract_engine_set_variable', PACKAGE = 'tesseract', ptr, name, value)
}

validate_params <- function(params) {
    .Call('_tesseract_validate_params', PACKAGE = 'tesseract', params)
}

engine_info_internal <- function(ptr) {
    .Call('_tesseract_engine_info_internal', PACKAGE = 'tesseract', ptr)
}

print_params <- function(filename) {
    .Call('_tesseract_print_params', PACKAGE = 'tesseract', filename)
}

get_param_values <- function(ptr, params) {
    .Call('_tesseract_get_param_values', PACKAGE = 'tesseract', ptr, params)
}

ocr_raw <- function(input, ptr, HOCR = FALSE) {
    .Call('_tesseract_ocr_raw', PACKAGE = 'tesseract', input, ptr, HOCR)
}

ocr_file <- function(file, ptr, HOCR = FALSE) {
    .Call('_tesseract_ocr_file', PACKAGE = 'tesseract', file, ptr, HOCR)
}

ocr_raw_data <- function(input, ptr) {
    .Call('_tesseract_ocr_raw_data', PACKAGE = 'tesseract', input, ptr)
}

ocr_file_data <- function(file, ptr) {
    .Call('_tesseract_ocr_file_data', PACKAGE = 'tesseract', file, ptr)
}


================================================
FILE: R/ocr.R
================================================
#' Tesseract OCR
#'
#' Extract text from an image. Requires that you have training data for the language you
#' are reading. Works best for images with high contrast, little noise and horizontal text.
#' See [tesseract wiki](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) and
#' our package vignette for image preprocessing tips.
#'
#' The `ocr()` function returns plain text by default, or hOCR text if `HOCR` is set to `TRUE`.
#' The `ocr_data()` function returns a data frame with a confidence rate and bounding box for
#' each word in the text.
#'
#' @export
#' @useDynLib tesseract
#' @family tesseract
#' @param image file path, url, or raw vector to image (png, tiff, jpeg, etc)
#' @param engine a tesseract engine created with [tesseract()]. Alternatively a
#' language string which will be passed to [tesseract()].
#' @param HOCR if `TRUE` return results as HOCR xml instead of plain text
#' @rdname ocr
#' @references [Tesseract: Improving Quality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
#' @importFrom Rcpp sourceCpp
#' @examples \donttest{
#' text <- ocr("https://jeroen.github.io/images/testocr.png")
#' cat(text)
#'
#' xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
#' cat(xml)
#'
#' df <- ocr_data("https://jeroen.github.io/images/testocr.png")
#' print(df)
#'
#' # Full roundtrip test: render PDF to image and OCR it back to text
#' curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
#' orig <- pdftools::pdf_text("R-intro.pdf")[1]
#'
#' # Render pdf to tiff image
#' img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)
#' unlink("R-intro.pdf")
#'
#' # Extract text from tiff image
#' text <- ocr(img_file)
#' unlink(img_file)
#' cat(text)
#' }
#'
#' engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
ocr <- function(image, engine = tesseract("eng"), HOCR = FALSE) {
  if(is.character(engine))
    engine <- tesseract(engine)
  stopifnot(inherits(engine, "tesseract"))
  if(inherits(image, "magick-image")){
    vapply(image, function(x){
      tmp <- tempfile(fileext = ".png")
      on.exit(unlink(tmp))
      magick::image_write(x, tmp, format = 'PNG', density = '300x300')
      ocr(tmp, engine = engine, HOCR = HOCR)
    }, character(1))
  } else if(is.character(image)){
    image <- download_files(image)
    vapply(image, ocr_file, character(1), ptr = engine, HOCR = HOCR, USE.NAMES = FALSE)
  } else if(is.raw(image)){
    ocr_raw(image, engine, HOCR = HOCR)
  } else {
    stop("Argument 'image' must be file-path, url or raw vector")
  }
}

#' @rdname ocr
#' @export
ocr_data <- function(image, engine = tesseract("eng")) {
  if(is.character(engine))
    engine <- tesseract(engine)
  stopifnot(inherits(engine, "tesseract"))
  df_list <- if(inherits(image, "magick-image")){
    lapply(image, function(x){
      tmp <- tempfile(fileext = ".png")
      on.exit(unlink(tmp))
      magick::image_write(x, tmp, format = 'PNG', density = '300x300')
      ocr_data(tmp, engine = engine)
    })
  } else if(is.character(image)){
    image <- download_files(image)
    lapply(image, function(im){
      ocr_file_data(im, ptr = engine)
    })
  } else if(is.raw(image)){
    list(ocr_raw_data(image, engine))
  } else {
    stop("Argument 'image' must be file-path, url or raw vector")
  }
  df_as_tibble(do.call(rbind.data.frame, unname(df_list)))
}


================================================
FILE: R/onload.R
================================================
.onLoad <- function(lib, pkg){
  pkgdir <- file.path(lib, pkg)
  version <- tesseract_version_major()
  appname <- ifelse(version < 4, "tesseract", paste0("tesseract", version))
  sysdir <- rappdirs::user_data_dir(appname)
  pkgdata <- normalizePath(file.path(pkgdir, "tessdata"), mustWork = FALSE)
  sysdata <- normalizePath(file.path(sysdir, "tessdata"), mustWork = FALSE)
  if(!is_testload() && file.exists(pkgdata) && !file.exists(file.path(sysdata, "eng.traineddata"))){
    dir.create(sysdir, showWarnings = FALSE, recursive = TRUE)
    if(file.exists(sysdir)){
      onload_notify()
      olddir <- getwd()
      on.exit(setwd(olddir))
      setwd(pkgdir)
      file.copy("tessdata", sysdir, recursive = TRUE)
    }
  }
  if(is.na(Sys.getenv("TESSDATA_PREFIX", NA))){
    if(file.exists(file.path(sysdata, "eng.traineddata"))){
      Sys.setenv(TESSDATA_PREFIX = sysdata)
    } else if(file.exists(file.path(pkgdata, "eng.traineddata"))){
      Sys.setenv(TESSDATA_PREFIX = pkgdata)
    }
  }

  if(grepl('tesseract.Rcheck', getwd(), fixed = TRUE)){
    Sys.setenv(OMP_THREAD_LIMIT=2)
    Sys.setenv(OMP_NUM_THREADS=2)
  }
}

tesseract_version_major <- function(){
  as.numeric(substring(tesseract_config()$version, 1, 1))
}

onload_notify <- function(){
  message("First use of Tesseract: copying language data...\n")
}

is_testload <- function(){
  as.logical(nchar(Sys.getenv("R_INSTALL_PKG")))
}

.onUnload <- function(lib){
  Sys.unsetenv("TESSDATA_PREFIX")
}

.onAttach <- function(lib, pkg){
  check_training_data()

  # Load tibble (if available) for pretty printing
  if(interactive() && is.null(.getNamespace('tibble'))){
    tryCatch({
      getNamespace('tibble')
    }, error= function(e){})
  }
}

check_training_data <- function(){
  tryCatch(tesseract(), error = function(e){
    warning("Unable to find English training data", call. = FALSE)
    os <- utils::sessionInfo()$running
    if(isTRUE(grepl("ubuntu|debian", os, TRUE))){
      stop("DEBIAN / UBUNTU: Please run: apt-get install tesseract-ocr-eng")
    }
  })
}


================================================
FILE: R/tessdata.R
================================================
#' Tesseract Training Data
#'
#' Helper function to download training data from the official
#' [tessdata](https://tesseract-ocr.github.io/tessdoc/Data-Files) repository. On Linux, the fast training data can be installed directly with
#' [yum](https://src.fedoraproject.org/rpms/tesseract) or
#' [apt-get](https://packages.debian.org/search?suite=stable&section=all&arch=any&searchon=names&keywords=tesseract-ocr-).
#'
#' Tesseract uses training data to perform OCR. Most systems default to English
#' training data. To improve OCR performance for other languages you can install the
#' training data from your distribution. For example to install the spanish training data:
#'
#'  - [tesseract-ocr-spa](https://packages.debian.org/testing/tesseract-ocr-spa) (Debian, Ubuntu)
#'  - `tesseract-langpack-spa` (Fedora, EPEL)
#'
#' On Windows and MacOS you can install languages using the [tesseract_download] function
#' which downloads training data directly from [github](https://github.com/tesseract-ocr/tessdata)
#' and stores it in the path on disk given by the `TESSDATA_PREFIX` variable.
#'
#' @export
#' @aliases tessdata
#' @rdname tessdata
#' @family tesseract
#' @param lang three letter code for language, see [tessdata](https://github.com/tesseract-ocr/tessdata) repository.
#' @param datapath destination directory where to download and store the file
#' @param model either `fast` or `best` is currently supported. The latter downloads
#' more accurate (but slower) trained models for Tesseract 4.0 or higher
#' @param progress print progress while downloading
#' @references [tesseract wiki: training data](https://tesseract-ocr.github.io/tessdoc/Data-Files)
#' @examples \dontrun{
#' if(is.na(match("fra", tesseract_info()$available)))
#'   tesseract_download("fra", model = 'best')
#' french <- tesseract("fra")
#' text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
#' cat(text)
#' }
tesseract_download <- function(lang, datapath = NULL, model = c("fast", "best"), progress = interactive()) {
  stopifnot(is.character(lang))
  model <- match.arg(model)
  if(!length(datapath)){
    warn_on_linux()
    datapath <- tesseract_info()$datapath
  }
  datapath <- normalizePath(datapath, mustWork = TRUE)
  version <- tesseract_version_major()

  if(version < 4){
    repo <- "tessdata"
    release <- "3.04.00"
  } else {
    repo <- paste0("tessdata_", model)
    release <- "4.1.0"
  }

  url <- sprintf("https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata", repo, release, lang)

  destfile <- file.path(datapath, basename(url))

  if (file.exists(destfile)) {
    message(paste("Training data already exists. Overwriting", destfile))
  }

  req <- curl::curl_fetch_memory(url, curl::new_handle(
    progressfunction = progress_fun,
    noprogress = !isTRUE(progress)
  ))
  if(isTRUE(progress))
    cat("\n")
  if(req$status_code != 200)
    stop("Download failed: HTTP ", req$status_code, call. = FALSE)

  writeBin(req$content, destfile)
  return(destfile)
}

progress_fun <- function(down, up) {
  total <- down[[1]]
  now <- down[[2]]
  pct <- if(length(total) && total > 0){
    paste0("(", round(now/total * 100), "%)")
  } else {
    ""
  }
  if(now > 10000)
    cat("\r Downloaded:", sprintf("%.2f", now / 2^20), "MB ", pct)
  TRUE
}

warn_on_linux <- function(){
  if(identical(.Platform$OS.type, "unix") && !identical(Sys.info()[["sysname"]], "Darwin")){
    warning("On Linux you should install training data via yum/apt. Please check the manual page.", call. = FALSE)
  }
}


================================================
FILE: R/tesseract.R
================================================
#' Tesseract Engine
#'
#' Create an OCR engine for a given language and control parameters. This can be used by
#' the [ocr] and [ocr_data] functions to recognize text.
#'
#' Tesseract control parameters can be set either via a named list in the
#' `options` parameter, or in a `configs` text file which contains the parameter name
#' followed by a space and then the value, one per line. Use [tesseract_params()] to list
#' or find parameters. Note that some parameters are only supported in certain versions
#' of libtesseract, and that invalid parameters can sometimes cause libtesseract to crash.
#'
#' @export
#' @rdname tesseract
#' @family tesseract
#' @param language string with language for training data. Usually defaults to `eng`
#' @param datapath path with the training data for this language. Default uses
#' the system library.
#' @param configs character vector with files, each containing one or more parameter
#' values. These config files can exist in the current directory or one of the standard
#' tesseract config files that live in the tessdata directory. See details.
#' @param options a named list with tesseract parameters. See details.
#' @param cache speed things up by caching engines
tesseract <- local({
  store <- new.env()
  function(language = "eng", datapath = NULL, configs = NULL, options = NULL, cache = TRUE){
    datapath <- normalizePath(as.character(datapath), mustWork = TRUE)
    language <- as.character(language)
    configs <- as.character(configs)
    options <- as.list(options)
    if(isTRUE(cache)){
      key <- digest::digest(list(language, datapath, configs, options))
      if(is.null(store[[key]])){
        ptr <- tesseract_engine(datapath, language, configs, options)
        assign(key, ptr, store)
      }
      store[[key]]
    } else {
      tesseract_engine(datapath, language, configs, options)
    }
  }
})

#' @export
#' @rdname tesseract
#' @param filter only list parameters containing a particular string
#' @examples tesseract_params('debug')
tesseract_params <- function(filter = ""){
  tmp <- print_params(tempfile())
  on.exit(unlink(tmp))
  df <- parse_params(tmp)
  subset <- grepl(filter, paste(df$param, df$desc), ignore.case = TRUE)
  df_as_tibble(df[subset,])
}

#' @export
#' @rdname tesseract
tesseract_info <- function(){
  info <- engine_info_internal(tesseract())
  config <- tesseract_config()
  list(datapath = info$datapath,
       available = info$available,
       version = config$version,
       configs = list.files(file.path(info$datapath, "configs")))
}

parse_params <- function(path){
  utils::read.delim(path, header = FALSE, quote = "",
                    col.names = c("param", "default", "desc"), stringsAsFactors = FALSE)
}

tesseract_engine <- function(datapath, language, configs, options){

  # Tesseract::read_config_file first checks for local file, then in tessdata
  lapply(configs, function(confpath){
    if(file.exists(confpath)){
      params <- tryCatch(utils::read.table(confpath, quote = ""), error = function(e){
        bail("Failed to parse config file '%s': %s", confpath, e$message)
      })
      ok <- validate_params(params$V1)
      if(any(!ok))
        bail("Unsupported Tesseract parameter(s): [%s] in %s", paste(params$V1[!ok], collapse = ", "), confpath)
    }
  })

  opt_names <- as.character(names(options))
  opt_values <- as.character(options)
  ok <- validate_params(opt_names)
  if(any(!ok))
    bail("Unsupported Tesseract parameter(s): [%s]", paste(opt_names[!ok], collapse = ", "))

  tesseract_engine_internal(datapath, language, configs, opt_names, opt_values)
}

download_files <- function(urls){
  files <- vapply(urls, function(path){
    if(grepl("^https?://", path)){
      tmp <- tempfile(fileext =  basename(path))
      curl::curl_download(path, tmp)
      path <- tmp
    }
    normalizePath(path, mustWork = TRUE)
  }, character(1))
  # Escape the dot so only a literal ".pdf" extension matches
  is_pdf <- grepl("\\.pdf$", files)
  out <- unlist(lapply(files[is_pdf], function(path){
    pdftools::pdf_convert(path, dpi = 600)
  }))
  c(files[!is_pdf], out)
}

#' @export
"print.tesseract" <- function(x, ...){
  info <- engine_info_internal(x)
  cat("<tesseract engine>\n")
  cat(" loaded:", info$loaded, "\n")
  cat(" datapath:", info$datapath, "\n")
  cat(" available:", info$available, "\n")
}

bail <- function(...){
  stop(sprintf(...), call. = FALSE)
}

df_as_tibble <- function(df){
  stopifnot(is.data.frame(df))
  class(df) <- c("tbl_df", "tbl", "data.frame")
  df
}


================================================
FILE: README.md
================================================
# tesseract

> Bindings to [Tesseract-OCR](https://opensource.google/projects/tesseract): 
  a powerful optical character recognition (OCR) engine that supports over 100 languages.
  The engine is highly configurable in order to tune the detection algorithms and
  obtain the best possible results.

[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/tesseract)](https://cran.r-project.org/package=tesseract)
[![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/tesseract)](https://cran.r-project.org/package=tesseract)

 - Upstream Tesseract-OCR documentation: https://tesseract-ocr.github.io/tessdoc/
 - Introduction: https://docs.ropensci.org/tesseract/articles/intro.html
 - Reference: https://docs.ropensci.org/tesseract/reference/ocr.html

## Hello World

Simple example

```r
# Simple example
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)

# Get XML HOCR output
xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)
```
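
Besides plain text and hOCR, the package also exports `ocr_data()`, which returns one row per word with a confidence rate and a bounding box. The sketch below shows one possible way to drop low-confidence words. It assumes the result has columns named `word` and `confidence` on a 0-100 scale (an assumption, not stated in this README), and it uses a small hand-made data frame in place of real OCR output so that it runs without Tesseract installed. `keep_confident` is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: keep only words whose confidence meets a threshold.
# Assumes columns named `word` and `confidence` (0-100 scale).
keep_confident <- function(df, min_conf = 80) {
  stopifnot(is.data.frame(df), all(c("word", "confidence") %in% names(df)))
  df[df$confidence >= min_conf, , drop = FALSE]
}

# Hand-made stand-in for the output of ocr_data(); values are illustrative only.
words <- data.frame(
  word = c("This", "is", "a", "l0t"),
  confidence = c(96.1, 95.8, 93.0, 41.5),
  stringsAsFactors = FALSE
)

confident <- keep_confident(words, min_conf = 80)
paste(confident$word, collapse = " ")
```

With the package installed, the same filter could be applied to real output, for example `keep_confident(ocr_data("https://jeroen.github.io/images/testocr.png"))`, provided the column names match the assumption above.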

Roundtrip test: render PDF to image and OCR it back to text

```r
# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]

# Render pdf to tiff image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)

# Extract text from tiff image
text <- ocr(img_file)
unlink(img_file)
cat(text)
```
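
The roundtrip leaves two strings, `orig` and `text`, that can be compared. One way to quantify how close they are is a normalized edit distance, sketched here with base R's `utils::adist()`. Both `text_similarity` and `squash` are hypothetical helpers introduced for illustration, and the score is only a rough indicator, not a measure the package itself reports.

```r
# Hypothetical helper: similarity in [0, 1] based on Levenshtein edit distance.
text_similarity <- function(a, b) {
  stopifnot(is.character(a), length(a) == 1, is.character(b), length(b) == 1)
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - as.numeric(utils::adist(a, b)) / longest
}

# Collapse runs of whitespace so layout differences do not dominate the score.
squash <- function(x) trimws(gsub("\\s+", " ", x))

# Identical after whitespace normalization
same <- text_similarity(squash("An  Introduction to R"), squash("An Introduction to R\n"))

# Three edits over a longest length of seven characters
partial <- text_similarity("kitten", "sitting")
```

After running the roundtrip, `text_similarity(squash(orig), squash(text))` would give a single number summarizing how much of the page survived OCR.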

## Installation

On Windows and MacOS the binary package can be installed from CRAN:

```r
install.packages("tesseract")
```

Installation from source on Linux or OSX requires the `Tesseract` library (see below).

### Install from source

On __Debian__ or __Ubuntu__ install [libtesseract-dev](https://packages.debian.org/testing/libtesseract-dev) and
[libleptonica-dev](https://packages.debian.org/testing/libleptonica-dev). Also install [tesseract-ocr-eng](https://packages.debian.org/testing/tesseract-ocr-eng) to run examples.

```
sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng
```

On __Ubuntu__ you can optionally use [this PPA](https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr-devel) to get the latest version of Tesseract:

```
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt-get install -y libtesseract-dev tesseract-ocr-eng
```

On __Fedora__ we need [tesseract-devel](https://src.fedoraproject.org/rpms/tesseract) and
[leptonica-devel](https://src.fedoraproject.org/rpms/leptonica)

```
sudo yum install tesseract-devel leptonica-devel
```

On __RHEL__ and __CentOS__ we need [tesseract-devel](https://src.fedoraproject.org/rpms/tesseract) and
[leptonica-devel](https://src.fedoraproject.org/rpms/leptonica) from EPEL

```
sudo yum install epel-release
sudo yum install tesseract-devel leptonica-devel
```


On __OS-X__ use [tesseract](https://github.com/Homebrew/homebrew-core/blob/master/Formula/tesseract.rb) from Homebrew:

```
brew install tesseract
```

Tesseract uses training data to perform OCR. Most systems default to English
training data. To improve OCR results for other languages you can install the
appropriate training data. On Windows and OSX you can do this in R using 
`tesseract_download()`:


```r
tesseract_download('fra')
```
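
To avoid downloading data that is already present, it can help to compare the languages you want against those already installed, which `tesseract_info()$available` reports. The sketch below separates that comparison into a hypothetical helper, `missing_languages`, and uses a made-up `available` vector so that it runs without the package; the actual download loop is left as a comment.

```r
# Hypothetical helper: which of the wanted languages are not yet installed?
missing_languages <- function(wanted, available) {
  setdiff(wanted, available)
}

# Stand-in for tesseract_info()$available; real values depend on your system.
available <- c("eng", "osd")
todo <- missing_languages(c("eng", "fra", "spa"), available)
todo

# With the package loaded, the download step could look like this (not run here):
# for (lang in missing_languages(c("fra", "spa"), tesseract_info()$available))
#   tesseract_download(lang)
```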

On Linux you need to install the appropriate training data from your distribution. 
For example to install the spanish training data:

  - [tesseract-ocr-spa](https://packages.debian.org/testing/tesseract-ocr-spa) (Debian, Ubuntu)
  - [tesseract-langpack-spa](https://src.fedoraproject.org/rpms/tesseract-langpack) (Fedora, EPEL)

Alternatively you can manually download training data from [github](https://github.com/tesseract-ocr/tessdata)
and store it in a path on disk that you pass in the `datapath` parameter or set a default path via the
`TESSDATA_PREFIX` environment variable. Note that Tesseract 4 and Tesseract 3 use different 
training data formats. Make sure to download training data from the branch that matches your libtesseract version.
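
Choosing the matching branch can be sketched as follows. The helper mirrors the repository and release names that `tesseract_download()` uses in this package's `R/tessdata.R` (`tessdata` at `3.04.00` for Tesseract 3, `tessdata_fast` or `tessdata_best` at `4.1.0` otherwise). The function name `traineddata_url` and its `major` argument are hypothetical and only serve the illustration.

```r
# Hypothetical helper mirroring the URL scheme used by tesseract_download().
# `major` is the major version of the installed libtesseract.
traineddata_url <- function(lang, model = c("fast", "best"), major = 4) {
  stopifnot(is.character(lang), length(lang) == 1)
  model <- match.arg(model)
  if (major < 4) {
    # Tesseract 3 uses the legacy data format from the plain tessdata repo
    repo <- "tessdata"
    release <- "3.04.00"
  } else {
    # Tesseract 4 and later use the tessdata_fast / tessdata_best repos
    repo <- paste0("tessdata_", model)
    release <- "4.1.0"
  }
  sprintf("https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata", repo, release, lang)
}

url_new <- traineddata_url("fra", "best", major = 5)
url_old <- traineddata_url("fra", major = 3)
```

A file downloaded from such a URL can then be placed in the directory passed as `datapath`, or in the directory that `TESSDATA_PREFIX` points to.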


================================================
FILE: cleanup
================================================
#!/bin/sh
rm -f src/Makevars configure.log autobrew


================================================
FILE: configure
================================================
# Anticonf (tm) script by Jeroen Ooms (2022)
# This script will query 'pkg-config' for the required cflags and ldflags.
# If pkg-config is unavailable or does not find the library, try setting
# INCLUDE_DIR and LIB_DIR manually via e.g:
# R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'

# Library settings
PKG_CONFIG_NAME="tesseract"
PKG_DEB_NAME="libtesseract-dev libleptonica-dev"
PKG_RPM_NAME="tesseract-devel leptonica-devel"
PKG_BREW_NAME="tesseract"
PKG_TEST_HEADER="<baseapi.h>"
PKG_CFLAGS="-I/usr/include/tesseract -I/usr/include/leptonica"
PKG_LIBS="-ltesseract"

# Use pkg-config if available
pkg-config --version >/dev/null 2>&1
if [ $? -eq 0 ]; then
  PKGCONFIG_CFLAGS=`pkg-config --cflags --silence-errors ${PKG_CONFIG_NAME}`
  PKGCONFIG_LIBS=`pkg-config --libs ${PKG_CONFIG_NAME}`
fi
# Note that cflags may be empty in case of success
if [ "$INCLUDE_DIR" ] || [ "$LIB_DIR" ]; then
  echo "Found INCLUDE_DIR and/or LIB_DIR!"
  PKG_CFLAGS="-I$INCLUDE_DIR $PKG_CFLAGS"
  PKG_LIBS="-L$LIB_DIR $PKG_LIBS"
elif [ "$PKGCONFIG_CFLAGS" ] || [ "$PKGCONFIG_LIBS" ]; then
  echo "Found pkg-config cflags and libs!"
  PKG_CFLAGS=${PKGCONFIG_CFLAGS}
  PKG_LIBS=${PKGCONFIG_LIBS}
elif [ `uname` = "Darwin" ]; then
  test ! "$CI" && brew --version 2>/dev/null
  if [ $? -eq 0 ]; then
    BREWDIR=`brew --prefix`
    PKG_CFLAGS="-I$BREWDIR/include/tesseract -I$BREWDIR/include/leptonica"
    PKG_LIBS="-L$BREWDIR/lib $PKG_LIBS"
  else
    curl -sfL "https://autobrew.github.io/scripts/tesseract" > autobrew
    . ./autobrew
  fi
fi

# For debugging
echo "Using PKG_CFLAGS=$PKG_CFLAGS"
echo "Using PKG_LIBS=$PKG_LIBS"

# Find compiler
CXX=`${R_HOME}/bin/R CMD config CXX`
CPPFLAGS=`${R_HOME}/bin/R CMD config CPPFLAGS`

# Test configuration
echo "Using CXX: ${CXX}"
${CXX} -E ${CPPFLAGS} ${PKG_CFLAGS} tools/test.cpp >/dev/null 2>configure.log

# Customize the error
if [ $? -ne 0 ]; then
  echo "--------------------------- [ANTICONF] --------------------------------"
  echo "Configuration failed to find '$PKG_CONFIG_NAME' system library. Try installing:"
  echo " * deb: $PKG_DEB_NAME (Debian, Ubuntu, etc)"
  echo " * rpm: $PKG_RPM_NAME (Fedora, CentOS, RHEL)"
  echo " * brew: $PKG_BREW_NAME (Mac OSX)"
  echo "If $PKG_CONFIG_NAME is already installed, check that 'pkg-config' is in your"
  echo "PATH and PKG_CONFIG_PATH contains a $PKG_CONFIG_NAME.pc file. If pkg-config"
  echo "is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:"
  echo "R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'"
  echo "-------------------------- [ERROR MESSAGE] ---------------------------"
  cat configure.log
  echo "--------------------------------------------------------------------"
  exit 1
fi

# Write to Makevars
sed -e "s|@cflags@|$PKG_CFLAGS|" -e "s|@libs@|$PKG_LIBS|" src/Makevars.in > src/Makevars

# Success
exit 0


================================================
FILE: configure.win
================================================


================================================
FILE: inst/AUTHORS
================================================
Authors of upstream tesseract library and training data:

Ray Smith (lead developer)
Ahmad Abdulkader
Rika Antonova
Nicholas Beato
Jeff Breidenbach
Samuel Charron
Phil Cheatle
Simon Crouch
David Eger
Sheelagh Huddleston
Dan Johnson
Rajesh Katikam
Thomas Kielbus
Dar-Shyang Lee
Zongyi (Joe) Liu
Robert Moss
Chris Newton
Michael Reimer
Marius Renn
Raquel Romano
Christy Russon
Shobhit Saxena
Mark Seaman
Faisal Shafait
Hiroshi Takenaka
Ranjith Unnikrishnan
Joern Wanke
Ping Ping Xiu
Andrew Ziem
Oscar Zuniga

Community Contributors:
Zdenko Podobný (Maintainer)
Jim Regan (Maintainer)
James R Barlow
Amit Dovev
Martin Ettl
Tom Morris
Tobias Müller
Egor Pugin
Sundar M. Vaidya
Stefan Weil


================================================
FILE: inst/COPYRIGHT
================================================
The package includes machine-generated training data which is released
by Tesseract developers under Apache 2.0 license. Both data and license
are available from: https://github.com/tesseract-ocr/tessdata


================================================
FILE: inst/WORDLIST
================================================
config
EPEL
github
greyscale
hOCR
HOCR
https
jpeg
knitr
langpack
libtesseract
MacOS
magick
Magick
Nederlands
ocr
opensource
pdftools
png
rmarkdown
spanish
tessdata
toc
utrecht
VignetteEncoding
VignetteEngine
VignetteIndexEntry


================================================
FILE: man/ocr.Rd
================================================
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/ocr.R
\name{ocr}
\alias{ocr}
\alias{ocr_data}
\title{Tesseract OCR}
\usage{
ocr(image, engine = tesseract("eng"), HOCR = FALSE)

ocr_data(image, engine = tesseract("eng"))
}
\arguments{
\item{image}{file path, url, or raw vector to image (png, tiff, jpeg, etc)}

\item{engine}{a tesseract engine created with \code{\link[=tesseract]{tesseract()}}. Alternatively a
language string which will be passed to \code{\link[=tesseract]{tesseract()}}.}

\item{HOCR}{if \code{TRUE} return results as HOCR xml instead of plain text}
}
\description{
Extract text from an image. Requires that you have training data for the language you
are reading. Works best for images with high contrast, little noise and horizontal text.
See \href{https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality}{tesseract wiki} and
our package vignette for image preprocessing tips.
}
\details{
The \code{ocr()} function returns plain text by default, or hOCR text if \code{HOCR} is set to \code{TRUE}.
The \code{ocr_data()} function returns a data frame with a confidence rate and bounding box for
each word in the text.
}
\examples{
\donttest{
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)

xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)

df <- ocr_data("https://jeroen.github.io/images/testocr.png")
print(df)

# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]

# Render pdf to tiff image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)
unlink("R-intro.pdf")

# Extract text from tiff image
text <- ocr(img_file)
unlink(img_file)
cat(text)
}

engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
}
\references{
\href{https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality}{Tesseract: Improving Quality}
}
\seealso{
Other tesseract: 
\code{\link{tesseract}()},
\code{\link{tesseract_download}()}
}
\concept{tesseract}


================================================
FILE: man/tessdata.Rd
================================================
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tessdata.R
\name{tesseract_download}
\alias{tesseract_download}
\alias{tessdata}
\title{Tesseract Training Data}
\usage{
tesseract_download(
  lang,
  datapath = NULL,
  model = c("fast", "best"),
  progress = interactive()
)
}
\arguments{
\item{lang}{three letter code for language, see \href{https://github.com/tesseract-ocr/tessdata}{tessdata} repository.}

\item{datapath}{destination directory in which to store the downloaded file}

\item{model}{either \code{fast} or \code{best} is currently supported. The latter downloads
more accurate (but slower) trained models for Tesseract 4.0 or higher}

\item{progress}{print progress while downloading}
}
\description{
Helper function to download training data from the official
\href{https://tesseract-ocr.github.io/tessdoc/Data-Files}{tessdata} repository. On Linux, the fast training data can be installed directly with
\href{https://src.fedoraproject.org/rpms/tesseract}{yum} or
\href{https://packages.debian.org/search?suite=stable&section=all&arch=any&searchon=names&keywords=tesseract-ocr-}{apt-get}.
}
\details{
Tesseract uses training data to perform OCR. Most systems default to English
training data. To improve OCR performance for other languages you can install the
training data from your distribution. For example to install the Spanish training data:
\itemize{
\item \href{https://packages.debian.org/testing/tesseract-ocr-spa}{tesseract-ocr-spa} (Debian, Ubuntu)
\item \code{tesseract-langpack-spa} (Fedora, EPEL)
}

On Windows and MacOS you can install languages using the \link{tesseract_download} function
which downloads training data directly from \href{https://github.com/tesseract-ocr/tessdata}{github}
and stores it in the path on disk given by the \code{TESSDATA_PREFIX} variable.
}
\examples{
\dontrun{
if(is.na(match("fra", tesseract_info()$available)))
  tesseract_download("fra", model = 'best')
french <- tesseract("fra")
text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
cat(text)
}
}
\references{
\href{https://tesseract-ocr.github.io/tessdoc/Data-Files}{tesseract wiki: training data}
}
\seealso{
Other tesseract: 
\code{\link{ocr}()},
\code{\link{tesseract}()}
}
\concept{tesseract}


================================================
FILE: man/tesseract.Rd
================================================
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tesseract.R
\name{tesseract}
\alias{tesseract}
\alias{tesseract_params}
\alias{tesseract_info}
\title{Tesseract Engine}
\usage{
tesseract(
  language = "eng",
  datapath = NULL,
  configs = NULL,
  options = NULL,
  cache = TRUE
)

tesseract_params(filter = "")

tesseract_info()
}
\arguments{
\item{language}{string with language for training data. Usually defaults to \code{eng}}

\item{datapath}{path with the training data for this language. Default uses
the system library.}

\item{configs}{character vector with files, each containing one or more parameter
values. These config files can exist in the current directory or one of the standard
tesseract config files that live in the tessdata directory. See details.}

\item{options}{a named list with tesseract parameters. See details.}

\item{cache}{speed things up by caching engines}

\item{filter}{only list parameters containing a particular string}
}
\description{
Create an OCR engine for a given language and control parameters. This can be used by
the \link{ocr} and \link{ocr_data} functions to recognize text.
}
\details{
Tesseract control parameters can be set either via a named list in the
\code{options} parameter, or in a \code{config} text file which contains the parameter name
followed by a space and then the value, one per line. Use \code{\link[=tesseract_params]{tesseract_params()}} to list
or find parameters. Note that some parameters are only supported in certain versions
of libtesseract, and that invalid parameters can sometimes cause libtesseract to crash.
}
\examples{
tesseract_params('debug')
}
\seealso{
Other tesseract: 
\code{\link{ocr}()},
\code{\link{tesseract_download}()}
}
\concept{tesseract}


================================================
FILE: src/Makevars.in
================================================
PKG_CPPFLAGS=@cflags@
PKG_LIBS=@libs@

PKG_CXXFLAGS=$(CXX_VISIBILITY)

all: $(SHLIB) cleanup

cleanup: $(SHLIB)
	@rm -Rf ../.deps


================================================
FILE: src/Makevars.win
================================================
RWINLIB = ../.deps/tesseract
PKG_CPPFLAGS = -I$(RWINLIB)/include -I$(RWINLIB)/include/leptonica

PKG_LIBS = \
	-L$(RWINLIB)/lib${subst gcc,,${COMPILED_BY}}${R_ARCH} \
	-L$(RWINLIB)/lib \
	-ltesseract -lleptonica \
	-ltiff -lopenjp2 -lwebp -lsharpyuv -ljpeg -lgif -lpng16 -lz \
	-lws2_32

all: $(SHLIB) cleanup

# Needed for parallel make
$(OBJECTS): | $(RWINLIB)

$(RWINLIB):
	@"${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe" "../tools/winlibs.R"

cleanup: $(SHLIB)
	@rm -Rf $(RWINLIB)


================================================
FILE: src/RcppExports.cpp
================================================
// Generated by using Rcpp::compileAttributes() -> do not edit by hand
// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

#include "tesseract_types.h"
#include <Rcpp.h>

using namespace Rcpp;

#ifdef RCPP_USE_GLOBAL_ROSTREAM
Rcpp::Rostream<true>&  Rcpp::Rcout = Rcpp::Rcpp_cout_get();
Rcpp::Rostream<false>& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get();
#endif

// tesseract_config
Rcpp::List tesseract_config();
RcppExport SEXP _tesseract_tesseract_config() {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    rcpp_result_gen = Rcpp::wrap(tesseract_config());
    return rcpp_result_gen;
END_RCPP
}
// tesseract_engine_internal
TessPtr tesseract_engine_internal(Rcpp::CharacterVector datapath, Rcpp::CharacterVector language, Rcpp::CharacterVector confpaths, Rcpp::CharacterVector opt_names, Rcpp::CharacterVector opt_values);
RcppExport SEXP _tesseract_tesseract_engine_internal(SEXP datapathSEXP, SEXP languageSEXP, SEXP confpathsSEXP, SEXP opt_namesSEXP, SEXP opt_valuesSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type datapath(datapathSEXP);
    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type language(languageSEXP);
    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type confpaths(confpathsSEXP);
    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type opt_names(opt_namesSEXP);
    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type opt_values(opt_valuesSEXP);
    rcpp_result_gen = Rcpp::wrap(tesseract_engine_internal(datapath, language, confpaths, opt_names, opt_values));
    return rcpp_result_gen;
END_RCPP
}
// tesseract_engine_set_variable
TessPtr tesseract_engine_set_variable(TessPtr ptr, const char * name, const char * value);
RcppExport SEXP _tesseract_tesseract_engine_set_variable(SEXP ptrSEXP, SEXP nameSEXP, SEXP valueSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);
    Rcpp::traits::input_parameter< const char * >::type name(nameSEXP);
    Rcpp::traits::input_parameter< const char * >::type value(valueSEXP);
    rcpp_result_gen = Rcpp::wrap(tesseract_engine_set_variable(ptr, name, value));
    return rcpp_result_gen;
END_RCPP
}
// validate_params
Rcpp::LogicalVector validate_params(Rcpp::CharacterVector params);
RcppExport SEXP _tesseract_validate_params(SEXP paramsSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type params(paramsSEXP);
    rcpp_result_gen = Rcpp::wrap(validate_params(params));
    return rcpp_result_gen;
END_RCPP
}
// engine_info_internal
Rcpp::List engine_info_internal(TessPtr ptr);
RcppExport SEXP _tesseract_engine_info_internal(SEXP ptrSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);
    rcpp_result_gen = Rcpp::wrap(engine_info_internal(ptr));
    return rcpp_result_gen;
END_RCPP
}
// print_params
Rcpp::String print_params(std::string filename);
RcppExport SEXP _tesseract_print_params(SEXP filenameSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< std::string >::type filename(filenameSEXP);
    rcpp_result_gen = Rcpp::wrap(print_params(filename));
    return rcpp_result_gen;
END_RCPP
}
// get_param_values
Rcpp::CharacterVector get_param_values(TessPtr ptr, Rcpp::CharacterVector params);
RcppExport SEXP _tesseract_get_param_values(SEXP ptrSEXP, SEXP paramsSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);
    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type params(paramsSEXP);
    rcpp_result_gen = Rcpp::wrap(get_param_values(ptr, params));
    return rcpp_result_gen;
END_RCPP
}
// ocr_raw
Rcpp::String ocr_raw(Rcpp::RawVector input, TessPtr ptr, bool HOCR);
RcppExport SEXP _tesseract_ocr_raw(SEXP inputSEXP, SEXP ptrSEXP, SEXP HOCRSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< Rcpp::RawVector >::type input(inputSEXP);
    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);
    Rcpp::traits::input_parameter< bool >::type HOCR(HOCRSEXP);
    rcpp_result_gen = Rcpp::wrap(ocr_raw(input, ptr, HOCR));
    return rcpp_result_gen;
END_RCPP
}
// ocr_file
Rcpp::String ocr_file(std::string file, TessPtr ptr, bool HOCR);
RcppExport SEXP _tesseract_ocr_file(SEXP fileSEXP, SEXP ptrSEXP, SEXP HOCRSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< std::string >::type file(fileSEXP);
    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);
    Rcpp::traits::input_parameter< bool >::type HOCR(HOCRSEXP);
    rcpp_result_gen = Rcpp::wrap(ocr_file(file, ptr, HOCR));
    return rcpp_result_gen;
END_RCPP
}
// ocr_raw_data
Rcpp::DataFrame ocr_raw_data(Rcpp::RawVector input, TessPtr ptr);
RcppExport SEXP _tesseract_ocr_raw_data(SEXP inputSEXP, SEXP ptrSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< Rcpp::RawVector >::type input(inputSEXP);
    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);
    rcpp_result_gen = Rcpp::wrap(ocr_raw_data(input, ptr));
    return rcpp_result_gen;
END_RCPP
}
// ocr_file_data
Rcpp::DataFrame ocr_file_data(std::string file, TessPtr ptr);
RcppExport SEXP _tesseract_ocr_file_data(SEXP fileSEXP, SEXP ptrSEXP) {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    Rcpp::traits::input_parameter< std::string >::type file(fileSEXP);
    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);
    rcpp_result_gen = Rcpp::wrap(ocr_file_data(file, ptr));
    return rcpp_result_gen;
END_RCPP
}

static const R_CallMethodDef CallEntries[] = {
    {"_tesseract_tesseract_config", (DL_FUNC) &_tesseract_tesseract_config, 0},
    {"_tesseract_tesseract_engine_internal", (DL_FUNC) &_tesseract_tesseract_engine_internal, 5},
    {"_tesseract_tesseract_engine_set_variable", (DL_FUNC) &_tesseract_tesseract_engine_set_variable, 3},
    {"_tesseract_validate_params", (DL_FUNC) &_tesseract_validate_params, 1},
    {"_tesseract_engine_info_internal", (DL_FUNC) &_tesseract_engine_info_internal, 1},
    {"_tesseract_print_params", (DL_FUNC) &_tesseract_print_params, 1},
    {"_tesseract_get_param_values", (DL_FUNC) &_tesseract_get_param_values, 2},
    {"_tesseract_ocr_raw", (DL_FUNC) &_tesseract_ocr_raw, 3},
    {"_tesseract_ocr_file", (DL_FUNC) &_tesseract_ocr_file, 3},
    {"_tesseract_ocr_raw_data", (DL_FUNC) &_tesseract_ocr_raw_data, 2},
    {"_tesseract_ocr_file_data", (DL_FUNC) &_tesseract_ocr_file_data, 2},
    {NULL, NULL, 0}
};

RcppExport void R_init_tesseract(DllInfo *dll) {
    R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
    R_useDynamicSymbols(dll, FALSE);
}


================================================
FILE: src/tesseract.cpp
================================================
#include "tesseract_types.h"
#if TESSERACT_MAJOR_VERSION < 5
#include <tesseract/genericvector.h>
#define getorat get
#else
#define STRING std::string
#define GenericVector std::vector
#define getorat at
#endif

/* libtesseract 4.0 insisted that the engine is initiated in 'C' locale.
 * We do this as exemplified in the example code in the libc manual:
 * https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html
 * Full discussion: https://github.com/tesseract-ocr/tesseract/issues/1670
 */
#if TESSERACT_MAJOR_VERSION == 4 && TESSERACT_MINOR_VERSION == 0
#define TESSERACT40
#endif

static tesseract::TessBaseAPI *make_analyze_api(){
#ifdef TESSERACT40
  char *old_ctype = strdup(setlocale(LC_ALL, NULL));
  setlocale(LC_ALL, "C");
#endif
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->InitForAnalysePage();
#ifdef TESSERACT40
  setlocale(LC_ALL, old_ctype);
  free(old_ctype);
#endif
  return api;
}

// [[Rcpp::export]]
Rcpp::List tesseract_config(){
  tesseract::TessBaseAPI *api = make_analyze_api();
  Rcpp::List out = Rcpp::List::create(
    Rcpp::_["version"] = tesseract::TessBaseAPI::Version(),
    Rcpp::_["path"] = api->GetDatapath()
  );
  api->End();
  delete api;
  return out;
}

// [[Rcpp::export]]
TessPtr tesseract_engine_internal(Rcpp::CharacterVector datapath, Rcpp::CharacterVector language, Rcpp::CharacterVector confpaths,
                                  Rcpp::CharacterVector opt_names, Rcpp::CharacterVector opt_values){
  GenericVector<STRING> params, values;
  const char * path = NULL;
  const char * lang = NULL;
  char * configs[1000] = {0};
  if(datapath.length())
    path = datapath.at(0);
  if(language.length())
    lang = language.at(0);
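  // Guard against overflowing the fixed-size configs array
  if(confpaths.length() > 1000)
    throw std::runtime_error("Too many config files (maximum is 1000)");
  for(int i = 0; i < confpaths.length(); i++)
    configs[i] = confpaths.at(i);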
  for(int i = 0; i < opt_names.length(); i++){
    params.push_back(std::string(opt_names.at(i)).c_str());
    values.push_back(std::string(opt_values.at(i)).c_str());
  }
#ifdef TESSERACT40
  char *old_ctype = strdup(setlocale(LC_ALL, NULL));
  setlocale(LC_ALL, "C");
#endif
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  int err = api->Init(path, lang, tesseract::OEM_DEFAULT, configs, confpaths.length(), &params, &values, false);
#ifdef TESSERACT40
  setlocale(LC_ALL, old_ctype);
  free(old_ctype);
#endif
  if(err){
    delete api;
    throw std::runtime_error(std::string("Unable to find training data for: ") + (lang ? lang : "eng") + ". Please consult manual for: ?tesseract_download");
  }
  TessPtr ptr(api);
  ptr.attr("class") = Rcpp::CharacterVector::create("tesseract");
  return ptr;
}

tesseract::TessBaseAPI * get_engine(TessPtr engine){
  tesseract::TessBaseAPI * api = engine.get();
  if(api == NULL)
    throw std::runtime_error("pointer is dead");
  return api;
}

// [[Rcpp::export]]
TessPtr tesseract_engine_set_variable(TessPtr ptr, const char * name, const char * value){
  tesseract::TessBaseAPI * api = get_engine(ptr);
  if(!api->SetVariable(name, value))
    throw std::runtime_error(std::string("Failed to set variable ") + name);
  return ptr;
}

// [[Rcpp::export]]
Rcpp::LogicalVector validate_params(Rcpp::CharacterVector params){
  STRING str;
  tesseract::TessBaseAPI *api = make_analyze_api();
  Rcpp::LogicalVector out(params.length());
  for(int i = 0; i < params.length(); i++)
    out[i] = api->GetVariableAsString(params.at(i), &str);
  api->End();
  delete api;
  return out;
}

// [[Rcpp::export]]
Rcpp::List engine_info_internal(TessPtr ptr){
  tesseract::TessBaseAPI * api = get_engine(ptr);
  GenericVector<STRING> langs;
  api->GetAvailableLanguagesAsVector(&langs);
  Rcpp::CharacterVector available = Rcpp::CharacterVector::create();
  for (size_t i = 0; i < langs.size(); i++)
    available.push_back(langs.getorat(i).c_str());
  langs.clear();
  api->GetLoadedLanguagesAsVector(&langs);
  Rcpp::CharacterVector loaded = Rcpp::CharacterVector::create();
  for (size_t i = 0; i < langs.size(); i++)
    loaded.push_back(langs.getorat(i).c_str());
  return Rcpp::List::create(
    Rcpp::_["datapath"] = api->GetDatapath(),
    Rcpp::_["loaded"] = loaded,
    Rcpp::_["available"] = available
  );
}

// [[Rcpp::export]]
Rcpp::String print_params(std::string filename){
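  // Open the output file first so a failure cannot leak the api object
  FILE * fp = fopen(filename.c_str(), "w");
  if(fp == NULL)
    throw std::runtime_error("Failed to open file for writing: " + filename);
  tesseract::TessBaseAPI *api = make_analyze_api();
  api->PrintVariables(fp);
  fclose(fp);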
  api->End();
  delete api;
  return filename;
}

// [[Rcpp::export]]
Rcpp::CharacterVector get_param_values(TessPtr ptr, Rcpp::CharacterVector params){
  STRING str;
  tesseract::TessBaseAPI * api = get_engine(ptr);
  Rcpp::CharacterVector out(params.length());
  for(int i = 0; i < params.length(); i++)
    out[i] = api->GetVariableAsString(params.at(i), &str) ? Rcpp::String(str.c_str()) : NA_STRING;
  return out;
}

Rcpp::String ocr_pix(tesseract::TessBaseAPI * api, Pix * image, bool HOCR){
  // Get OCR result
  api->ClearAdaptiveClassifier();
  api->SetImage(image);

  // Workaround for annoying warning, see https://github.com/tesseract-ocr/tesseract/issues/756
  if(api->GetSourceYResolution() < 70)
    api->SetSourceResolution(300);
  char *outText = HOCR ? api->GetHOCRText(0) : api->GetUTF8Text();

  //cleanup
  pixDestroy(&image);
  api->Clear();

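  // Copy the result into an R string, then release the buffer allocated by tesseract
  if(outText == NULL)
    throw std::runtime_error("Failed to extract text from image");
  Rcpp::String y(outText);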
  y.set_encoding(CE_UTF8);
  delete [] outText;
  return y;
}

// [[Rcpp::export]]
Rcpp::String ocr_raw(Rcpp::RawVector input, TessPtr ptr, bool HOCR = false){
    tesseract::TessBaseAPI *api = get_engine(ptr);
    Pix *image =  pixReadMem(input.begin(), input.length());
    if(!image)
      throw std::runtime_error("Failed to read image");
    return ocr_pix(api, image, HOCR);
}

// [[Rcpp::export]]
Rcpp::String ocr_file(std::string file, TessPtr ptr, bool HOCR = false){
  tesseract::TessBaseAPI *api = get_engine(ptr);
  Pix *image =  pixRead(file.c_str());
  if(!image)
    throw std::runtime_error("Failed to read image");
  return ocr_pix(api, image, HOCR);
}

Rcpp::DataFrame ocr_data_internal(tesseract::TessBaseAPI * api, Pix * image){
  api->ClearAdaptiveClassifier();
  api->SetImage(image);
  if(api->GetSourceYResolution() < 70)
    api->SetSourceResolution(300);
  api->Recognize(0);
  tesseract::ResultIterator* ri = api->GetIterator();
  tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
  size_t n = 0;
  std::list<std::string> words;
  std::list<std::string> bbox;
  std::list<float> conf;
  char buf[100];
  if (ri) {
    do {
      const char * word = ri->GetUTF8Text(level);
      if(!word)
        continue;
      words.push_back(word);
      conf.push_back(ri->Confidence(level));
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      snprintf(buf, 100, "%d,%d,%d,%d", x1, y1, x2, y2);
      bbox.push_back(buf);
      delete[] word;
      n++;
    } while (ri->Next(level));
  }
  Rcpp::CharacterVector rwords(n);
  Rcpp::CharacterVector rbbox(n);
  Rcpp::NumericVector rconf(n);
  for(size_t i = 0; i < n; i++) {
    rwords[i] = words.front(); words.pop_front();
    rbbox[i] = bbox.front(); bbox.pop_front();
    rconf[i] = conf.front(); conf.pop_front();
  }

  //cleanup: delete the iterator first because it points to data owned by the api
  delete ri;
  pixDestroy(&image);
  api->Clear();

  return Rcpp::DataFrame::create(
    Rcpp::_["word"] = rwords,
    Rcpp::_["confidence"] = rconf,
    Rcpp::_["bbox"] = rbbox,
    Rcpp::_["stringsAsFactors"] = false
  );
}

// [[Rcpp::export]]
Rcpp::DataFrame ocr_raw_data(Rcpp::RawVector input, TessPtr ptr){
  tesseract::TessBaseAPI *api = get_engine(ptr);
  Pix *image =  pixReadMem(input.begin(), input.length());
  if(!image)
    throw std::runtime_error("Failed to read image");
  return ocr_data_internal(api, image);
}

// [[Rcpp::export]]
Rcpp::DataFrame ocr_file_data(std::string file, TessPtr ptr){
  tesseract::TessBaseAPI *api = get_engine(ptr);
  Pix *image =  pixRead(file.c_str());
  if(!image)
    throw std::runtime_error("Failed to read image");
  return ocr_data_internal(api, image);
}


================================================
FILE: src/tesseract_types.h
================================================
#include <tesseract/baseapi.h>
#include <allheaders.h>

#define R_NO_REMAP
#define STRICT_R_HEADERS

#include <Rcpp.h>

inline void tess_finalizer(tesseract::TessBaseAPI *engine) {
  engine->End();
  delete engine;
}

typedef Rcpp::XPtr<tesseract::TessBaseAPI, Rcpp::PreserveStorage, tess_finalizer, true> TessPtr;


================================================
FILE: tesseract.Rproj
================================================
Version: 1.0
ProjectId: 953b2ed1-ac9d-4be8-984d-c26d5c642f38

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX

AutoAppendNewline: Yes
StripTrailingWhitespace: Yes

BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
PackageRoxygenize: rd,namespace


================================================
FILE: tests/spelling.R
================================================
spelling::spell_check_test(vignettes = TRUE, error = FALSE)


================================================
FILE: tools/test.cpp
================================================
#include <tesseract/baseapi.h>
#include <allheaders.h>


================================================
FILE: tools/winlibs.R
================================================
if(!file.exists('tesseract.o') && !file.exists("../.deps/tesseract/include/tesseract/baseapi.h")){
  unlink("../.deps", recursive = TRUE)
  url <- if(grepl("aarch", R.version$platform)){
    "https://github.com/r-windows/bundles/releases/download/tesseract-5.3.2/tesseract-ocr-5.3.2-clang-aarch64.tar.xz"
  } else if(grepl("clang", Sys.getenv('R_COMPILED_BY'))){
    "https://github.com/r-windows/bundles/releases/download/tesseract-5.3.2/tesseract-ocr-5.3.2-clang-x86_64.tar.xz"
  } else if(getRversion() >= "4.3") {
    "https://github.com/r-windows/bundles/releases/download/tesseract-5.3.2/tesseract-ocr-5.3.2-ucrt-x86_64.tar.xz"
  } else {
    "https://github.com/rwinlib/tesseract/archive/v5.3.2.tar.gz"
  }
  download.file(url, basename(url), quiet = TRUE)
  dir.create("../.deps", showWarnings = FALSE)
  untar(basename(url), exdir = "../.deps", tar = 'internal')
  unlink(basename(url))
  setwd("../.deps")
  file.rename(list.files(), 'tesseract')

  # Copy training data
  file.copy('tesseract/share/tessdata', '../inst/', recursive = TRUE)
  download.file("https://github.com/tesseract-ocr/tessdata_fast/raw/4.1.0/eng.traineddata",
                "../inst/tessdata/eng.traineddata", mode = "wb", quiet = TRUE)
  download.file("https://github.com/tesseract-ocr/tessdata_fast/raw/4.1.0/osd.traineddata",
                "../inst/tessdata/osd.traineddata", mode = "wb", quiet = TRUE)
  invisible()
}


================================================
FILE: vignettes/intro.Rmd
================================================
---
title: "Using the Tesseract OCR engine in R"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
    fig_caption: false
vignette: >
  %\VignetteIndexEntry{Using the Tesseract OCR engine in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, echo = FALSE, message = FALSE}
library(tibble)
#knitr::opts_chunk$set(comment = "")
has_nld <- "nld" %in% tesseract::tesseract_info()$available
if(identical(Sys.info()[['user']], 'jeroen')) stopifnot(has_nld)
if(grepl('tesseract.Rcheck', getwd())){
  Sys.sleep(10) #workaround for CPU time check
}
```

The tesseract package provides R bindings to [Tesseract](https://github.com/tesseract-ocr/tesseract): a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results.

Keep in mind that OCR (pattern recognition in general) is a very difficult problem for computers. Results will rarely be perfect and the accuracy decreases rapidly as the quality of the input image drops. But if you can get your input images to reasonable quality, Tesseract can often help to extract most of the text from the image.

## Extract Text from Images

OCR is the process of finding and recognizing text inside images, for example from a screenshot or a scanned page. The image below has some example text:

![test](https://jeroen.github.io/images/testocr.png){data-external=1}

```{r}
library(tesseract)
eng <- tesseract("eng")
text <- tesseract::ocr("https://jeroen.github.io/images/testocr.png", engine = eng)
cat(text)
```

Not bad! The `ocr_data()` function returns all words in the image along with a bounding box and confidence rate.

```{r}
results <- tesseract::ocr_data("https://jeroen.github.io/images/testocr.png", engine = eng)
results
```
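
The `bbox` column stores each bounding box as a single `x1,y1,x2,y2` string. The sketch below shows one way to split it into numeric columns and filter on confidence using base R. It runs on a small hand-made data frame (`results_example`, with made-up words, confidences and coordinates) that only mimics the shape of an `ocr_data()` result; the cutoff of 80 is an arbitrary assumption, not a recommended value.

```{r}
# Hypothetical sample mimicking the shape of an ocr_data() result;
# the words, confidences and coordinates are made up for illustration.
results_example <- data.frame(
  word = c("This", "is", "text"),
  confidence = c(96.5, 95.1, 42.0),
  bbox = c("36,92,96,116", "109,92,129,116", "141,98,189,116"),
  stringsAsFactors = FALSE
)

# Split each "x1,y1,x2,y2" string into four integer columns
coords <- do.call(rbind, strsplit(results_example$bbox, ",", fixed = TRUE))
storage.mode(coords) <- "integer"
colnames(coords) <- c("x1", "y1", "x2", "y2")
parsed <- cbind(results_example[c("word", "confidence")], coords)

# Derive the box size and keep only words above an arbitrary confidence cutoff
parsed$width <- parsed$x2 - parsed$x1
parsed$height <- parsed$y2 - parsed$y1
confident <- parsed[parsed$confidence >= 80, ]
confident
```

The same steps should apply to a real result by replacing `results_example` with the data frame returned by `ocr_data()`.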

## Language Data

The tesseract OCR engine uses language-specific training data to recognize words. The OCR algorithms are biased towards words and sentences that frequently appear together in a given language, just like the human brain is. Therefore the most accurate results will be obtained when using training data in the correct language.

Use `tesseract_info()` to list the languages that you currently have installed.

```{r}
tesseract_info()
```

By default the R package only includes English training data. Windows and Mac users can install additional training data using `tesseract_download()`. Let's OCR a screenshot from Wikipedia in Dutch (Nederlands):

[![utrecht](https://jeroen.github.io/images/utrecht2.png)](https://nl.wikipedia.org/wiki/Geschiedenis_van_de_stad_Utrecht)

```{r, eval=FALSE}
# Only need to download once:
tesseract_download("nld")
```

```{r eval = has_nld}
# Now load the dictionary
(dutch <- tesseract("nld"))
text <- ocr("https://jeroen.github.io/images/utrecht2.png", engine = dutch)
cat(text)
```

As you can see immediately: almost perfect! (OK, just take my word for it.)


## Preprocessing with Magick

The accuracy of the OCR process depends on the quality of the input image. You can often improve results by properly scaling the image, removing noise and artifacts or cropping the area where the text exists. See [tesseract wiki: improve quality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) for important tips to improve the quality of your input image.

The awesome [magick](https://cran.r-project.org/package=magick/vignettes/intro.html) R package has many useful functions that can be used for enhancing the quality of the image. Some things to try:

 - If your image is skewed, use `image_deskew()` and `image_rotate()` to make the text horizontal.
 - `image_trim()` crops out whitespace in the margins. Increase the `fuzz` parameter to make it work for noisy whitespace.
 - Use `image_convert()` to turn the image into greyscale, which can reduce artifacts and enhance actual text.
 - If your image is very large or small, resizing with `image_resize()` can help tesseract determine text size.
 - Use `image_modulate()` or `image_contrast()` to tweak brightness / contrast if this is an issue.
 - Try `image_reducenoise()` for automated noise removal. Your mileage may vary.
 - With `image_quantize()` you can reduce the number of colors in the image. This can sometimes help with increasing contrast and reducing artifacts.
 - True imaging ninjas can use `image_convolve()` to use custom [convolution methods](https://ropensci.org/technotes/2017/11/02/image-convolve/). 

Below is an example OCR scan. The code resizes the image, converts it to grayscale and trims the margins before feeding it to tesseract to get more accurate OCR results.

![bowers](https://jeroen.github.io/images/bowers.jpg){data-external=1}


```{r}
library(magick)
input <- image_read("https://jeroen.github.io/images/bowers.jpg")

text <- input %>%
  image_resize("2000x") %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr() 

cat(text)
```


## Read from PDF files

If your images are stored in PDF files, they first need to be converted to a proper image format. We can do this in R using the `pdf_convert()` function from the pdftools package. Use a high DPI to preserve the quality of the image.

```{r, eval=require(pdftools)}
pngfile <- pdftools::pdf_convert('https://jeroen.github.io/images/ocrscan.pdf', dpi = 600)
text <- tesseract::ocr(pngfile)
cat(text)
```
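
For a PDF with several pages, `pdf_convert()` writes one image file per page and returns the file names. The sketch below, which is only one way to handle this, runs OCR on each page separately and cleans up the temporary images afterwards. The names `pngfiles` and `pages` are purely illustrative.

```{r, eval=FALSE}
# One PNG file per page; the return value is a character vector of file names
pngfiles <- pdftools::pdf_convert('https://jeroen.github.io/images/ocrscan.pdf', dpi = 600)

# OCR every page and collect the text in a character vector (one element per page)
pages <- vapply(pngfiles, function(f) tesseract::ocr(f), character(1))

# Print the pages separated by a blank line
cat(pages, sep = "\n\n")

# Remove the intermediate image files
unlink(pngfiles)
```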


## Tesseract Control Parameters

Tesseract supports hundreds of "control parameters" which alter the OCR engine. Use `tesseract_params()` to list all parameters with their default value and a brief description. It also has a handy `filter` argument to quickly find parameters that match a particular string.

```{r}
# List all parameters with *colour* in name or description
tesseract_params('colour')
```

Do note that some of the control parameters have changed between Tesseract engine 3 and 4.

```{r}
tesseract::tesseract_info()['version']
```

### Whitelist / Blacklist characters

One powerful parameter is `tessedit_char_whitelist`, which restricts the output to a limited set of characters. This can be useful when reading numbers, for example a bank account number, a zip code, or a gas meter.

The whitelist parameter works in all versions of Tesseract engine 3 and in engine versions 4.1 and higher, but unfortunately it does not work in Tesseract 4.0.
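
If your code has to run against different Tesseract installations, you could guard the whitelist with a small version check. The helper below is hypothetical (it is not part of the tesseract package) and simply encodes the rule stated above; it assumes the version string starts with `major.minor` and returns `NA` when it does not.

```{r, eval=FALSE}
# Hypothetical helper: TRUE when the whitelist is expected to work,
# FALSE for Tesseract 4.0, NA when the version string cannot be parsed
whitelist_supported <- function(version) {
  # Keep only the leading "major.minor" part, e.g. "4.1" from "4.1.1-rc2"
  major_minor <- regmatches(version, regexpr("^[0-9]+\\.[0-9]+", version))
  if (length(major_minor) == 0) return(NA)
  v <- numeric_version(major_minor)
  v < "4.0" || v >= "4.1"
}
```

You would call it as `whitelist_supported(tesseract_info()$version)` before relying on `tessedit_char_whitelist`.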


![receipt](https://jeroen.github.io/images/receipt.png){data-external=1}

```{r}
numbers <- tesseract(options = list(tessedit_char_whitelist = "$.0123456789"))
cat(ocr("https://jeroen.github.io/images/receipt.png", engine = numbers))
```

To test if this actually works, look at what happens if we remove the `$` from `tessedit_char_whitelist`:

```{r}
# Do not allow dollar signs
numbers2 <- tesseract(options = list(tessedit_char_whitelist = ".0123456789"))
cat(ocr("https://jeroen.github.io/images/receipt.png", engine = numbers2))
```
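
The opposite approach is to blacklist characters with Tesseract's `tessedit_char_blacklist` parameter, which excludes the listed characters and leaves everything else allowed. The sketch below is an untested variation on the receipt example; the engine name `no_dollar` is just illustrative, and the parameter may be subject to the same version caveat as the whitelist.

```{r, eval=FALSE}
# Allow every character except the dollar sign
no_dollar <- tesseract(options = list(tessedit_char_blacklist = "$"))
receipt_text <- ocr("https://jeroen.github.io/images/receipt.png", engine = no_dollar)
cat(receipt_text)
```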