[
  {
    "path": ".Rbuildignore",
    "content": "^.*\\.Rproj$\n^\\.Rproj\\.user$\n^src/Makevars$\n^windows\n\\.pdf$\n\\.png$\n\\.webp$\n\\.jpeg$\n\\.o$\n\\.dll$\n^\\.travis\\.yml$\n^appveyor\\.yml$\n^README.md$\nvignettes/.*\\.png$\n^configure.log$\n^\\.github$\n^\\.deps$\n"
  },
  {
    "path": ".github/.gitignore",
    "content": "*.html\n"
  },
  {
    "path": ".github/workflows/R-CMD-check.yaml",
    "content": "# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples\n# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help\non:\n  push:\n  pull_request:\n\nname: R-CMD-check.yaml\n\npermissions: read-all\n\njobs:\n  R-CMD-check:\n    runs-on: ${{ matrix.config.os }}\n\n    name: ${{ matrix.config.os }} (${{ matrix.config.r }})\n\n    strategy:\n      fail-fast: false\n      matrix:\n        config:\n          - {os: macos-15-intel,  r: 'release'}\n          - {os: macos-latest,    r: 'next'}\n          - {os: windows-latest , r: '4.1'}\n          - {os: windows-latest , r: '4.2'}\n          - {os: windows-latest , r: '4.3'}\n          - {os: windows-latest , r: '4.4'}\n          - {os: windows-latest , r: 'devel'}\n          - {os: ubuntu-latest,   r: 'devel', http-user-agent: 'release'}\n          - {os: ubuntu-latest,   r: 'release'}\n          - {os: ubuntu-latest,   r: 'oldrel-4'}\n\n    env:\n      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}\n      R_KEEP_PKG_SOURCE: yes\n\n    steps:\n      - uses: actions/checkout@v4\n\n      - uses: r-lib/actions/setup-pandoc@v2\n\n      - uses: r-lib/actions/setup-r@v2\n        with:\n          r-version: ${{ matrix.config.r }}\n          http-user-agent: ${{ matrix.config.http-user-agent }}\n          use-public-rspm: true\n\n      - uses: r-lib/actions/setup-r-dependencies@v2\n        with:\n          extra-packages: any::rcmdcheck\n          needs: check\n\n      - uses: r-lib/actions/check-r-package@v2\n        env:\n          MAKEFLAGS: -j4\n"
  },
  {
    "path": ".gitignore",
    "content": "*.o\n*.so\n*.dll\n*.a\n*.txt\n*.pdf\n*.png\n*.webp\n*.jpeg\n.Rproj.user\n.Rhistory\ninst/tessdata\nwindows\nsrc/Makevars\nconfigure.log\n"
  },
  {
    "path": "DESCRIPTION",
    "content": "Package: tesseract\nType: Package\nTitle: Open Source OCR Engine\nVersion: 5.2.5\nAuthors@R: person(\"Jeroen\", \"Ooms\", role = c(\"aut\", \"cre\"), email = \"jeroenooms@gmail.com\",\n    comment = c(ORCID = \"0000-0002-4035-0289\"))\nDescription: Bindings to 'Tesseract': \n     a powerful optical character recognition (OCR) engine that supports over 100 languages.\n     The engine is highly configurable in order to tune the detection algorithms and\n     obtain the best possible results.\nLicense: Apache License 2.0\nURL: https://docs.ropensci.org/tesseract/\n    https://ropensci.r-universe.dev/tesseract\nBugReports: https://github.com/ropensci/tesseract/issues\nSystemRequirements: Tesseract >= 3.03 (libtesseract-dev / tesseract-devel) and\n    Leptonica (libleptonica-dev / leptonica-devel). On Debian you need to install\n    the English training data separately (tesseract-ocr-eng)\nImports:\n    Rcpp (>= 0.12.12),\n    pdftools (>= 1.5),    \n    curl,\n    rappdirs,\n    digest\nLinkingTo: Rcpp\nRoxygenNote: 7.3.3\nRoxygen: list(markdown = TRUE)\nSuggests:\n    magick (>= 1.7),\n    spelling,\n    knitr,\n    tibble,\n    rmarkdown\nEncoding: UTF-8\nVignetteBuilder: knitr\nLanguage: en-US\n"
  },
  {
    "path": "NAMESPACE",
    "content": "# Generated by roxygen2: do not edit by hand\n\nS3method(print,tesseract)\nexport(ocr)\nexport(ocr_data)\nexport(tesseract)\nexport(tesseract_download)\nexport(tesseract_info)\nexport(tesseract_params)\nimportFrom(Rcpp,sourceCpp)\nuseDynLib(tesseract)\n"
  },
  {
    "path": "NEWS",
    "content": "5.2.5\n  - Wrap examples in donttest for cran policies\n\n5.2.4\n  - Do not use CXX11 anymore in configure script (fixes R-4.6)\n\n5.2.1\n  - Fix shell script for cross compilation\n\n5.2.0\n  - Windows: update to tesseract 5.3.2\n\n5.1.0\n  - Win: update to tesseract 5.1.0.\n  - Win: apply patch for freezes when running under UTF-8 in R-4.2.\n    See: https://github.com/tesseract-ocr/tesseract/issues/3830\n\n5.0.0\n  - Win/Mac: update to libtesseract 5.0.1\n  - Remove locale workaround on libtesseract 4.1+ (should only be needed for 4.0)\n  - Remove cruft that was needed to support Solaris\n\n4.2.0\n  - Prepare for API changes in upcoming Tesseract 5 release\n  - Change the default language=\"eng\" in tesseract()\n\n4.1.2\n  - Fix for upstream master/main renames in language repos\n\n4.1.1\n  - Win/Mac: update to libtesseract 4.1.1\n\n4.1\n  - Fix memory leak in ocr_data()\n  - Windows / MacOS: update to libtesseract 4.1.0. This re-enables\n    the whitelist/blacklist options that were missing in Tesseract 4.0\n\n4.0\n  - Windows, MacOS: Upgrade to upstream Tesseract 4.0! Completely new OCR engine.\n  - Tesseract 4 has a new training data format. On Windows / MacOS you need to\n    re-download your language data with tesseract_download(). 
The package uses\n    separate directories for storing Tesseract 3 vs 4 data so they shouldn't get\n    mixed up (hopefully).\n  - Drop hard-dependency on tibble (only load if available)\n\n2.3\n  - Fix problem with setlocale() not properly restoring locale.\n  - Switch examples from dontrun{} to donttest{}, and '--run-donttest' on travis/appveyor\n\n2.2\n  - Fixes for breaking changes in Tesseract 4.0.0 beta.3\n  - Set LC_ALL = C when initiating tesseract\n  - Include <tesseract/*> to support Tesseract 4\n\n2.1\n  - Fixes for 4.0.0-beta.1: they switched to semver + other data branch\n  - Set LC_CTYPE to \"C\" when loading training data (required for some Asian languages)\n  - Add back OSD training data on Windows\n\n2.0\n  - Set tesseract parameters at init so that all parameter types now actually work!\n  - New function tesseract_params() lists all supported parameters and their defaults\n  - Added 'config' argument to tesseract() which specifies a file with parameter values\n  - Internally validate parameter names before init to prevent tesseract crashes\n  - Rewrite the ocr_data() function in C++ to make it much faster\n  - Tesseract 4 now gets data from the tessdata_fast repo as recommended upstream\n  - Use default resolution of 300dpi when image does not contain resolution info\n\n1.9\n  - Tesseract 4 now downloads training data from the \"tessdata_fast\" repo\n  - Add ocr_data() function that parses the hOCR output\n\n1.8\n  - Add support for HOCR output (#20)\n  - Remove 'script' and 'orientation' attributes in output (doesn't work in Tesseract 4)\n\n1.7 (internal)\n  - Add support for upcoming Tesseract 4 (compiler fix + separate tessdata dir)\n  - Configure script now explicitly tests for CXX11 (required by Tesseract 4)\n\n1.6\n  - Windows: update libtesseract to 3.05.01\n  - tesseract_download now uses 3.04 tree (instead of 4.00) as suggested in readme\n  - For static packages on Win/Mac, languages stored in: rappdirs::user_data_dir('tesseract')\n  - Use 'png' 
instead of 'tiff' to read magick images\n  - Compile with $(C_VISIBILITY) to hide internal symbols (requires Rcpp 0.12.12)\n  - Use Rcpp symbol registration\n\n1.4\n  - Run engine finalizer on R exit (requires Rcpp 0.12.10)\n  - Move autobrew script to separate repository\n  - Add symbol registration\n\n1.3\n  - tesseract() gains an 'options' parameter for setting engine variables\n  - New tesseract_download() function for installing training data on Win/Mac\n  - Initiate default tesseract engine onAttach() to fail for missing training data\n  - Add support for ocr() on magick images\n\n1.2\n  - Try to fix build for CRAN OS-X, again.\n\n1.1\n  - Try to fix build for CRAN OS-X build server\n  - Show 'loaded' and 'available' languages in print.tesseract()\n\n1.0\n  - Initial CRAN release\n"
  },
  {
    "path": "R/RcppExports.R",
    "content": "# Generated by using Rcpp::compileAttributes() -> do not edit by hand\n# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393\n\ntesseract_config <- function() {\n    .Call('_tesseract_tesseract_config', PACKAGE = 'tesseract')\n}\n\ntesseract_engine_internal <- function(datapath, language, confpaths, opt_names, opt_values) {\n    .Call('_tesseract_tesseract_engine_internal', PACKAGE = 'tesseract', datapath, language, confpaths, opt_names, opt_values)\n}\n\ntesseract_engine_set_variable <- function(ptr, name, value) {\n    .Call('_tesseract_tesseract_engine_set_variable', PACKAGE = 'tesseract', ptr, name, value)\n}\n\nvalidate_params <- function(params) {\n    .Call('_tesseract_validate_params', PACKAGE = 'tesseract', params)\n}\n\nengine_info_internal <- function(ptr) {\n    .Call('_tesseract_engine_info_internal', PACKAGE = 'tesseract', ptr)\n}\n\nprint_params <- function(filename) {\n    .Call('_tesseract_print_params', PACKAGE = 'tesseract', filename)\n}\n\nget_param_values <- function(ptr, params) {\n    .Call('_tesseract_get_param_values', PACKAGE = 'tesseract', ptr, params)\n}\n\nocr_raw <- function(input, ptr, HOCR = FALSE) {\n    .Call('_tesseract_ocr_raw', PACKAGE = 'tesseract', input, ptr, HOCR)\n}\n\nocr_file <- function(file, ptr, HOCR = FALSE) {\n    .Call('_tesseract_ocr_file', PACKAGE = 'tesseract', file, ptr, HOCR)\n}\n\nocr_raw_data <- function(input, ptr) {\n    .Call('_tesseract_ocr_raw_data', PACKAGE = 'tesseract', input, ptr)\n}\n\nocr_file_data <- function(file, ptr) {\n    .Call('_tesseract_ocr_file_data', PACKAGE = 'tesseract', file, ptr)\n}\n\n"
  },
  {
    "path": "R/ocr.R",
    "content": "#' Tesseract OCR\n#'\n#' Extract text from an image. Requires that you have training data for the language you\n#' are reading. Works best for images with high contrast, little noise and horizontal text.\n#' See [tesseract wiki](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) and\n#' our package vignette for image preprocessing tips.\n#'\n#' The `ocr()` function returns plain text by default, or hOCR text if hOCR is set to `TRUE`.\n#' The `ocr_data()` function returns a data frame with a confidence rate and bounding box for\n#' each word in the text.\n#'\n#' @export\n#' @useDynLib tesseract\n#' @family tesseract\n#' @param image file path, url, or raw vector to image (png, tiff, jpeg, etc)\n#' @param engine a tesseract engine created with [tesseract()]. Alternatively a\n#' language string which will be passed to [tesseract()].\n#' @param HOCR if `TRUE` return results as HOCR xml instead of plain text\n#' @rdname ocr\n#' @references [Tesseract: Improving Quality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)\n#' @importFrom Rcpp sourceCpp\n#' @examples \\donttest{\n#' text <- ocr(\"https://jeroen.github.io/images/testocr.png\")\n#' cat(text)\n#'\n#' xml <- ocr(\"https://jeroen.github.io/images/testocr.png\", HOCR = TRUE)\n#' cat(xml)\n#'\n#' df <- ocr_data(\"https://jeroen.github.io/images/testocr.png\")\n#' print(df)\n#'\n#' # Full roundtrip test: render PDF to image and OCR it back to text\n#' curl::curl_download(\"https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf\", \"R-intro.pdf\")\n#' orig <- pdftools::pdf_text(\"R-intro.pdf\")[1]\n#'\n#' # Render pdf to png image\n#' img_file <- pdftools::pdf_convert(\"R-intro.pdf\", format = 'tiff', pages = 1, dpi = 400)\n#' unlink(\"R-intro.pdf\")\n#'\n#' # Extract text from png image\n#' text <- ocr(img_file)\n#' unlink(img_file)\n#' cat(text)\n#' }\n#'\n#' engine <- tesseract(options = list(tessedit_char_whitelist = \"0123456789\"))\nocr <- function(image, engine 
= tesseract(\"eng\"), HOCR = FALSE) {\n  if(is.character(engine))\n    engine <- tesseract(engine)\n  stopifnot(inherits(engine, \"tesseract\"))\n  if(inherits(image, \"magick-image\")){\n    vapply(image, function(x){\n      tmp <- tempfile(fileext = \".png\")\n      on.exit(unlink(tmp))\n      magick::image_write(x, tmp, format = 'PNG', density = '300x300')\n      ocr(tmp, engine = engine, HOCR = HOCR)\n    }, character(1))\n  } else if(is.character(image)){\n    image <- download_files(image)\n    vapply(image, ocr_file, character(1), ptr = engine, HOCR = HOCR, USE.NAMES = FALSE)\n  } else if(is.raw(image)){\n    ocr_raw(image, engine, HOCR = HOCR)\n  } else {\n    stop(\"Argument 'image' must be file-path, url or raw vector\")\n  }\n}\n\n#' @rdname ocr\n#' @export\nocr_data <- function(image, engine = tesseract(\"eng\")) {\n  if(is.character(engine))\n    engine <- tesseract(engine)\n  stopifnot(inherits(engine, \"tesseract\"))\n  df_list <- if(inherits(image, \"magick-image\")){\n    lapply(image, function(x){\n      tmp <- tempfile(fileext = \".png\")\n      on.exit(unlink(tmp))\n      magick::image_write(x, tmp, format = 'PNG', density = '300x300')\n      ocr_data(tmp, engine = engine)\n    })\n  } else if(is.character(image)){\n    image <- download_files(image)\n    lapply(image, function(im){\n      ocr_file_data(im, ptr = engine)\n    })\n  } else if(is.raw(image)){\n    list(ocr_raw_data(image, engine))\n  } else {\n    stop(\"Argument 'image' must be file-path, url or raw vector\")\n  }\n  df_as_tibble(do.call(rbind.data.frame, unname(df_list)))\n}\n"
  },
  {
    "path": "R/onload.R",
    "content": ".onLoad <- function(lib, pkg){\n  pkgdir <- file.path(lib, pkg)\n  version <- tesseract_version_major()\n  appname <- ifelse(version < 4, \"tesseract\", paste0(\"tesseract\", version))\n  sysdir <- rappdirs::user_data_dir(appname)\n  pkgdata <- normalizePath(file.path(pkgdir, \"tessdata\"), mustWork = FALSE)\n  sysdata <- normalizePath(file.path(sysdir, \"tessdata\"), mustWork = FALSE)\n  if(!is_testload() && file.exists(pkgdata) && !file.exists(file.path(sysdata, \"eng.traineddata\"))){\n    dir.create(sysdir, showWarnings = FALSE, recursive = TRUE)\n    if(file.exists(sysdir)){\n      onload_notify()\n      olddir <- getwd()\n      on.exit(setwd(olddir))\n      setwd(pkgdir)\n      file.copy(\"tessdata\", sysdir, recursive = TRUE)\n    }\n  }\n  if(is.na(Sys.getenv(\"TESSDATA_PREFIX\", NA))){\n    if(file.exists(file.path(sysdata, \"eng.traineddata\"))){\n      Sys.setenv(TESSDATA_PREFIX = sysdata)\n    } else if(file.exists(file.path(pkgdata, \"eng.traineddata\"))){\n      Sys.setenv(TESSDATA_PREFIX = pkgdata)\n    }\n  }\n\n  if(grepl('tesseract.Rcheck', getwd(), fixed = TRUE)){\n    Sys.setenv(OMP_THREAD_LIMIT=2)\n    Sys.setenv(OMP_NUM_THREADS=2)\n  }\n}\n\ntesseract_version_major <- function(){\n  as.numeric(substring(tesseract_config()$version, 1, 1))\n}\n\nonload_notify <- function(){\n  message(\"First use of Tesseract: copying language data...\\n\")\n}\n\nis_testload <- function(){\n  as.logical(nchar(Sys.getenv(\"R_INSTALL_PKG\")))\n}\n\n.onUnload <- function(lib){\n  Sys.unsetenv(\"TESSDATA_PREFIX\")\n}\n\n.onAttach <- function(lib, pkg){\n  check_training_data()\n\n  # Load tibble (if available) for pretty printing\n  if(interactive() && is.null(.getNamespace('tibble'))){\n    tryCatch({\n      getNamespace('tibble')\n    }, error= function(e){})\n  }\n}\n\ncheck_training_data <- function(){\n  tryCatch(tesseract(), error = function(e){\n    warning(\"Unable to find English training data\", call. 
= FALSE)\n    os <- utils::sessionInfo()$running\n    if(isTRUE(grepl(\"ubuntu|debian\", os, TRUE))){\n      stop(\"DEBIAN / UBUNTU: Please run: apt-get install tesseract-ocr-eng\")\n    }\n  })\n}\n"
  },
  {
    "path": "R/tessdata.R",
    "content": "#' Tesseract Training Data\n#'\n#' Helper function to download training data from the official\n#' [tessdata](https://tesseract-ocr.github.io/tessdoc/Data-Files) repository. On Linux, the fast training data can be installed directly with\n#' [yum](https://src.fedoraproject.org/rpms/tesseract) or\n#' [apt-get](https://packages.debian.org/search?suite=stable&section=all&arch=any&searchon=names&keywords=tesseract-ocr-).\n#'\n#' Tesseract uses training data to perform OCR. Most systems default to English\n#' training data. To improve OCR performance for other languages you can to install the\n#' training data from your distribution. For example to install the spanish training data:\n#'\n#'  - [tesseract-ocr-spa](https://packages.debian.org/testing/tesseract-ocr-spa) (Debian, Ubuntu)\n#'  - `tesseract-langpack-spa` (Fedora, EPEL)\n#'\n#' On Windows and MacOS you can install languages using the [tesseract_download] function\n#' which downloads training data directly from [github](https://github.com/tesseract-ocr/tessdata)\n#' and stores it in a the path on disk given by the `TESSDATA_PREFIX` variable.\n#'\n#' @export\n#' @aliases tessdata\n#' @rdname tessdata\n#' @family tesseract\n#' @param lang three letter code for language, see [tessdata](https://github.com/tesseract-ocr/tessdata) repository.\n#' @param datapath destination directory where to download store the file\n#' @param model either `fast` or `best` is currently supported. 
The latter downloads\n#' more accurate (but slower) trained models for Tesseract 4.0 or higher\n#' @param progress print progress while downloading\n#' @references [tesseract wiki: training data](https://tesseract-ocr.github.io/tessdoc/Data-Files)\n#' @examples \\dontrun{\n#' if(is.na(match(\"fra\", tesseract_info()$available)))\n#'   tesseract_download(\"fra\", model = 'best')\n#' french <- tesseract(\"fra\")\n#' text <- ocr(\"https://jeroen.github.io/images/french_text.png\", engine = french)\n#' cat(text)\n#' }\ntesseract_download <- function(lang, datapath = NULL, model = c(\"fast\", \"best\"), progress = interactive()) {\n  stopifnot(is.character(lang))\n  model <- match.arg(model)\n  if(!length(datapath)){\n    warn_on_linux()\n    datapath <- tesseract_info()$datapath\n  }\n  datapath <- normalizePath(datapath, mustWork = TRUE)\n  version <- tesseract_version_major()\n\n  if(version < 4){\n    repo <- \"tessdata\"\n    release <- \"3.04.00\"\n  } else {\n    repo <- paste0(\"tessdata_\", model)\n    release <- \"4.1.0\"\n  }\n\n  url <- sprintf(\"https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata\", repo, release, lang)\n\n  destfile <- file.path(datapath, basename(url))\n\n  if (file.exists(destfile)) {\n    message(paste(\"Training data already exists. Overwriting\", destfile))\n  }\n\n  req <- curl::curl_fetch_memory(url, curl::new_handle(\n    progressfunction = progress_fun,\n    noprogress = !isTRUE(progress)\n  ))\n  if(progress)\n    cat(\"\\n\")\n  if(req$status_code != 200)\n    stop(\"Download failed: HTTP \", req$status_code, call. 
= FALSE)\n\n  writeBin(req$content, destfile)\n  return(destfile)\n}\n\nprogress_fun <- function(down, up) {\n  total <- down[[1]]\n  now <- down[[2]]\n  pct <- if(length(total) && total > 0){\n    paste0(\"(\", round(now/total * 100), \"%)\")\n  } else {\n    \"\"\n  }\n  if(now > 10000)\n    cat(\"\\r Downloaded:\", sprintf(\"%.2f\", now / 2^20), \"MB \", pct)\n  TRUE\n}\n\nwarn_on_linux <- function(){\n  if(identical(.Platform$OS.type, \"unix\") && !identical(Sys.info()[[\"sysname\"]], \"Darwin\")){\n    warning(\"On Linux you should install training data via yum/apt. Please check the manual page.\", call. = FALSE)\n  }\n}\n"
  },
  {
    "path": "R/tesseract.R",
    "content": "#' Tesseract Engine\n#'\n#' Create an OCR engine for a given language and control parameters. This can be used by\n#' the [ocr] and [ocr_data] functions to recognize text.\n#'\n#' Tesseract control parameters can be set either via a named list in the\n#' `options` parameter, or in a `config` file text file which contains the parameter name\n#' followed by a space and then the value, one per line. Use [tesseract_params()] to list\n#' or find parameters. Note that that some parameters are only supported in certain versions\n#' of libtesseract, and that invalid parameters can sometimes cause libtesseract to crash.\n#'\n#' @export\n#' @rdname tesseract\n#' @family tesseract\n#' @param language string with language for training data. Usually defaults to `eng`\n#' @param datapath path with the training data for this language. Default uses\n#' the system library.\n#' @param configs character vector with files, each containing one or more parameter\n#' values. These config files can exist in the current directory or one of the standard\n#' tesseract config files that live in the tessdata directory. See details.\n#' @param options a named list with tesseract parameters. 
See details.\n#' @param cache speed things up by caching engines\ntesseract <- local({\n  store <- new.env()\n  function(language = \"eng\", datapath = NULL, configs = NULL, options = NULL, cache = TRUE){\n    datapath <- normalizePath(as.character(datapath), mustWork = TRUE)\n    language <- as.character(language)\n    configs <- as.character(configs)\n    options <- as.list(options)\n    if(isTRUE(cache)){\n      key <- digest::digest(list(language, datapath, configs, options))\n      if(is.null(store[[key]])){\n        ptr <- tesseract_engine(datapath, language, configs, options)\n        assign(key, ptr, store);\n      }\n      store[[key]]\n    } else {\n      tesseract_engine(datapath, language, configs, options)\n    }\n  }\n})\n\n#' @export\n#' @rdname tesseract\n#' @param filter only list parameters containing a particular string\n#' @examples tesseract_params('debug')\ntesseract_params <- function(filter = \"\"){\n  tmp <- print_params(tempfile())\n  on.exit(unlink(tmp))\n  df <- parse_params(tmp)\n  subset <- grepl(filter, paste(df$param, df$desc), ignore.case = TRUE)\n  df_as_tibble(df[subset,])\n}\n\n#' @export\n#' @rdname tesseract\ntesseract_info <- function(){\n  info <- engine_info_internal(tesseract())\n  config <- tesseract_config()\n  list(datapath = info$datapath,\n       available = info$available,\n       version = config$version,\n       configs = list.files(file.path(info$datapath, \"configs\")))\n}\n\nparse_params <- function(path){\n  utils::read.delim(path, header = FALSE, quote = \"\",\n                    col.names = c(\"param\", \"default\", \"desc\"), stringsAsFactors = FALSE)\n}\n\ntesseract_engine <- function(datapath, language, configs, options){\n\n  # Tesseract::read_config_file first checks for local file, then in tessdata\n  lapply(configs, function(confpath){\n    if(file.exists(confpath)){\n      params <- tryCatch(utils::read.table(confpath, quote = \"\"), error = function(e){\n        bail(\"Failed to parse config file 
'%s': %s\", confpath, e$message)\n      })\n      ok <- validate_params(params$V1)\n      if(any(!ok))\n        bail(\"Unsupported Tesseract parameter(s): [%s] in %s\", paste(params$V1[!ok], collapse = \", \"), confpath)\n    }\n  })\n\n  opt_names <- as.character(names(options))\n  opt_values <- as.character(options)\n  ok <- validate_params(opt_names)\n  if(any(!ok))\n    bail(\"Unsupported Tesseract parameter(s): [%s]\", paste(opt_names[!ok], collapse = \", \"))\n\n  tesseract_engine_internal(datapath, language, configs, opt_names, opt_values)\n}\n\ndownload_files <- function(urls){\n  files <- vapply(urls, function(path){\n    if(grepl(\"^https?://\", path)){\n      tmp <- tempfile(fileext =  basename(path))\n      curl::curl_download(path, tmp)\n      path <- tmp\n    }\n    normalizePath(path, mustWork = TRUE)\n  }, character(1))\n  is_pdf <- grepl(\"\\\\.pdf$\", files)\n  out <- unlist(lapply(files[is_pdf], function(path){\n    pdftools::pdf_convert(path, dpi = 600)\n  }))\n  c(files[!is_pdf], out)\n}\n\n#' @export\n\"print.tesseract\" <- function(x, ...){\n  info <- engine_info_internal(x)\n  cat(\"<tesseract engine>\\n\")\n  cat(\" loaded:\", info$loaded, \"\\n\")\n  cat(\" datapath:\", info$datapath, \"\\n\")\n  cat(\" available:\", info$available, \"\\n\")\n}\n\nbail <- function(...){\n  stop(sprintf(...), call. = FALSE)\n}\n\ndf_as_tibble <- function(df){\n  stopifnot(is.data.frame(df))\n  class(df) <- c(\"tbl_df\", \"tbl\", \"data.frame\")\n  df\n}\n"
  },
  {
    "path": "README.md",
    "content": "# tesseract\n\n> Bindings to [Tesseract-OCR](https://opensource.google/projects/tesseract): \n  a powerful optical character recognition (OCR) engine that supports over 100 languages.\n  The engine is highly configurable in order to tune the detection algorithms and\n  obtain the best possible results.\n\n[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)\n[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/tesseract)](https://cran.r-project.org/package=tesseract)\n[![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/tesseract)](https://cran.r-project.org/package=tesseract)\n\n - Upstream Tesseract-OCR documentation: https://tesseract-ocr.github.io/tessdoc/\n - Introduction: https://docs.ropensci.org/tesseract/articles/intro.html\n - Reference: https://docs.ropensci.org/tesseract/reference/ocr.html\n\n## Hello World\n\nSimple example\n\n```r\n# Simple example\ntext <- ocr(\"https://jeroen.github.io/images/testocr.png\")\ncat(text)\n\n# Get XML HOCR output\nxml <- ocr(\"https://jeroen.github.io/images/testocr.png\", HOCR = TRUE)\ncat(xml)\n```\n\nRoundtrip test: render PDF to image and OCR it back to text\n\n```r\n# Full roundtrip test: render PDF to image and OCR it back to text\ncurl::curl_download(\"https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf\", \"R-intro.pdf\")\norig <- pdftools::pdf_text(\"R-intro.pdf\")[1]\n\n# Render pdf to png image\nimg_file <- pdftools::pdf_convert(\"R-intro.pdf\", format = 'tiff', pages = 1, dpi = 400)\n\n# Extract text from png image\ntext <- ocr(img_file)\nunlink(img_file)\ncat(text)\n```\n\n## Installation\n\nOn Windows and MacOS the package binary package can be installed from CRAN:\n\n```r\ninstall.packages(\"tesseract\")\n```\n\nInstallation from source on Linux or OSX requires the `Tesseract` library (see below).\n\n### Install from 
source\n\nOn __Debian__ or __Ubuntu__ install [libtesseract-dev](https://packages.debian.org/testing/libtesseract-dev) and\n[libleptonica-dev](https://packages.debian.org/testing/libleptonica-dev). Also install [tesseract-ocr-eng](https://packages.debian.org/testing/tesseract-ocr-eng) to run examples.\n\n```\nsudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng\n```\n\nOn __Ubuntu__ you can optionally use [this PPA](https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr-devel) to get the latest version of Tesseract:\n\n```\nsudo add-apt-repository ppa:alex-p/tesseract-ocr-devel\nsudo apt-get install -y libtesseract-dev tesseract-ocr-eng\n```\n\nOn __Fedora__ we need [tesseract-devel](https://src.fedoraproject.org/rpms/tesseract) and\n[leptonica-devel](https://src.fedoraproject.org/rpms/leptonica):\n\n```\nsudo yum install tesseract-devel leptonica-devel\n```\n\nOn __RHEL__ and __CentOS__ we need [tesseract-devel](https://src.fedoraproject.org/rpms/tesseract) and\n[leptonica-devel](https://src.fedoraproject.org/rpms/leptonica) from EPEL:\n\n```\nsudo yum install epel-release\nsudo yum install tesseract-devel leptonica-devel\n```\n\n\nOn __OS-X__ use [tesseract](https://github.com/Homebrew/homebrew-core/blob/master/Formula/tesseract.rb) from Homebrew:\n\n```\nbrew install tesseract\n```\n\nTesseract uses training data to perform OCR. Most systems default to English\ntraining data. To improve OCR results for other languages you can install the\nappropriate training data. On Windows and OSX you can do this in R using \n`tesseract_download()`:\n\n\n```r\ntesseract_download('fra')\n```\n\nOn Linux you need to install the appropriate training data from your distribution. 
\nFor example, to install the Spanish training data:\n\n  - [tesseract-ocr-spa](https://packages.debian.org/testing/tesseract-ocr-spa) (Debian, Ubuntu)\n  - [tesseract-langpack-spa](https://src.fedoraproject.org/rpms/tesseract-langpack) (Fedora, EPEL)\n\nAlternatively you can manually download training data from [github](https://github.com/tesseract-ocr/tessdata)\nand store it in a path on disk that you pass in the `datapath` parameter or set a default path via the\n`TESSDATA_PREFIX` environment variable. Note that Tesseract 4 and Tesseract 3 use different \ntraining data formats. Make sure to download training data from the branch that matches your libtesseract version.\n\n"
  },
  {
    "path": "cleanup",
    "content": "#!/bin/sh\nrm -f src/Makevars configure.log autobrew\n"
  },
  {
    "path": "configure",
    "content": "# Anticonf (tm) script by Jeroen Ooms (2022)\n# This script will query 'pkg-config' for the required cflags and ldflags.\n# If pkg-config is unavailable or does not find the library, try setting\n# INCLUDE_DIR and LIB_DIR manually via e.g:\n# R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'\n\n# Library settings\nPKG_CONFIG_NAME=\"tesseract\"\nPKG_DEB_NAME=\"libtesseract-dev libleptonica-dev\"\nPKG_RPM_NAME=\"tesseract-devel leptonica-devel\"\nPKG_BREW_NAME=\"tesseract\"\nPKG_TEST_HEADER=\"<baseapi.h>\"\nPKG_CFLAGS=\"-I/usr/include/tesseract -I/usr/include/leptonica\"\nPKG_LIBS=\"-ltesseract\"\n\n# Use pkg-config if available\npkg-config --version >/dev/null 2>&1\nif [ $? -eq 0 ]; then\n  PKGCONFIG_CFLAGS=`pkg-config --cflags --silence-errors ${PKG_CONFIG_NAME}`\n  PKGCONFIG_LIBS=`pkg-config --libs ${PKG_CONFIG_NAME}`\nfi\n# Note that cflags may be empty in case of success\nif [ \"$INCLUDE_DIR\" ] || [ \"$LIB_DIR\" ]; then\n  echo \"Found INCLUDE_DIR and/or LIB_DIR!\"\n  PKG_CFLAGS=\"-I$INCLUDE_DIR $PKG_CFLAGS\"\n  PKG_LIBS=\"-L$LIB_DIR $PKG_LIBS\"\nelif [ \"$PKGCONFIG_CFLAGS\" ] || [ \"$PKGCONFIG_LIBS\" ]; then\n  echo \"Found pkg-config cflags and libs!\"\n  PKG_CFLAGS=${PKGCONFIG_CFLAGS}\n  PKG_LIBS=${PKGCONFIG_LIBS}\nelif [ `uname` = \"Darwin\" ]; then\n  test ! \"$CI\" && brew --version 2>/dev/null\n  if [ $? -eq 0 ]; then\n    BREWDIR=`brew --prefix`\n    PKG_CFLAGS=\"-I$BREWDIR/include/tesseract -I$BREWDIR/include/leptonica\"\n    PKG_LIBS=\"-L$BREWDIR/lib $PKG_LIBS\"\n  else\n    curl -sfL \"https://autobrew.github.io/scripts/tesseract\" > autobrew\n    . 
./autobrew\n  fi\nfi\n\n# For debugging\necho \"Using PKG_CFLAGS=$PKG_CFLAGS\"\necho \"Using PKG_LIBS=$PKG_LIBS\"\n\n# Find compiler\nCXX=`${R_HOME}/bin/R CMD config CXX`\nCPPFLAGS=`${R_HOME}/bin/R CMD config CPPFLAGS`\n\n# Test configuration\necho \"Using CXX: ${CXX}\"\n${CXX} -E ${CPPFLAGS} ${PKG_CFLAGS} tools/test.cpp >/dev/null 2>configure.log\n\n# Customize the error\nif [ $? -ne 0 ]; then\n  echo \"--------------------------- [ANTICONF] --------------------------------\"\n  echo \"Configuration failed to find '$PKG_CONFIG_NAME' system library. Try installing:\"\n  echo \" * deb: $PKG_DEB_NAME (Debian, Ubuntu, etc)\"\n  echo \" * rpm: $PKG_RPM_NAME (Fedora, CentOS, RHEL)\"\n  echo \" * brew: $PKG_BREW_NAME (Mac OSX)\"\n  echo \"If $PKG_CONFIG_NAME is already installed, check that 'pkg-config' is in your\"\n  echo \"PATH and PKG_CONFIG_PATH contains a $PKG_CONFIG_NAME.pc file. If pkg-config\"\n  echo \"is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:\"\n  echo \"R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'\"\n  echo \"-------------------------- [ERROR MESSAGE] ---------------------------\"\n  cat configure.log\n  echo \"--------------------------------------------------------------------\"\n  exit 1\nfi\n\n# Write to Makevars\nsed -e \"s|@cflags@|$PKG_CFLAGS|\" -e \"s|@libs@|$PKG_LIBS|\" src/Makevars.in > src/Makevars\n\n# Success\nexit 0\n"
  },
  {
    "path": "configure.win",
    "content": ""
  },
  {
    "path": "inst/AUTHORS",
    "content": "Authors of upstream tesseract library and training data:\n\nRay Smith (lead developer)\nAhmad Abdulkader\nRika Antonova\nNicholas Beato\nJeff Breidenbach\nSamuel Charron\nPhil Cheatle\nSimon Crouch\nDavid Eger\nSheelagh Huddleston\nDan Johnson\nRajesh Katikam\nThomas Kielbus\nDar-Shyang Lee\nZongyi (Joe) Liu\nRobert Moss\nChris Newton\nMichael Reimer\nMarius Renn\nRaquel Romano\nChristy Russon\nShobhit Saxena\nMark Seaman\nFaisal Shafait\nHiroshi Takenaka\nRanjith Unnikrishnan\nJoern Wanke\nPing Ping Xiu\nAndrew Ziem\nOscar Zuniga\n\nCommunity Contributors:\nZdenko Podobný (Maintainer)\nJim Regan (Maintainer)\nJames R Barlow\nAmit Dovev\nMartin Ettl\nTom Morris\nTobias Müller\nEgor Pugin\nSundar M. Vaidya\nStefan Weil\n"
  },
  {
    "path": "inst/COPYRIGHT",
    "content": "The package includes machine-generated training data which is released\nby Tesseract developers under Apache 2.0 license. Both data and license\nare available from: https://github.com/tesseract-ocr/tessdata\n"
  },
  {
    "path": "inst/WORDLIST",
    "content": "config\nEPEL\ngithub\ngreyscale\nhOCR\nHOCR\nhttps\njpeg\nknitr\nlangpack\nlibtesseract\nMacOS\nmagick\nMagick\nNederlands\nocr\nopensource\npdftools\npng\nrmarkdown\nspanish\ntessdata\ntoc\nutrecht\nVignetteEncoding\nVignetteEngine\nVignetteIndexEntry\n"
  },
  {
    "path": "man/ocr.Rd",
    "content": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/ocr.R\n\\name{ocr}\n\\alias{ocr}\n\\alias{ocr_data}\n\\title{Tesseract OCR}\n\\usage{\nocr(image, engine = tesseract(\"eng\"), HOCR = FALSE)\n\nocr_data(image, engine = tesseract(\"eng\"))\n}\n\\arguments{\n\\item{image}{file path, url, or raw vector to image (png, tiff, jpeg, etc)}\n\n\\item{engine}{a tesseract engine created with \\code{\\link[=tesseract]{tesseract()}}. Alternatively a\nlanguage string which will be passed to \\code{\\link[=tesseract]{tesseract()}}.}\n\n\\item{HOCR}{if \\code{TRUE} return results as HOCR xml instead of plain text}\n}\n\\description{\nExtract text from an image. Requires that you have training data for the language you\nare reading. Works best for images with high contrast, little noise and horizontal text.\nSee \\href{https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality}{tesseract wiki} and\nour package vignette for image preprocessing tips.\n}\n\\details{\nThe \\code{ocr()} function returns plain text by default, or hOCR text if hOCR is set to \\code{TRUE}.\nThe \\code{ocr_data()} function returns a data frame with a confidence rate and bounding box for\neach word in the text.\n}\n\\examples{\n\\donttest{\ntext <- ocr(\"https://jeroen.github.io/images/testocr.png\")\ncat(text)\n\nxml <- ocr(\"https://jeroen.github.io/images/testocr.png\", HOCR = TRUE)\ncat(xml)\n\ndf <- ocr_data(\"https://jeroen.github.io/images/testocr.png\")\nprint(df)\n\n# Full roundtrip test: render PDF to image and OCR it back to text\ncurl::curl_download(\"https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf\", \"R-intro.pdf\")\norig <- pdftools::pdf_text(\"R-intro.pdf\")[1]\n\n# Render pdf to png image\nimg_file <- pdftools::pdf_convert(\"R-intro.pdf\", format = 'tiff', pages = 1, dpi = 400)\nunlink(\"R-intro.pdf\")\n\n# Extract text from png image\ntext <- ocr(img_file)\nunlink(img_file)\ncat(text)\n}\n\nengine <- tesseract(options = 
list(tessedit_char_whitelist = \"0123456789\"))\n}\n\\references{\n\\href{https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality}{Tesseract: Improving Quality}\n}\n\\seealso{\nOther tesseract: \n\\code{\\link{tesseract}()},\n\\code{\\link{tesseract_download}()}\n}\n\\concept{tesseract}\n"
  },
  {
    "path": "man/tessdata.Rd",
    "content": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/tessdata.R\n\\name{tesseract_download}\n\\alias{tesseract_download}\n\\alias{tessdata}\n\\title{Tesseract Training Data}\n\\usage{\ntesseract_download(\n  lang,\n  datapath = NULL,\n  model = c(\"fast\", \"best\"),\n  progress = interactive()\n)\n}\n\\arguments{\n\\item{lang}{three letter code for language, see \\href{https://github.com/tesseract-ocr/tessdata}{tessdata} repository.}\n\n\\item{datapath}{destination directory where to download store the file}\n\n\\item{model}{either \\code{fast} or \\code{best} is currently supported. The latter downloads\nmore accurate (but slower) trained models for Tesseract 4.0 or higher}\n\n\\item{progress}{print progress while downloading}\n}\n\\description{\nHelper function to download training data from the official\n\\href{https://tesseract-ocr.github.io/tessdoc/Data-Files}{tessdata} repository. On Linux, the fast training data can be installed directly with\n\\href{https://src.fedoraproject.org/rpms/tesseract}{yum} or\n\\href{https://packages.debian.org/search?suite=stable&section=all&arch=any&searchon=names&keywords=tesseract-ocr-}{apt-get}.\n}\n\\details{\nTesseract uses training data to perform OCR. Most systems default to English\ntraining data. To improve OCR performance for other languages you can to install the\ntraining data from your distribution. 
For example, to install the Spanish training data:\n\\itemize{\n\\item \\href{https://packages.debian.org/testing/tesseract-ocr-spa}{tesseract-ocr-spa} (Debian, Ubuntu)\n\\item \\code{tesseract-langpack-spa} (Fedora, EPEL)\n}\n\nOn Windows and MacOS you can install languages using the \\link{tesseract_download} function\nwhich downloads training data directly from \\href{https://github.com/tesseract-ocr/tessdata}{github}\nand stores it in the path on disk given by the \\code{TESSDATA_PREFIX} variable.\n}\n\\examples{\n\\dontrun{\nif(is.na(match(\"fra\", tesseract_info()$available)))\n  tesseract_download(\"fra\", model = 'best')\nfrench <- tesseract(\"fra\")\ntext <- ocr(\"https://jeroen.github.io/images/french_text.png\", engine = french)\ncat(text)\n}\n}\n\\references{\n\\href{https://tesseract-ocr.github.io/tessdoc/Data-Files}{tesseract wiki: training data}\n}\n\\seealso{\nOther tesseract: \n\\code{\\link{ocr}()},\n\\code{\\link{tesseract}()}\n}\n\\concept{tesseract}\n"
  },
  {
    "path": "man/tesseract.Rd",
    "content": "% Generated by roxygen2: do not edit by hand\n% Please edit documentation in R/tesseract.R\n\\name{tesseract}\n\\alias{tesseract}\n\\alias{tesseract_params}\n\\alias{tesseract_info}\n\\title{Tesseract Engine}\n\\usage{\ntesseract(\n  language = \"eng\",\n  datapath = NULL,\n  configs = NULL,\n  options = NULL,\n  cache = TRUE\n)\n\ntesseract_params(filter = \"\")\n\ntesseract_info()\n}\n\\arguments{\n\\item{language}{string with language for training data. Usually defaults to \\code{eng}}\n\n\\item{datapath}{path with the training data for this language. Default uses\nthe system library.}\n\n\\item{configs}{character vector with files, each containing one or more parameter\nvalues. These config files can exist in the current directory or one of the standard\ntesseract config files that live in the tessdata directory. See details.}\n\n\\item{options}{a named list with tesseract parameters. See details.}\n\n\\item{cache}{speed things up by caching engines}\n\n\\item{filter}{only list parameters containing a particular string}\n}\n\\description{\nCreate an OCR engine for a given language and control parameters. This can be used by\nthe \\link{ocr} and \\link{ocr_data} functions to recognize text.\n}\n\\details{\nTesseract control parameters can be set either via a named list in the\n\\code{options} parameter, or in a \\code{config} file text file which contains the parameter name\nfollowed by a space and then the value, one per line. Use \\code{\\link[=tesseract_params]{tesseract_params()}} to list\nor find parameters. Note that that some parameters are only supported in certain versions\nof libtesseract, and that invalid parameters can sometimes cause libtesseract to crash.\n}\n\\examples{\ntesseract_params('debug')\n}\n\\seealso{\nOther tesseract: \n\\code{\\link{ocr}()},\n\\code{\\link{tesseract_download}()}\n}\n\\concept{tesseract}\n"
  },
  {
    "path": "src/Makevars.in",
    "content": "PKG_CPPFLAGS=@cflags@\nPKG_LIBS=@libs@\n\nPKG_CXXFLAGS=$(CXX_VISIBILITY)\n\nall: $(SHLIB) cleanup\n\ncleanup: $(SHLIB)\n\t@rm -Rf ../.deps\n"
  },
  {
    "path": "src/Makevars.win",
    "content": "RWINLIB = ../.deps/tesseract\nPKG_CPPFLAGS = -I$(RWINLIB)/include -I$(RWINLIB)/include/leptonica\n\nPKG_LIBS = \\\n\t-L$(RWINLIB)/lib${subst gcc,,${COMPILED_BY}}${R_ARCH} \\\n\t-L$(RWINLIB)/lib \\\n\t-ltesseract -lleptonica \\\n\t-ltiff -lopenjp2 -lwebp -lsharpyuv -ljpeg -lgif -lpng16 -lz \\\n\t-lws2_32\n\nall: $(SHLIB) cleanup\n\n# Needed for parallel make\n$(OBJECTS): | $(RWINLIB)\n\n$(RWINLIB):\n\t@\"${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe\" \"../tools/winlibs.R\"\n\ncleanup: $(SHLIB)\n\t@rm -Rf $(RWINLIB)\n"
  },
  {
    "path": "src/RcppExports.cpp",
    "content": "// Generated by using Rcpp::compileAttributes() -> do not edit by hand\n// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393\n\n#include \"tesseract_types.h\"\n#include <Rcpp.h>\n\nusing namespace Rcpp;\n\n#ifdef RCPP_USE_GLOBAL_ROSTREAM\nRcpp::Rostream<true>&  Rcpp::Rcout = Rcpp::Rcpp_cout_get();\nRcpp::Rostream<false>& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get();\n#endif\n\n// tesseract_config\nRcpp::List tesseract_config();\nRcppExport SEXP _tesseract_tesseract_config() {\nBEGIN_RCPP\n    Rcpp::RObject rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    rcpp_result_gen = Rcpp::wrap(tesseract_config());\n    return rcpp_result_gen;\nEND_RCPP\n}\n// tesseract_engine_internal\nTessPtr tesseract_engine_internal(Rcpp::CharacterVector datapath, Rcpp::CharacterVector language, Rcpp::CharacterVector confpaths, Rcpp::CharacterVector opt_names, Rcpp::CharacterVector opt_values);\nRcppExport SEXP _tesseract_tesseract_engine_internal(SEXP datapathSEXP, SEXP languageSEXP, SEXP confpathsSEXP, SEXP opt_namesSEXP, SEXP opt_valuesSEXP) {\nBEGIN_RCPP\n    Rcpp::RObject rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type datapath(datapathSEXP);\n    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type language(languageSEXP);\n    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type confpaths(confpathsSEXP);\n    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type opt_names(opt_namesSEXP);\n    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type opt_values(opt_valuesSEXP);\n    rcpp_result_gen = Rcpp::wrap(tesseract_engine_internal(datapath, language, confpaths, opt_names, opt_values));\n    return rcpp_result_gen;\nEND_RCPP\n}\n// tesseract_engine_set_variable\nTessPtr tesseract_engine_set_variable(TessPtr ptr, const char * name, const char * value);\nRcppExport SEXP _tesseract_tesseract_engine_set_variable(SEXP ptrSEXP, SEXP nameSEXP, SEXP valueSEXP) 
{\nBEGIN_RCPP\n    Rcpp::RObject rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);\n    Rcpp::traits::input_parameter< const char * >::type name(nameSEXP);\n    Rcpp::traits::input_parameter< const char * >::type value(valueSEXP);\n    rcpp_result_gen = Rcpp::wrap(tesseract_engine_set_variable(ptr, name, value));\n    return rcpp_result_gen;\nEND_RCPP\n}\n// validate_params\nRcpp::LogicalVector validate_params(Rcpp::CharacterVector params);\nRcppExport SEXP _tesseract_validate_params(SEXP paramsSEXP) {\nBEGIN_RCPP\n    Rcpp::RObject rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type params(paramsSEXP);\n    rcpp_result_gen = Rcpp::wrap(validate_params(params));\n    return rcpp_result_gen;\nEND_RCPP\n}\n// engine_info_internal\nRcpp::List engine_info_internal(TessPtr ptr);\nRcppExport SEXP _tesseract_engine_info_internal(SEXP ptrSEXP) {\nBEGIN_RCPP\n    Rcpp::RObject rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);\n    rcpp_result_gen = Rcpp::wrap(engine_info_internal(ptr));\n    return rcpp_result_gen;\nEND_RCPP\n}\n// print_params\nRcpp::String print_params(std::string filename);\nRcppExport SEXP _tesseract_print_params(SEXP filenameSEXP) {\nBEGIN_RCPP\n    Rcpp::RObject rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    Rcpp::traits::input_parameter< std::string >::type filename(filenameSEXP);\n    rcpp_result_gen = Rcpp::wrap(print_params(filename));\n    return rcpp_result_gen;\nEND_RCPP\n}\n// get_param_values\nRcpp::CharacterVector get_param_values(TessPtr ptr, Rcpp::CharacterVector params);\nRcppExport SEXP _tesseract_get_param_values(SEXP ptrSEXP, SEXP paramsSEXP) {\nBEGIN_RCPP\n    Rcpp::RObject rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);\n    
Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type params(paramsSEXP);\n    rcpp_result_gen = Rcpp::wrap(get_param_values(ptr, params));\n    return rcpp_result_gen;\nEND_RCPP\n}\n// ocr_raw\nRcpp::String ocr_raw(Rcpp::RawVector input, TessPtr ptr, bool HOCR);\nRcppExport SEXP _tesseract_ocr_raw(SEXP inputSEXP, SEXP ptrSEXP, SEXP HOCRSEXP) {\nBEGIN_RCPP\n    Rcpp::RObject rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    Rcpp::traits::input_parameter< Rcpp::RawVector >::type input(inputSEXP);\n    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);\n    Rcpp::traits::input_parameter< bool >::type HOCR(HOCRSEXP);\n    rcpp_result_gen = Rcpp::wrap(ocr_raw(input, ptr, HOCR));\n    return rcpp_result_gen;\nEND_RCPP\n}\n// ocr_file\nRcpp::String ocr_file(std::string file, TessPtr ptr, bool HOCR);\nRcppExport SEXP _tesseract_ocr_file(SEXP fileSEXP, SEXP ptrSEXP, SEXP HOCRSEXP) {\nBEGIN_RCPP\n    Rcpp::RObject rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    Rcpp::traits::input_parameter< std::string >::type file(fileSEXP);\n    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);\n    Rcpp::traits::input_parameter< bool >::type HOCR(HOCRSEXP);\n    rcpp_result_gen = Rcpp::wrap(ocr_file(file, ptr, HOCR));\n    return rcpp_result_gen;\nEND_RCPP\n}\n// ocr_raw_data\nRcpp::DataFrame ocr_raw_data(Rcpp::RawVector input, TessPtr ptr);\nRcppExport SEXP _tesseract_ocr_raw_data(SEXP inputSEXP, SEXP ptrSEXP) {\nBEGIN_RCPP\n    Rcpp::RObject rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    Rcpp::traits::input_parameter< Rcpp::RawVector >::type input(inputSEXP);\n    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);\n    rcpp_result_gen = Rcpp::wrap(ocr_raw_data(input, ptr));\n    return rcpp_result_gen;\nEND_RCPP\n}\n// ocr_file_data\nRcpp::DataFrame ocr_file_data(std::string file, TessPtr ptr);\nRcppExport SEXP _tesseract_ocr_file_data(SEXP fileSEXP, SEXP ptrSEXP) {\nBEGIN_RCPP\n    Rcpp::RObject 
rcpp_result_gen;\n    Rcpp::RNGScope rcpp_rngScope_gen;\n    Rcpp::traits::input_parameter< std::string >::type file(fileSEXP);\n    Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP);\n    rcpp_result_gen = Rcpp::wrap(ocr_file_data(file, ptr));\n    return rcpp_result_gen;\nEND_RCPP\n}\n\nstatic const R_CallMethodDef CallEntries[] = {\n    {\"_tesseract_tesseract_config\", (DL_FUNC) &_tesseract_tesseract_config, 0},\n    {\"_tesseract_tesseract_engine_internal\", (DL_FUNC) &_tesseract_tesseract_engine_internal, 5},\n    {\"_tesseract_tesseract_engine_set_variable\", (DL_FUNC) &_tesseract_tesseract_engine_set_variable, 3},\n    {\"_tesseract_validate_params\", (DL_FUNC) &_tesseract_validate_params, 1},\n    {\"_tesseract_engine_info_internal\", (DL_FUNC) &_tesseract_engine_info_internal, 1},\n    {\"_tesseract_print_params\", (DL_FUNC) &_tesseract_print_params, 1},\n    {\"_tesseract_get_param_values\", (DL_FUNC) &_tesseract_get_param_values, 2},\n    {\"_tesseract_ocr_raw\", (DL_FUNC) &_tesseract_ocr_raw, 3},\n    {\"_tesseract_ocr_file\", (DL_FUNC) &_tesseract_ocr_file, 3},\n    {\"_tesseract_ocr_raw_data\", (DL_FUNC) &_tesseract_ocr_raw_data, 2},\n    {\"_tesseract_ocr_file_data\", (DL_FUNC) &_tesseract_ocr_file_data, 2},\n    {NULL, NULL, 0}\n};\n\nRcppExport void R_init_tesseract(DllInfo *dll) {\n    R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);\n    R_useDynamicSymbols(dll, FALSE);\n}\n"
  },
  {
    "path": "src/tesseract.cpp",
    "content": "#include \"tesseract_types.h\"\n#if TESSERACT_MAJOR_VERSION < 5\n#include <tesseract/genericvector.h>\n#define getorat get\n#else\n#define STRING std::string\n#define GenericVector std::vector\n#define getorat at\n#endif\n\n/* libtesseract 4.0 insisted that the engine is initiated in 'C' locale.\n * We do this as exemplified in the example code in the libc manual:\n * https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html\n * Full discussion: https://github.com/tesseract-ocr/tesseract/issues/1670\n */\n#if TESSERACT_MAJOR_VERSION == 4 && TESSERACT_MINOR_VERSION == 0\n#define TESSERACT40\n#endif\n\nstatic tesseract::TessBaseAPI *make_analyze_api(){\n#ifdef TESSERACT40\n  char *old_ctype = strdup(setlocale(LC_ALL, NULL));\n  setlocale(LC_ALL, \"C\");\n#endif\n  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();\n  api->InitForAnalysePage();\n#ifdef TESSERACT40\n  setlocale(LC_ALL, old_ctype);\n  free(old_ctype);\n#endif\n  return api;\n}\n\n// [[Rcpp::export]]\nRcpp::List tesseract_config(){\n  tesseract::TessBaseAPI *api = make_analyze_api();\n  Rcpp::List out = Rcpp::List::create(\n    Rcpp::_[\"version\"] = tesseract::TessBaseAPI::Version(),\n    Rcpp::_[\"path\"] = api->GetDatapath()\n  );\n  api->End();\n  delete api;\n  return out;\n}\n\n// [[Rcpp::export]]\nTessPtr tesseract_engine_internal(Rcpp::CharacterVector datapath, Rcpp::CharacterVector language, Rcpp::CharacterVector confpaths,\n                                  Rcpp::CharacterVector opt_names, Rcpp::CharacterVector opt_values){\n  GenericVector<STRING> params, values;\n  const char * path = NULL;\n  const char * lang = NULL;\n  char * configs[1000] = {0};\n  if(datapath.length())\n    path = datapath.at(0);\n  if(language.length())\n    lang = language.at(0);\n  for(int i = 0; i < confpaths.length(); i++)\n    configs[i] = confpaths.at(i);\n  for(int i = 0; i < opt_names.length(); i++){\n    params.push_back(std::string(opt_names.at(i)).c_str());\n    
values.push_back(std::string(opt_values.at(i)).c_str());\n  }\n#ifdef TESSERACT40\n  char *old_ctype = strdup(setlocale(LC_ALL, NULL));\n  setlocale(LC_ALL, \"C\");\n#endif\n  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();\n  int err = api->Init(path, lang, tesseract::OEM_DEFAULT, configs, confpaths.length(), &params, &values, false);\n#ifdef TESSERACT40\n  setlocale(LC_ALL, old_ctype);\n  free(old_ctype);\n#endif\n  if(err){\n    delete api;\n    throw std::runtime_error(std::string(\"Unable to find training data for: \") + (lang ? lang : \"eng\") + \". Please consult manual for: ?tesseract_download\");\n  }\n  TessPtr ptr(api);\n  ptr.attr(\"class\") = Rcpp::CharacterVector::create(\"tesseract\");\n  return ptr;\n}\n\ntesseract::TessBaseAPI * get_engine(TessPtr engine){\n  tesseract::TessBaseAPI * api = engine.get();\n  if(api == NULL)\n    throw std::runtime_error(\"pointer is dead\");\n  return api;\n}\n\n// [[Rcpp::export]]\nTessPtr tesseract_engine_set_variable(TessPtr ptr, const char * name, const char * value){\n  tesseract::TessBaseAPI * api = get_engine(ptr);\n  if(!api->SetVariable(name, value))\n    throw std::runtime_error(std::string(\"Failed to set variable \") + name);\n  return ptr;\n}\n\n// [[Rcpp::export]]\nRcpp::LogicalVector validate_params(Rcpp::CharacterVector params){\n  STRING str;\n  tesseract::TessBaseAPI *api = make_analyze_api();\n  Rcpp::LogicalVector out(params.length());\n  for(int i = 0; i < params.length(); i++)\n    out[i] = api->GetVariableAsString(params.at(i), &str);\n  api->End();\n  delete api;\n  return out;\n}\n\n// [[Rcpp::export]]\nRcpp::List engine_info_internal(TessPtr ptr){\n  tesseract::TessBaseAPI * api = get_engine(ptr);\n  GenericVector<STRING> langs;\n  api->GetAvailableLanguagesAsVector(&langs);\n  Rcpp::CharacterVector available = Rcpp::CharacterVector::create();\n  for (size_t i = 0; i < langs.size(); i++)\n    available.push_back(langs.getorat(i).c_str());\n  langs.clear();\n  
api->GetLoadedLanguagesAsVector(&langs);\n  Rcpp::CharacterVector loaded = Rcpp::CharacterVector::create();\n  for (size_t i = 0; i < langs.size(); i++)\n    loaded.push_back(langs.getorat(i).c_str());\n  return Rcpp::List::create(\n    Rcpp::_[\"datapath\"] = api->GetDatapath(),\n    Rcpp::_[\"loaded\"] = loaded,\n    Rcpp::_[\"available\"] = available\n  );\n}\n\n// [[Rcpp::export]]\nRcpp::String print_params(std::string filename){\n  // Open the output file first so a failure cannot leak the API object\n  FILE * fp = fopen(filename.c_str(), \"w\");\n  if(fp == NULL)\n    throw std::runtime_error(\"Failed to open file for writing: \" + filename);\n  tesseract::TessBaseAPI *api = make_analyze_api();\n  api->PrintVariables(fp);\n  fclose(fp);\n  api->End();\n  delete api;\n  return filename;\n}\n\n// [[Rcpp::export]]\nRcpp::CharacterVector get_param_values(TessPtr ptr, Rcpp::CharacterVector params){\n  STRING str;\n  tesseract::TessBaseAPI * api = get_engine(ptr);\n  Rcpp::CharacterVector out(params.length());\n  for(int i = 0; i < params.length(); i++)\n    out[i] = api->GetVariableAsString(params.at(i), &str) ? Rcpp::String(str.c_str()) : NA_STRING;\n  return out;\n}\n\nRcpp::String ocr_pix(tesseract::TessBaseAPI * api, Pix * image, bool HOCR){\n  // Get OCR result\n  api->ClearAdaptiveClassifier();\n  api->SetImage(image);\n\n  // Workaround for annoying warning, see https://github.com/tesseract-ocr/tesseract/issues/756\n  if(api->GetSourceYResolution() < 70)\n    api->SetSourceResolution(300);\n  char *outText = HOCR ? 
api->GetHOCRText(0) : api->GetUTF8Text();\n\n  //cleanup\n  pixDestroy(&image);\n  api->Clear();\n\n  // Destroy used object and release memory\n  Rcpp::String y(outText);\n  y.set_encoding(CE_UTF8);\n  delete [] outText;\n  return y;\n}\n\n// [[Rcpp::export]]\nRcpp::String ocr_raw(Rcpp::RawVector input, TessPtr ptr, bool HOCR = false){\n    tesseract::TessBaseAPI *api = get_engine(ptr);\n    Pix *image =  pixReadMem(input.begin(), input.length());\n    if(!image)\n      throw std::runtime_error(\"Failed to read image\");\n    return ocr_pix(api, image, HOCR);\n}\n\n// [[Rcpp::export]]\nRcpp::String ocr_file(std::string file, TessPtr ptr, bool HOCR = false){\n  tesseract::TessBaseAPI *api = get_engine(ptr);\n  Pix *image =  pixRead(file.c_str());\n  if(!image)\n    throw std::runtime_error(\"Failed to read image\");\n  return ocr_pix(api, image, HOCR);\n}\n\nRcpp::DataFrame ocr_data_internal(tesseract::TessBaseAPI * api, Pix * image){\n  api->ClearAdaptiveClassifier();\n  api->SetImage(image);\n  if(api->GetSourceYResolution() < 70)\n    api->SetSourceResolution(300);\n  api->Recognize(0);\n  tesseract::ResultIterator* ri = api->GetIterator();\n  tesseract::PageIteratorLevel level = tesseract::RIL_WORD;\n  size_t n = 0;\n  std::list<std::string> words;\n  std::list<std::string> bbox;\n  std::list<float> conf;\n  char buf[100];\n  if (ri) {\n    do {\n      const char * word = ri->GetUTF8Text(level);\n      if(!word)\n        continue;\n      words.push_back(word);\n      conf.push_back(ri->Confidence(level));\n      int x1, y1, x2, y2;\n      ri->BoundingBox(level, &x1, &y1, &x2, &y2);\n      snprintf(buf, 100, \"%d,%d,%d,%d\", x1, y1, x2, y2);\n      bbox.push_back(buf);\n      delete[] word;\n      n++;\n    } while (ri->Next(level));\n  }\n  Rcpp::CharacterVector rwords(n);\n  Rcpp::CharacterVector rbbox(n);\n  Rcpp::NumericVector rconf(n);\n  for(size_t i = 0; i < n; i++) {\n    rwords[i] = words.front(); words.pop_front();\n    rbbox[i] = bbox.front(); 
bbox.pop_front();\n    rconf[i] = conf.front(); conf.pop_front();\n  }\n\n  //cleanup\n  pixDestroy(&image);\n  api->Clear();\n  delete ri;\n\n  return Rcpp::DataFrame::create(\n    Rcpp::_[\"word\"] = rwords,\n    Rcpp::_[\"confidence\"] = rconf,\n    Rcpp::_[\"bbox\"] = rbbox,\n    Rcpp::_[\"stringsAsFactors\"] = false\n  );\n}\n\n// [[Rcpp::export]]\nRcpp::DataFrame ocr_raw_data(Rcpp::RawVector input, TessPtr ptr){\n  tesseract::TessBaseAPI *api = get_engine(ptr);\n  Pix *image =  pixReadMem(input.begin(), input.length());\n  if(!image)\n    throw std::runtime_error(\"Failed to read image\");\n  return ocr_data_internal(api, image);\n}\n\n// [[Rcpp::export]]\nRcpp::DataFrame ocr_file_data(std::string file, TessPtr ptr){\n  tesseract::TessBaseAPI *api = get_engine(ptr);\n  Pix *image =  pixRead(file.c_str());\n  if(!image)\n    throw std::runtime_error(\"Failed to read image\");\n  return ocr_data_internal(api, image);\n}\n"
  },
  {
    "path": "src/tesseract_types.h",
    "content": "#include <tesseract/baseapi.h>\n#include <allheaders.h>\n\n#define R_NO_REMAP\n#define STRICT_R_HEADERS\n\n#include <Rcpp.h>\n\ninline void tess_finalizer(tesseract::TessBaseAPI *engine) {\n  engine->End();\n  delete engine;\n}\n\ntypedef Rcpp::XPtr<tesseract::TessBaseAPI, Rcpp::PreserveStorage, tess_finalizer, true> TessPtr;\n"
  },
  {
    "path": "tesseract.Rproj",
    "content": "Version: 1.0\nProjectId: 953b2ed1-ac9d-4be8-984d-c26d5c642f38\n\nRestoreWorkspace: Default\nSaveWorkspace: Default\nAlwaysSaveHistory: Default\n\nEnableCodeIndexing: Yes\nUseSpacesForTab: Yes\nNumSpacesForTab: 2\nEncoding: UTF-8\n\nRnwWeave: Sweave\nLaTeX: pdfLaTeX\n\nAutoAppendNewline: Yes\nStripTrailingWhitespace: Yes\n\nBuildType: Package\nPackageUseDevtools: Yes\nPackageInstallArgs: --no-multiarch --with-keep.source\nPackageRoxygenize: rd,namespace\n"
  },
  {
    "path": "tests/spelling.R",
    "content": "spelling::spell_check_test(vignettes = TRUE, error = FALSE)\n"
  },
  {
    "path": "tools/test.cpp",
    "content": "#include <tesseract/baseapi.h>\n#include <allheaders.h>\n"
  },
  {
    "path": "tools/winlibs.R",
    "content": "if(!file.exists('tesseract.o') && !file.exists(\"../.deps/tesseract/include/tesseract/baseapi.h\")){\n  unlink(\"../.deps\", recursive = TRUE)\n  url <- if(grepl(\"aarch\", R.version$platform)){\n    \"https://github.com/r-windows/bundles/releases/download/tesseract-5.3.2/tesseract-ocr-5.3.2-clang-aarch64.tar.xz\"\n  } else if(grepl(\"clang\", Sys.getenv('R_COMPILED_BY'))){\n    \"https://github.com/r-windows/bundles/releases/download/tesseract-5.3.2/tesseract-ocr-5.3.2-clang-x86_64.tar.xz\"\n  } else if(getRversion() >= \"4.3\") {\n    \"https://github.com/r-windows/bundles/releases/download/tesseract-5.3.2/tesseract-ocr-5.3.2-ucrt-x86_64.tar.xz\"\n  } else {\n    \"https://github.com/rwinlib/tesseract/archive/v5.3.2.tar.gz\"\n  }\n  download.file(url, basename(url), quiet = TRUE)\n  dir.create(\"../.deps\", showWarnings = FALSE)\n  untar(basename(url), exdir = \"../.deps\", tar = 'internal')\n  unlink(basename(url))\n  setwd(\"../.deps\")\n  file.rename(list.files(), 'tesseract')\n\n  # Copy training data\n  file.copy('tesseract/share/tessdata', '../inst/', recursive = TRUE)\n  download.file(\"https://github.com/tesseract-ocr/tessdata_fast/raw/4.1.0/eng.traineddata\",\n                \"../inst/tessdata/eng.traineddata\", mode = \"wb\", quiet = TRUE)\n  download.file(\"https://github.com/tesseract-ocr/tessdata_fast/raw/4.1.0/osd.traineddata\",\n                \"../inst/tessdata/osd.traineddata\", mode = \"wb\", quiet = TRUE)\n  invisible()\n}\n"
  },
  {
    "path": "vignettes/intro.Rmd",
    "content": "---\ntitle: \"Using the Tesseract OCR engine in R\"\ndate: \"`r Sys.Date()`\"\noutput:\n  html_document:\n    toc: true\n    toc_depth: 2\n    toc_float: true\n    fig_caption: false\nvignette: >\n  %\\VignetteIndexEntry{Using the Tesseract OCR engine in R}\n  %\\VignetteEngine{knitr::rmarkdown}\n  %\\VignetteEncoding{UTF-8}\n---\n\n\n```{r, echo = FALSE, message = FALSE}\nlibrary(tibble)\n#knitr::opts_chunk$set(comment = \"\")\nhas_nld <- \"nld\" %in% tesseract::tesseract_info()$available\nif(identical(Sys.info()[['user']], 'jeroen')) stopifnot(has_nld)\nif(grepl('tesseract.Rcheck', getwd())){\n  Sys.sleep(10) #workaround for CPU time check\n}\n```\n\nThe tesseract package provides R bindings [Tesseract](https://github.com/tesseract-ocr/tesseract): a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results.\n\nKeep in mind that OCR (pattern recognition in general) is a very difficult problem for computers. Results will rarely be perfect and the accuracy rapidly decreases with the quality of the input image. But if you can get your input images to reasonable quality, Tesseract can often help to extract most of the text from the image.\n\n## Extract Text from Images\n\nOCR is the process of finding and recognizing text inside images, for example from a screenshot, scanned paper. The image below has some example text:\n\n![test](https://jeroen.github.io/images/testocr.png){data-external=1}\n\n```{r}\nlibrary(tesseract)\neng <- tesseract(\"eng\")\ntext <- tesseract::ocr(\"http://jeroen.github.io/images/testocr.png\", engine = eng)\ncat(text)\n```\n\nNot bad! 
The `ocr_data()` function returns all words in the image along with a bounding box and confidence rate.\n\n```{r}\nresults <- tesseract::ocr_data(\"http://jeroen.github.io/images/testocr.png\", engine = eng)\nresults\n```\n\n## Language Data\n\nThe tesseract OCR engine uses language-specific training data to recognize words. The OCR algorithms bias towards words and sentences that frequently appear together in a given language, just like the human brain does. Therefore the most accurate results will be obtained when using training data in the correct language. \n\nUse `tesseract_info()` to list the languages that you currently have installed.\n\n```{r}\ntesseract_info()\n```\n\nBy default the R package only includes English training data. Windows and Mac users can install additional training data using `tesseract_download()`. Let's OCR a screenshot from Wikipedia in Dutch (Nederlands):\n\n[![utrecht](https://jeroen.github.io/images/utrecht2.png)](https://nl.wikipedia.org/wiki/Geschiedenis_van_de_stad_Utrecht)\n\n```{r, eval=FALSE}\n# Only need to download once:\ntesseract_download(\"nld\")\n```\n\n```{r eval = has_nld}\n# Now load the dictionary\n(dutch <- tesseract(\"nld\"))\ntext <- ocr(\"https://jeroen.github.io/images/utrecht2.png\", engine = dutch)\ncat(text)\n```\n\nAs you can see immediately: almost perfect! (OK, just take my word for it). \n\n\n## Preprocessing with Magick\n\nThe accuracy of the OCR process depends on the quality of the input image. You can often improve results by properly scaling the image, removing noise and artifacts or cropping the area where the text exists. See [tesseract wiki: improve quality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) for important tips to improve the quality of your input image.\n\nThe awesome [magick](https://cran.r-project.org/package=magick/vignettes/intro.html) R package has many useful functions that can be used for enhancing the quality of the image. 
Some things to try:\n\n - If your image is skewed, use `image_deskew()` and `image_rotate()` to make the text horizontal.\n - `image_trim()` crops out whitespace in the margins. Increase the `fuzz` parameter to make it work for noisy whitespace.\n - Use `image_convert()` to turn the image into greyscale, which can reduce artifacts and enhance actual text.\n - If your image is very large or small, resizing with `image_resize()` can help tesseract determine text size.\n - Use `image_modulate()` or `image_contrast()` to tweak brightness / contrast if this is an issue.\n - Try `image_reducenoise()` for automated noise removal. Your mileage may vary.\n - With `image_quantize()` you can reduce the number of colors in the image. This can sometimes help with increasing contrast and reducing artifacts.\n - True imaging ninjas can use `image_convolve()` to apply custom [convolution methods](https://ropensci.org/technotes/2017/11/02/image-convolve/). \n\nBelow is an example OCR scan. The code resizes the image, converts it to greyscale and trims the margins before feeding it to tesseract to get more accurate OCR results.\n\n![bowers](https://jeroen.github.io/images/bowers.jpg){data-external=1}\n\n\n```{r}\nlibrary(magick)\ninput <- image_read(\"https://jeroen.github.io/images/bowers.jpg\")\n\ntext <- input %>%\n  image_resize(\"2000x\") %>%\n  image_convert(type = 'Grayscale') %>%\n  image_trim(fuzz = 40) %>%\n  image_write(format = 'png', density = '300x300') %>%\n  tesseract::ocr() \n\ncat(text)\n```\n\n\n## Read from PDF files\n\nIf your images are stored in PDF files they first need to be converted to a proper image format. We can do this in R using the `pdf_convert()` function from the pdftools package. 
Use a high DPI to preserve the quality of the image.\n\n```{r, eval=require(pdftools)}\npngfile <- pdftools::pdf_convert('https://jeroen.github.io/images/ocrscan.pdf', dpi = 600)\ntext <- tesseract::ocr(pngfile)\ncat(text)\n```\n\n\n## Tesseract Control Parameters\n\nTesseract supports hundreds of \"control parameters\" which alter the behavior of the OCR engine. Use `tesseract_params()` to list all parameters with their default value and a brief description. It also has a handy `filter` argument to quickly find parameters that match a particular string.\n\n```{r}\n# List all parameters with *colour* in name or description\ntesseract_params('colour')\n```\n\nDo note that some of the control parameters have changed between Tesseract engine 3 and 4.\n\n```{r}\ntesseract::tesseract_info()['version']\n```\n\n### Whitelist / Blacklist characters\n\nOne powerful parameter is `tessedit_char_whitelist` which restricts the output to a limited set of characters. This may be useful for reading numbers, for example a bank account, zip code, or gas meter.\n\nThe whitelist parameter works for all versions of Tesseract engine 3 and also engine versions 4.1 and higher, but unfortunately it did not work in Tesseract 4.0.\n\n\n![receipt](https://jeroen.github.io/images/receipt.png){data-external=1}\n\n```{r}\nnumbers <- tesseract(options = list(tessedit_char_whitelist = \"$.0123456789\"))\ncat(ocr(\"https://jeroen.github.io/images/receipt.png\", engine = numbers))\n```\n\nTo test if this actually works, look at what happens if we remove the `$` from `tessedit_char_whitelist`:\n\n```{r}\n# Do not allow any dollar sign \nnumbers2 <- tesseract(options = list(tessedit_char_whitelist = \".0123456789\"))\ncat(ocr(\"https://jeroen.github.io/images/receipt.png\", engine = numbers2))\n```\n\n"
  }
]