Repository: ropensci/tesseract Branch: master Commit: eb79775ec4fd Files: 32 Total size: 60.3 KB Directory structure: gitextract_z8kqhkl4/ ├── .Rbuildignore ├── .github/ │ ├── .gitignore │ └── workflows/ │ └── R-CMD-check.yaml ├── .gitignore ├── DESCRIPTION ├── NAMESPACE ├── NEWS ├── R/ │ ├── RcppExports.R │ ├── ocr.R │ ├── onload.R │ ├── tessdata.R │ └── tesseract.R ├── README.md ├── cleanup ├── configure ├── configure.win ├── inst/ │ ├── AUTHORS │ ├── COPYRIGHT │ └── WORDLIST ├── man/ │ ├── ocr.Rd │ ├── tessdata.Rd │ └── tesseract.Rd ├── src/ │ ├── Makevars.in │ ├── Makevars.win │ ├── RcppExports.cpp │ ├── tesseract.cpp │ └── tesseract_types.h ├── tesseract.Rproj ├── tests/ │ └── spelling.R ├── tools/ │ ├── test.cpp │ └── winlibs.R └── vignettes/ └── intro.Rmd ================================================ FILE CONTENTS ================================================ ================================================ FILE: .Rbuildignore ================================================ ^.*\.Rproj$ ^\.Rproj\.user$ ^src/Makevars$ ^windows \.pdf$ \.png$ \.webp$ \.jpeg$ \.o$ \.dll$ ^\.travis\.yml$ ^appveyor\.yml$ ^README.md$ vignettes/.*\.png$ ^configure.log$ ^\.github$ ^\.deps$ ================================================ FILE: .github/.gitignore ================================================ *.html ================================================ FILE: .github/workflows/R-CMD-check.yaml ================================================ # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples # Need help debugging build failures? 
Start at https://github.com/r-lib/actions#where-to-find-help on: push: pull_request: name: R-CMD-check.yaml permissions: read-all jobs: R-CMD-check: runs-on: ${{ matrix.config.os }} name: ${{ matrix.config.os }} (${{ matrix.config.r }}) strategy: fail-fast: false matrix: config: - {os: macos-15-intel, r: 'release'} - {os: macos-latest, r: 'next'} - {os: windows-latest , r: '4.1'} - {os: windows-latest , r: '4.2'} - {os: windows-latest , r: '4.3'} - {os: windows-latest , r: '4.4'} - {os: windows-latest , r: 'devel'} - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'} - {os: ubuntu-latest, r: 'release'} - {os: ubuntu-latest, r: 'oldrel-4'} env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} R_KEEP_PKG_SOURCE: yes steps: - uses: actions/checkout@v4 - uses: r-lib/actions/setup-pandoc@v2 - uses: r-lib/actions/setup-r@v2 with: r-version: ${{ matrix.config.r }} http-user-agent: ${{ matrix.config.http-user-agent }} use-public-rspm: true - uses: r-lib/actions/setup-r-dependencies@v2 with: extra-packages: any::rcmdcheck needs: check - uses: r-lib/actions/check-r-package@v2 env: MAKEFLAGS: -j4 ================================================ FILE: .gitignore ================================================ *.o *.so *.dll *.a *.txt *.pdf *.png *.webp *.jpeg .Rproj.user .Rhistory inst/tessdata windows src/Makevars configure.log ================================================ FILE: DESCRIPTION ================================================ Package: tesseract Type: Package Title: Open Source OCR Engine Version: 5.2.5 Authors@R: person("Jeroen", "Ooms", role = c("aut", "cre"), email = "jeroenooms@gmail.com", comment = c(ORCID = "0000-0002-4035-0289")) Description: Bindings to 'Tesseract': a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results. 
License: Apache License 2.0 URL: https://docs.ropensci.org/tesseract/ https://ropensci.r-universe.dev/tesseract BugReports: https://github.com/ropensci/tesseract/issues SystemRequirements: Tesseract >= 3.03 (libtesseract-dev / tesseract-devel) and Leptonica (libleptonica-dev / leptonica-devel). On Debian you need to install the English training data separately (tesseract-ocr-eng) Imports: Rcpp (>= 0.12.12), pdftools (>= 1.5), curl, rappdirs, digest LinkingTo: Rcpp RoxygenNote: 7.3.3 Roxygen: list(markdown = TRUE) Suggests: magick (>= 1.7), spelling, knitr, tibble, rmarkdown Encoding: UTF-8 VignetteBuilder: knitr Language: en-US ================================================ FILE: NAMESPACE ================================================ # Generated by roxygen2: do not edit by hand S3method(print,tesseract) export(ocr) export(ocr_data) export(tesseract) export(tesseract_download) export(tesseract_info) export(tesseract_params) importFrom(Rcpp,sourceCpp) useDynLib(tesseract) ================================================ FILE: NEWS ================================================ 5.2.5 - Wrap examples in donttest for cran policies 5.2.4 - Do not use CXX11 anymore in configure script (fixes R-4.6) 5.2.1 - Fix shell script for cross compilation 5.2.0 - Windows: update to tesseract 5.3.2 5.1.0 - Win: update to tesseract 5.1.0. - Win: apply patch for freezes when running under UTF-8 in R-4.2. See: https://github.com/tesseract-ocr/tesseract/issues/3830 5.0.0 - Win/Mac: update to libtesseract 5.0.1 - Remove locale workaround on libtesseract 4.1+ (should only be needed for 4.0) - Remove cruft that was needed to support Solaris 4.2.0 - Prepare for API changes in upcoming Tesseract 5 release - Change the default language="eng" in tesseract() 4.1.2 - Fix for upstream master/main renames in language repos 4.1.1 - Win/Mac: update to libtesseract 4.1.1 4.1 - Fix memory leak in ocr_data() - Windows / MacOS: update to libtesseract 4.1.0. 
This re-enables the whitelist/blacklist options that were missing in Tesseract 4.0 4.0 - Windows, MacOS: Upgrade to upstream Tesseract 4.0! Completely new OCR engine. - Tesseract 4 has a new training data format. On Windows / MacOS you need to re-download your language data with tesseract_download(). The package uses separate directories for storing Tesseract 3 vs 4 data so they shouldn't get mixed up (hopefully). - Drop hard-dependency on tibble (only load if available) 2.3 - Fix problem with setlocale() not properly restoring locale. - Switch examples from dontrun{} to donttest{}, and '--run-donttest' on travis/appveyor 2.2 - Fixes for breaking changes in Tesseract 4.0.0 beta.3 - Set LC_ALL = C when initiating tesseract - Include to support Tesseract 4 2.1 - Fixes for 4.0.0-beta.1: they switched to semver + other data branch - Set LC_CTYPE to "C" when loading training data (required for some asian languages) - Add back OSD training data on Windows 2.0 - Set tesseract parameters at init so that all parameters types now actually work! 
  - New function tesseract_params() lists all supported parameters and their defaults
  - Added 'config' argument to tesseract() which specifies a file with parameter values
  - Internally validate parameter names before init to prevent tesseract crashes
  - Rewrite the ocr_data() function in C++ to make it much faster
  - Tesseract 4 now gets data from the tessdata_fast repo as recommended upstream
  - Use default resolution of 300dpi when image does not contain resolution info

1.9
  - Tesseract 4 now downloads training data from the "tessdata_fast" repo
  - Add ocr_data() function that parses the hOCR output

1.8
  - Add support for HOCR output (#20)
  - Remove 'script' and 'orientation' attributes in output (doesn't work in Tesseract 4)

1.7 (internal)
  - Add support for upcoming Tesseract 4 (compiler fix + separate tessdata dir)
  - Configure script now explicitly tests for CXX11 (required by Tesseract 4)

1.6
  - Windows: update libtesseract to 3.05.01
  - tesseract_download now uses 3.04 tree (instead of 4.00) as suggested in readme
  - For static packages on Win/Mac, languages stored in: rappdirs::user_data_dir('tesseract')
  - Use 'png' instead of 'tiff' to read magick images
  - Compile with $(C_VISIBILITY) to hide internal symbols (requires Rcpp 0.12.12)
  - Use Rcpp symbol registration

1.4
  - Run engine finalizer on R exit (requires Rcpp 0.12.10)
  - Move autobrew script to separate repository
  - Add symbol registration

1.3
  - tesseract() gains an 'options' parameter for setting engine variables
  - New tesseract_download() function for installing training data on Win/Mac
  - Initiate default tesseract engine onAttach() to fail for missing training data
  - Add support for ocr() on magick images

1.2
  - Try to fix build for CRAN OS-X, again.

1.1 - Try to fix build for CRAN OS-X build server - Show 'loaded' and 'available' languages in print.tesseract() 1.0 - Initial CRAN release ================================================ FILE: R/RcppExports.R ================================================ # Generated by using Rcpp::compileAttributes() -> do not edit by hand # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 tesseract_config <- function() { .Call('_tesseract_tesseract_config', PACKAGE = 'tesseract') } tesseract_engine_internal <- function(datapath, language, confpaths, opt_names, opt_values) { .Call('_tesseract_tesseract_engine_internal', PACKAGE = 'tesseract', datapath, language, confpaths, opt_names, opt_values) } tesseract_engine_set_variable <- function(ptr, name, value) { .Call('_tesseract_tesseract_engine_set_variable', PACKAGE = 'tesseract', ptr, name, value) } validate_params <- function(params) { .Call('_tesseract_validate_params', PACKAGE = 'tesseract', params) } engine_info_internal <- function(ptr) { .Call('_tesseract_engine_info_internal', PACKAGE = 'tesseract', ptr) } print_params <- function(filename) { .Call('_tesseract_print_params', PACKAGE = 'tesseract', filename) } get_param_values <- function(ptr, params) { .Call('_tesseract_get_param_values', PACKAGE = 'tesseract', ptr, params) } ocr_raw <- function(input, ptr, HOCR = FALSE) { .Call('_tesseract_ocr_raw', PACKAGE = 'tesseract', input, ptr, HOCR) } ocr_file <- function(file, ptr, HOCR = FALSE) { .Call('_tesseract_ocr_file', PACKAGE = 'tesseract', file, ptr, HOCR) } ocr_raw_data <- function(input, ptr) { .Call('_tesseract_ocr_raw_data', PACKAGE = 'tesseract', input, ptr) } ocr_file_data <- function(file, ptr) { .Call('_tesseract_ocr_file_data', PACKAGE = 'tesseract', file, ptr) } ================================================ FILE: R/ocr.R ================================================ #' Tesseract OCR #' #' Extract text from an image. Requires that you have training data for the language you #' are reading. 
Works best for images with high contrast, little noise and horizontal text. #' See [tesseract wiki](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) and #' our package vignette for image preprocessing tips. #' #' The `ocr()` function returns plain text by default, or hOCR text if hOCR is set to `TRUE`. #' The `ocr_data()` function returns a data frame with a confidence rate and bounding box for #' each word in the text. #' #' @export #' @useDynLib tesseract #' @family tesseract #' @param image file path, url, or raw vector to image (png, tiff, jpeg, etc) #' @param engine a tesseract engine created with [tesseract()]. Alternatively a #' language string which will be passed to [tesseract()]. #' @param HOCR if `TRUE` return results as HOCR xml instead of plain text #' @rdname ocr #' @references [Tesseract: Improving Quality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) #' @importFrom Rcpp sourceCpp #' @examples \donttest{ #' text <- ocr("https://jeroen.github.io/images/testocr.png") #' cat(text) #' #' xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE) #' cat(xml) #' #' df <- ocr_data("https://jeroen.github.io/images/testocr.png") #' print(df) #' #' # Full roundtrip test: render PDF to image and OCR it back to text #' curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf") #' orig <- pdftools::pdf_text("R-intro.pdf")[1] #' #' # Render pdf to png image #' img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400) #' unlink("R-intro.pdf") #' #' # Extract text from png image #' text <- ocr(img_file) #' unlink(img_file) #' cat(text) #' } #' #' engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789")) ocr <- function(image, engine = tesseract("eng"), HOCR = FALSE) { if(is.character(engine)) engine <- tesseract(engine) stopifnot(inherits(engine, "tesseract")) if(inherits(image, "magick-image")){ vapply(image, function(x){ tmp <- 
tempfile(fileext = ".png") on.exit(unlink(tmp)) magick::image_write(x, tmp, format = 'PNG', density = '300x300') ocr(tmp, engine = engine, HOCR = HOCR) }, character(1)) } else if(is.character(image)){ image <- download_files(image) vapply(image, ocr_file, character(1), ptr = engine, HOCR = HOCR, USE.NAMES = FALSE) } else if(is.raw(image)){ ocr_raw(image, engine, HOCR = HOCR) } else { stop("Argument 'image' must be file-path, url or raw vector") } } #' @rdname ocr #' @export ocr_data <- function(image, engine = tesseract("eng")) { if(is.character(engine)) engine <- tesseract(engine) stopifnot(inherits(engine, "tesseract")) df_list <- if(inherits(image, "magick-image")){ lapply(image, function(x){ tmp <- tempfile(fileext = ".png") on.exit(unlink(tmp)) magick::image_write(x, tmp, format = 'PNG', density = '300x300') ocr_data(tmp, engine = engine) }) } else if(is.character(image)){ image <- download_files(image) lapply(image, function(im){ ocr_file_data(im, ptr = engine) }) } else if(is.raw(image)){ list(ocr_raw_data(image, engine)) } else { stop("Argument 'image' must be file-path, url or raw vector") } df_as_tibble(do.call(rbind.data.frame, unname(df_list))) } ================================================ FILE: R/onload.R ================================================ .onLoad <- function(lib, pkg){ pkgdir <- file.path(lib, pkg) version <- tesseract_version_major() appname <- ifelse(version < 4, "tesseract", paste0("tesseract", version)) sysdir <- rappdirs::user_data_dir(appname) pkgdata <- normalizePath(file.path(pkgdir, "tessdata"), mustWork = FALSE) sysdata <- normalizePath(file.path(sysdir, "tessdata"), mustWork = FALSE) if(!is_testload() && file.exists(pkgdata) && !file.exists(file.path(sysdata, "eng.traineddata"))){ dir.create(sysdir, showWarnings = FALSE, recursive = TRUE) if(file.exists(sysdir)){ onload_notify() olddir <- getwd() on.exit(setwd(olddir)) setwd(pkgdir) file.copy("tessdata", sysdir, recursive = TRUE) } } if(is.na(Sys.getenv("TESSDATA_PREFIX", 
NA))){
    if(file.exists(file.path(sysdata, "eng.traineddata"))){
      Sys.setenv(TESSDATA_PREFIX = sysdata)
    } else if(file.exists(file.path(pkgdata, "eng.traineddata"))){
      Sys.setenv(TESSDATA_PREFIX = pkgdata)
    }
  }
  if(grepl('tesseract.Rcheck', getwd(), fixed = TRUE)){
    Sys.setenv(OMP_THREAD_LIMIT=2)
    Sys.setenv(OMP_NUM_THREADS=2)
  }
}

tesseract_version_major <- function(){
  as.numeric(substring(tesseract_config()$version, 1, 1))
}

onload_notify <- function(){
  message("First use of Tesseract: copying language data...\n")
}

is_testload <- function(){
  as.logical(nchar(Sys.getenv("R_INSTALL_PKG")))
}

.onUnload <- function(lib){
  Sys.unsetenv("TESSDATA_PREFIX")
}

.onAttach <- function(lib, pkg){
  check_training_data()

  # Load tibble (if available) for pretty printing
  if(interactive() && is.null(.getNamespace('tibble'))){
    tryCatch({
      getNamespace('tibble')
    }, error= function(e){})
  }
}

check_training_data <- function(){
  tryCatch(tesseract(), error = function(e){
    warning("Unable to find English training data", call. = FALSE)
    os <- utils::sessionInfo()$running
    if(isTRUE(grepl("ubuntu|debian", os, TRUE))){
      stop("DEBIAN / UBUNTU: Please run: apt-get install tesseract-ocr-eng")
    }
  })
}

================================================
FILE: R/tessdata.R
================================================
#' Tesseract Training Data
#'
#' Helper function to download training data from the official
#' [tessdata](https://tesseract-ocr.github.io/tessdoc/Data-Files) repository. On Linux, the fast training data can be installed directly with
#' [yum](https://src.fedoraproject.org/rpms/tesseract) or
#' [apt-get](https://packages.debian.org/search?suite=stable&section=all&arch=any&searchon=names&keywords=tesseract-ocr-).
#'
#' Tesseract uses training data to perform OCR. Most systems default to English
#' training data. To improve OCR performance for other languages you can install the
#' training data from your distribution. For example, to install the Spanish training data:
#'
#' - [tesseract-ocr-spa](https://packages.debian.org/testing/tesseract-ocr-spa) (Debian, Ubuntu)
#' - `tesseract-langpack-spa` (Fedora, EPEL)
#'
#' On Windows and MacOS you can install languages using the [tesseract_download] function
#' which downloads training data directly from [github](https://github.com/tesseract-ocr/tessdata)
#' and stores it in the path on disk given by the `TESSDATA_PREFIX` variable.
#'
#' @export
#' @aliases tessdata
#' @rdname tessdata
#' @family tesseract
#' @param lang three letter code for language, see [tessdata](https://github.com/tesseract-ocr/tessdata) repository.
#' @param datapath destination directory where to download and store the file
#' @param model either `fast` or `best` is currently supported. The latter downloads
#' more accurate (but slower) trained models for Tesseract 4.0 or higher
#' @param progress print progress while downloading
#' @references [tesseract wiki: training data](https://tesseract-ocr.github.io/tessdoc/Data-Files)
#' @examples \dontrun{
#' if(is.na(match("fra", tesseract_info()$available)))
#'   tesseract_download("fra", model = 'best')
#' french <- tesseract("fra")
#' text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
#' cat(text)
#' }
tesseract_download <- function(lang, datapath = NULL, model = c("fast", "best"), progress = interactive()) {
  stopifnot(is.character(lang))
  model <- match.arg(model)
  if(!length(datapath)){
    warn_on_linux()
    datapath <- tesseract_info()$datapath
  }
  datapath <- normalizePath(datapath, mustWork = TRUE)
  version <- tesseract_version_major()
  if(version < 4){
    repo <- "tessdata"
    release <- "3.04.00"
  } else {
    repo <- paste0("tessdata_", model)
    release <- "4.1.0"
  }
  url <- sprintf("https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata", repo, release, lang)
  destfile <- file.path(datapath, basename(url))
  if (file.exists(destfile)) {
    message(paste("Training data already exists. Overwriting", destfile))
  }
  req <- curl::curl_fetch_memory(url, curl::new_handle(
    progressfunction = progress_fun,
    noprogress = !isTRUE(progress)
  ))
  if(progress)
    cat("\n")
  if(req$status_code != 200)
    stop("Download failed: HTTP ", req$status_code, call. = FALSE)
  writeBin(req$content, destfile)
  return(destfile)
}

progress_fun <- function(down, up) {
  total <- down[[1]]
  now <- down[[2]]
  pct <- if(length(total) && total > 0){
    paste0("(", round(now/total * 100), "%)")
  } else {
    ""
  }
  if(now > 10000)
    cat("\r Downloaded:", sprintf("%.2f", now / 2^20), "MB ", pct)
  TRUE
}

warn_on_linux <- function(){
  if(identical(.Platform$OS.type, "unix") && !identical(Sys.info()[["sysname"]], "Darwin")){
    warning("On Linux you should install training data via yum/apt. Please check the manual page.", call. = FALSE)
  }
}

================================================
FILE: R/tesseract.R
================================================
#' Tesseract Engine
#'
#' Create an OCR engine for a given language and control parameters. This can be used by
#' the [ocr] and [ocr_data] functions to recognize text.
#'
#' Tesseract control parameters can be set either via a named list in the
#' `options` parameter, or in a `configs` text file which contains the parameter name
#' followed by a space and then the value, one per line. Use [tesseract_params()] to list
#' or find parameters. Note that some parameters are only supported in certain versions
#' of libtesseract, and that invalid parameters can sometimes cause libtesseract to crash.
#'
#' @export
#' @rdname tesseract
#' @family tesseract
#' @param language string with language for training data. Usually defaults to `eng`
#' @param datapath path with the training data for this language. Default uses
#' the system library.
#' @param configs character vector with files, each containing one or more parameter
#' values.
These config files can exist in the current directory or one of the standard #' tesseract config files that live in the tessdata directory. See details. #' @param options a named list with tesseract parameters. See details. #' @param cache speed things up by caching engines tesseract <- local({ store <- new.env() function(language = "eng", datapath = NULL, configs = NULL, options = NULL, cache = TRUE){ datapath <- normalizePath(as.character(datapath), mustWork = TRUE) language <- as.character(language) configs <- as.character(configs) options <- as.list(options) if(isTRUE(cache)){ key <- digest::digest(list(language, datapath, configs, options)) if(is.null(store[[key]])){ ptr <- tesseract_engine(datapath, language, configs, options) assign(key, ptr, store); } store[[key]] } else { tesseract_engine(datapath, language, configs, options) } } }) #' @export #' @rdname tesseract #' @param filter only list parameters containing a particular string #' @examples tesseract_params('debug') tesseract_params <- function(filter = ""){ tmp <- print_params(tempfile()) on.exit(unlink(tmp)) df <- parse_params(tmp) subset <- grepl(filter, paste(df$param, df$desc), ignore.case = TRUE) df_as_tibble(df[subset,]) } #' @export #' @rdname tesseract tesseract_info <- function(){ info <- engine_info_internal(tesseract()) config <- tesseract_config() list(datapath = info$datapath, available = info$available, version = config$version, configs = list.files(file.path(info$datapath, "configs"))) } parse_params <- function(path){ utils::read.delim(path, header = FALSE, quote = "", col.names = c("param", "default", "desc"), stringsAsFactors = FALSE) } tesseract_engine <- function(datapath, language, configs, options){ # Tesseract::read_config_file first checks for local file, then in tessdata lapply(configs, function(confpath){ if(file.exists(confpath)){ params <- tryCatch(utils::read.table(confpath, quote = ""), error = function(e){ bail("Failed to parse config file '%s': %s", confpath, e$message) 
}) ok <- validate_params(params$V1) if(any(!ok)) bail("Unsupported Tesseract parameter(s): [%s] in %s", paste(params$V1[!ok], collapse = ", "), confpath) } }) opt_names <- as.character(names(options)) opt_values <- as.character(options) ok <- validate_params(opt_names) if(any(!ok)) bail("Unsupported Tesseract parameter(s): [%s]", paste(opt_names[!ok], collapse = ", ")) tesseract_engine_internal(datapath, language, configs, opt_names, opt_values) } download_files <- function(urls){ files <- vapply(urls, function(path){ if(grepl("^https?://", path)){ tmp <- tempfile(fileext = basename(path)) curl::curl_download(path, tmp) path <- tmp } normalizePath(path, mustWork = TRUE) }, character(1)) is_pdf <- grepl(".pdf$", files) out <- unlist(lapply(files[is_pdf], function(path){ pdftools::pdf_convert(path, dpi = 600) })) c(files[!is_pdf], out) } #' @export "print.tesseract" <- function(x, ...){ info <- engine_info_internal(x) cat("\n") cat(" loaded:", info$loaded, "\n") cat(" datapath:", info$datapath, "\n") cat(" available:", info$available, "\n") } bail <- function(...){ stop(sprintf(...), call. = FALSE) } df_as_tibble <- function(df){ stopifnot(is.data.frame(df)) class(df) <- c("tbl_df", "tbl", "data.frame") df } ================================================ FILE: README.md ================================================ # tesseract > Bindings to [Tesseract-OCR](https://opensource.google/projects/tesseract): a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results. 
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/tesseract)](https://cran.r-project.org/package=tesseract)
[![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/tesseract)](https://cran.r-project.org/package=tesseract)

- Upstream Tesseract-OCR documentation: https://tesseract-ocr.github.io/tessdoc/
- Introduction: https://docs.ropensci.org/tesseract/articles/intro.html
- Reference: https://docs.ropensci.org/tesseract/reference/ocr.html

## Hello World

Simple example

```r
library(tesseract)

# Simple example
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)

# Get XML HOCR output
xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)
```

Roundtrip test: render PDF to image and OCR it back to text

```r
# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]

# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)

# Extract text from the TIFF image
text <- ocr(img_file)
unlink(img_file)
cat(text)
```

## Installation

On Windows and MacOS the binary package can be installed from CRAN:

```r
install.packages("tesseract")
```

Installation from source on Linux or OSX requires the `Tesseract` library (see below).

### Install from source

On __Debian__ or __Ubuntu__ install [libtesseract-dev](https://packages.debian.org/testing/libtesseract-dev) and [libleptonica-dev](https://packages.debian.org/testing/libleptonica-dev). Also install [tesseract-ocr-eng](https://packages.debian.org/testing/tesseract-ocr-eng) to run examples.

```
sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng
```

On __Ubuntu__ you can optionally use [this PPA](https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr-devel) to get the latest version of Tesseract:

```
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt-get install -y libtesseract-dev tesseract-ocr-eng
```

On __Fedora__ we need [tesseract-devel](https://src.fedoraproject.org/rpms/tesseract) and [leptonica-devel](https://src.fedoraproject.org/rpms/leptonica):

```
sudo yum install tesseract-devel leptonica-devel
```

On __RHEL__ and __CentOS__ we need [tesseract-devel](https://src.fedoraproject.org/rpms/tesseract) and [leptonica-devel](https://src.fedoraproject.org/rpms/leptonica) from EPEL:

```
sudo yum install epel-release
sudo yum install tesseract-devel leptonica-devel
```

On __OS-X__ use [tesseract](https://github.com/Homebrew/homebrew-core/blob/master/Formula/tesseract.rb) from Homebrew:

```
brew install tesseract
```

Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR results for other languages you can install the appropriate training data. On Windows and OSX you can do this in R using `tesseract_download()`:

```r
tesseract_download('fra')
```

On Linux you need to install the appropriate training data from your distribution. For example, to install the Spanish training data:

- [tesseract-ocr-spa](https://packages.debian.org/testing/tesseract-ocr-spa) (Debian, Ubuntu)
- [tesseract-langpack-spa](https://src.fedoraproject.org/rpms/tesseract-langpack) (Fedora, EPEL)

Alternatively you can manually download training data from [github](https://github.com/tesseract-ocr/tessdata) and store it in a path on disk that you pass in the `datapath` parameter or set a default path via the `TESSDATA_PREFIX` environment variable. Note that Tesseract 4 and Tesseract 3 use different training data formats.
Make sure to download training data from the branch that matches your libtesseract version. ================================================ FILE: cleanup ================================================ #!/bin/sh rm -f src/Makevars configure.log autobrew ================================================ FILE: configure ================================================ # Anticonf (tm) script by Jeroen Ooms (2022) # This script will query 'pkg-config' for the required cflags and ldflags. # If pkg-config is unavailable or does not find the library, try setting # INCLUDE_DIR and LIB_DIR manually via e.g: # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib' # Library settings PKG_CONFIG_NAME="tesseract" PKG_DEB_NAME="libtesseract-dev libleptonica-dev" PKG_RPM_NAME="tesseract-devel leptonica-devel" PKG_BREW_NAME="tesseract" PKG_TEST_HEADER="" PKG_CFLAGS="-I/usr/include/tesseract -I/usr/include/leptonica" PKG_LIBS="-ltesseract" # Use pkg-config if available pkg-config --version >/dev/null 2>&1 if [ $? -eq 0 ]; then PKGCONFIG_CFLAGS=`pkg-config --cflags --silence-errors ${PKG_CONFIG_NAME}` PKGCONFIG_LIBS=`pkg-config --libs ${PKG_CONFIG_NAME}` fi # Note that cflags may be empty in case of success if [ "$INCLUDE_DIR" ] || [ "$LIB_DIR" ]; then echo "Found INCLUDE_DIR and/or LIB_DIR!" PKG_CFLAGS="-I$INCLUDE_DIR $PKG_CFLAGS" PKG_LIBS="-L$LIB_DIR $PKG_LIBS" elif [ "$PKGCONFIG_CFLAGS" ] || [ "$PKGCONFIG_LIBS" ]; then echo "Found pkg-config cflags and libs!" PKG_CFLAGS=${PKGCONFIG_CFLAGS} PKG_LIBS=${PKGCONFIG_LIBS} elif [ `uname` = "Darwin" ]; then test ! "$CI" && brew --version 2>/dev/null if [ $? -eq 0 ]; then BREWDIR=`brew --prefix` PKG_CFLAGS="-I$BREWDIR/include/tesseract -I$BREWDIR/include/leptonica" PKG_LIBS="-L$BREWDIR/lib $PKG_LIBS" else curl -sfL "https://autobrew.github.io/scripts/tesseract" > autobrew . 
./autobrew fi fi # For debugging echo "Using PKG_CFLAGS=$PKG_CFLAGS" echo "Using PKG_LIBS=$PKG_LIBS" # Find compiler CXX=`${R_HOME}/bin/R CMD config CXX` CPPFLAGS=`${R_HOME}/bin/R CMD config CPPFLAGS` # Test configuration echo "Using CXX: ${CXX}" ${CXX} -E ${CPPFLAGS} ${PKG_CFLAGS} tools/test.cpp >/dev/null 2>configure.log # Customize the error if [ $? -ne 0 ]; then echo "--------------------------- [ANTICONF] --------------------------------" echo "Configuration failed to find '$PKG_CONFIG_NAME' system library. Try installing:" echo " * deb: $PKG_DEB_NAME (Debian, Ubuntu, etc)" echo " * rpm: $PKG_RPM_NAME (Fedora, CentOS, RHEL)" echo " * brew: $PKG_BREW_NAME (Mac OSX)" echo "If $PKG_CONFIG_NAME is already installed, check that 'pkg-config' is in your" echo "PATH and PKG_CONFIG_PATH contains a $PKG_CONFIG_NAME.pc file. If pkg-config" echo "is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:" echo "R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'" echo "-------------------------- [ERROR MESSAGE] ---------------------------" cat configure.log echo "--------------------------------------------------------------------" exit 1 fi # Write to Makevars sed -e "s|@cflags@|$PKG_CFLAGS|" -e "s|@libs@|$PKG_LIBS|" src/Makevars.in > src/Makevars # Success exit 0 ================================================ FILE: configure.win ================================================ ================================================ FILE: inst/AUTHORS ================================================ Authors of upstream tesseract library and training data: Ray Smith (lead developer) Ahmad Abdulkader Rika Antonova Nicholas Beato Jeff Breidenbach Samuel Charron Phil Cheatle Simon Crouch David Eger Sheelagh Huddleston Dan Johnson Rajesh Katikam Thomas Kielbus Dar-Shyang Lee Zongyi (Joe) Liu Robert Moss Chris Newton Michael Reimer Marius Renn Raquel Romano Christy Russon Shobhit Saxena Mark Seaman Faisal Shafait Hiroshi Takenaka Ranjith Unnikrishnan Joern Wanke Ping 
Ping Xiu Andrew Ziem Oscar Zuniga Community Contributors: Zdenko Podobný (Maintainer) Jim Regan (Maintainer) James R Barlow Amit Dovev Martin Ettl Tom Morris Tobias Müller Egor Pugin Sundar M. Vaidya Stefan Weil ================================================ FILE: inst/COPYRIGHT ================================================ The package includes machine-generated training data which is released by Tesseract developers under Apache 2.0 license. Both data and license are available from: https://github.com/tesseract-ocr/tessdata ================================================ FILE: inst/WORDLIST ================================================ config EPEL github greyscale hOCR HOCR https jpeg knitr langpack libtesseract MacOS magick Magick Nederlands ocr opensource pdftools png rmarkdown spanish tessdata toc utrecht VignetteEncoding VignetteEngine VignetteIndexEntry ================================================ FILE: man/ocr.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/ocr.R \name{ocr} \alias{ocr} \alias{ocr_data} \title{Tesseract OCR} \usage{ ocr(image, engine = tesseract("eng"), HOCR = FALSE) ocr_data(image, engine = tesseract("eng")) } \arguments{ \item{image}{file path, url, or raw vector to image (png, tiff, jpeg, etc)} \item{engine}{a tesseract engine created with \code{\link[=tesseract]{tesseract()}}. Alternatively a language string which will be passed to \code{\link[=tesseract]{tesseract()}}.} \item{HOCR}{if \code{TRUE} return results as HOCR xml instead of plain text} } \description{ Extract text from an image. Requires that you have training data for the language you are reading. Works best for images with high contrast, little noise and horizontal text. See \href{https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality}{tesseract wiki} and our package vignette for image preprocessing tips. 
}
\details{
The \code{ocr()} function returns plain text by default, or hOCR text if \code{HOCR} is set to \code{TRUE}. The \code{ocr_data()} function returns a data frame with a confidence rate and bounding box for each word in the text.
}
\examples{
\donttest{
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)

xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)

df <- ocr_data("https://jeroen.github.io/images/testocr.png")
print(df)

# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]

# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)
unlink("R-intro.pdf")

# Extract text from the rendered image
text <- ocr(img_file)
unlink(img_file)
cat(text)
}

engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
}
\references{
\href{https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality}{Tesseract: Improving Quality}
}
\seealso{
Other tesseract:
\code{\link{tesseract}()},
\code{\link{tesseract_download}()}
}
\concept{tesseract}

================================================
FILE: man/tessdata.Rd
================================================
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tessdata.R
\name{tesseract_download}
\alias{tesseract_download}
\alias{tessdata}
\title{Tesseract Training Data}
\usage{
tesseract_download(
  lang,
  datapath = NULL,
  model = c("fast", "best"),
  progress = interactive()
)
}
\arguments{
\item{lang}{three letter code for language, see \href{https://github.com/tesseract-ocr/tessdata}{tessdata} repository.}

\item{datapath}{destination directory in which to store the downloaded file}

\item{model}{either \code{fast} or \code{best} is currently supported.
The latter downloads more accurate (but slower) trained models for Tesseract 4.0 or higher}

\item{progress}{print progress while downloading}
}
\description{
Helper function to download training data from the official
\href{https://tesseract-ocr.github.io/tessdoc/Data-Files}{tessdata} repository. On Linux, the fast training data can be installed directly with
\href{https://src.fedoraproject.org/rpms/tesseract}{yum} or
\href{https://packages.debian.org/search?suite=stable&section=all&arch=any&searchon=names&keywords=tesseract-ocr-}{apt-get}.
}
\details{
Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR performance for other languages you can install the training data from your distribution. For example, to install the Spanish training data:
\itemize{
\item \href{https://packages.debian.org/testing/tesseract-ocr-spa}{tesseract-ocr-spa} (Debian, Ubuntu)
\item \code{tesseract-langpack-spa} (Fedora, EPEL)
}

On Windows and MacOS you can install languages using the \link{tesseract_download} function which downloads training data directly from \href{https://github.com/tesseract-ocr/tessdata}{github} and stores it in the path on disk given by the \code{TESSDATA_PREFIX} variable.
}
\examples{
\dontrun{
if(is.na(match("fra", tesseract_info()$available)))
  tesseract_download("fra", model = 'best')
french <- tesseract("fra")
text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
cat(text)
}
}
\references{
\href{https://tesseract-ocr.github.io/tessdoc/Data-Files}{tesseract wiki: training data}
}
\seealso{
Other tesseract:
\code{\link{ocr}()},
\code{\link{tesseract}()}
}
\concept{tesseract}

================================================
FILE: man/tesseract.Rd
================================================
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tesseract.R
\name{tesseract}
\alias{tesseract}
\alias{tesseract_params}
\alias{tesseract_info}
\title{Tesseract Engine}
\usage{
tesseract(
  language = "eng",
  datapath = NULL,
  configs = NULL,
  options = NULL,
  cache = TRUE
)

tesseract_params(filter = "")

tesseract_info()
}
\arguments{
\item{language}{string with language for training data. Usually defaults to \code{eng}}

\item{datapath}{path with the training data for this language. Default uses the system library.}

\item{configs}{character vector with files, each containing one or more parameter values. These can be files in the current directory or one of the standard tesseract config files that live in the tessdata directory. See details.}

\item{options}{a named list with tesseract parameters. See details.}

\item{cache}{speed things up by caching engines}

\item{filter}{only list parameters containing a particular string}
}
\description{
Create an OCR engine for a given language and control parameters. This can be used by the \link{ocr} and \link{ocr_data} functions to recognize text.
}
\details{
Tesseract control parameters can be set either via a named list in the \code{options} parameter, or in a \code{config} file: a text file that contains each parameter name followed by a space and then the value, one per line.
Use \code{\link[=tesseract_params]{tesseract_params()}} to list or find parameters. Note that some parameters are only supported in certain versions of libtesseract, and that invalid parameters can sometimes cause libtesseract to crash.
}
\examples{
tesseract_params('debug')
}
\seealso{
Other tesseract:
\code{\link{ocr}()},
\code{\link{tesseract_download}()}
}
\concept{tesseract}

================================================
FILE: src/Makevars.in
================================================
PKG_CPPFLAGS=@cflags@
PKG_LIBS=@libs@
PKG_CXXFLAGS=$(CXX_VISIBILITY)

all: $(SHLIB) cleanup

cleanup: $(SHLIB)
	@rm -Rf ../.deps

================================================
FILE: src/Makevars.win
================================================
RWINLIB = ../.deps/tesseract
PKG_CPPFLAGS = -I$(RWINLIB)/include -I$(RWINLIB)/include/leptonica
PKG_LIBS = \
	-L$(RWINLIB)/lib${subst gcc,,${COMPILED_BY}}${R_ARCH} \
	-L$(RWINLIB)/lib \
	-ltesseract -lleptonica \
	-ltiff -lopenjp2 -lwebp -lsharpyuv -ljpeg -lgif -lpng16 -lz \
	-lws2_32

all: $(SHLIB) cleanup

# Needed for parallel make
$(OBJECTS): | $(RWINLIB)

$(RWINLIB):
	@"${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe" "../tools/winlibs.R"

cleanup: $(SHLIB)
	@rm -Rf $(RWINLIB)

================================================
FILE: src/RcppExports.cpp
================================================
// Generated by using Rcpp::compileAttributes() -> do not edit by hand
// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

#include "tesseract_types.h"
#include <Rcpp.h>

using namespace Rcpp;

#ifdef RCPP_USE_GLOBAL_ROSTREAM
Rcpp::Rostream<true>&  Rcpp::Rcout = Rcpp::Rcpp_cout_get();
Rcpp::Rostream<false>& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get();
#endif

// tesseract_config
Rcpp::List tesseract_config();
RcppExport SEXP _tesseract_tesseract_config() {
BEGIN_RCPP
    Rcpp::RObject rcpp_result_gen;
    Rcpp::RNGScope rcpp_rngScope_gen;
    rcpp_result_gen = Rcpp::wrap(tesseract_config());
    return rcpp_result_gen;
END_RCPP
}
// tesseract_engine_internal
TessPtr
tesseract_engine_internal(Rcpp::CharacterVector datapath, Rcpp::CharacterVector language, Rcpp::CharacterVector confpaths, Rcpp::CharacterVector opt_names, Rcpp::CharacterVector opt_values); RcppExport SEXP _tesseract_tesseract_engine_internal(SEXP datapathSEXP, SEXP languageSEXP, SEXP confpathsSEXP, SEXP opt_namesSEXP, SEXP opt_valuesSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type datapath(datapathSEXP); Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type language(languageSEXP); Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type confpaths(confpathsSEXP); Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type opt_names(opt_namesSEXP); Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type opt_values(opt_valuesSEXP); rcpp_result_gen = Rcpp::wrap(tesseract_engine_internal(datapath, language, confpaths, opt_names, opt_values)); return rcpp_result_gen; END_RCPP } // tesseract_engine_set_variable TessPtr tesseract_engine_set_variable(TessPtr ptr, const char * name, const char * value); RcppExport SEXP _tesseract_tesseract_engine_set_variable(SEXP ptrSEXP, SEXP nameSEXP, SEXP valueSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP); Rcpp::traits::input_parameter< const char * >::type name(nameSEXP); Rcpp::traits::input_parameter< const char * >::type value(valueSEXP); rcpp_result_gen = Rcpp::wrap(tesseract_engine_set_variable(ptr, name, value)); return rcpp_result_gen; END_RCPP } // validate_params Rcpp::LogicalVector validate_params(Rcpp::CharacterVector params); RcppExport SEXP _tesseract_validate_params(SEXP paramsSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type params(paramsSEXP); rcpp_result_gen = Rcpp::wrap(validate_params(params)); return 
rcpp_result_gen; END_RCPP } // engine_info_internal Rcpp::List engine_info_internal(TessPtr ptr); RcppExport SEXP _tesseract_engine_info_internal(SEXP ptrSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP); rcpp_result_gen = Rcpp::wrap(engine_info_internal(ptr)); return rcpp_result_gen; END_RCPP } // print_params Rcpp::String print_params(std::string filename); RcppExport SEXP _tesseract_print_params(SEXP filenameSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< std::string >::type filename(filenameSEXP); rcpp_result_gen = Rcpp::wrap(print_params(filename)); return rcpp_result_gen; END_RCPP } // get_param_values Rcpp::CharacterVector get_param_values(TessPtr ptr, Rcpp::CharacterVector params); RcppExport SEXP _tesseract_get_param_values(SEXP ptrSEXP, SEXP paramsSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP); Rcpp::traits::input_parameter< Rcpp::CharacterVector >::type params(paramsSEXP); rcpp_result_gen = Rcpp::wrap(get_param_values(ptr, params)); return rcpp_result_gen; END_RCPP } // ocr_raw Rcpp::String ocr_raw(Rcpp::RawVector input, TessPtr ptr, bool HOCR); RcppExport SEXP _tesseract_ocr_raw(SEXP inputSEXP, SEXP ptrSEXP, SEXP HOCRSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::RawVector >::type input(inputSEXP); Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP); Rcpp::traits::input_parameter< bool >::type HOCR(HOCRSEXP); rcpp_result_gen = Rcpp::wrap(ocr_raw(input, ptr, HOCR)); return rcpp_result_gen; END_RCPP } // ocr_file Rcpp::String ocr_file(std::string file, TessPtr ptr, bool HOCR); RcppExport SEXP _tesseract_ocr_file(SEXP fileSEXP, SEXP ptrSEXP, SEXP HOCRSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope 
rcpp_rngScope_gen; Rcpp::traits::input_parameter< std::string >::type file(fileSEXP); Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP); Rcpp::traits::input_parameter< bool >::type HOCR(HOCRSEXP); rcpp_result_gen = Rcpp::wrap(ocr_file(file, ptr, HOCR)); return rcpp_result_gen; END_RCPP } // ocr_raw_data Rcpp::DataFrame ocr_raw_data(Rcpp::RawVector input, TessPtr ptr); RcppExport SEXP _tesseract_ocr_raw_data(SEXP inputSEXP, SEXP ptrSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::RawVector >::type input(inputSEXP); Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP); rcpp_result_gen = Rcpp::wrap(ocr_raw_data(input, ptr)); return rcpp_result_gen; END_RCPP } // ocr_file_data Rcpp::DataFrame ocr_file_data(std::string file, TessPtr ptr); RcppExport SEXP _tesseract_ocr_file_data(SEXP fileSEXP, SEXP ptrSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< std::string >::type file(fileSEXP); Rcpp::traits::input_parameter< TessPtr >::type ptr(ptrSEXP); rcpp_result_gen = Rcpp::wrap(ocr_file_data(file, ptr)); return rcpp_result_gen; END_RCPP } static const R_CallMethodDef CallEntries[] = { {"_tesseract_tesseract_config", (DL_FUNC) &_tesseract_tesseract_config, 0}, {"_tesseract_tesseract_engine_internal", (DL_FUNC) &_tesseract_tesseract_engine_internal, 5}, {"_tesseract_tesseract_engine_set_variable", (DL_FUNC) &_tesseract_tesseract_engine_set_variable, 3}, {"_tesseract_validate_params", (DL_FUNC) &_tesseract_validate_params, 1}, {"_tesseract_engine_info_internal", (DL_FUNC) &_tesseract_engine_info_internal, 1}, {"_tesseract_print_params", (DL_FUNC) &_tesseract_print_params, 1}, {"_tesseract_get_param_values", (DL_FUNC) &_tesseract_get_param_values, 2}, {"_tesseract_ocr_raw", (DL_FUNC) &_tesseract_ocr_raw, 3}, {"_tesseract_ocr_file", (DL_FUNC) &_tesseract_ocr_file, 3}, {"_tesseract_ocr_raw_data", (DL_FUNC) 
&_tesseract_ocr_raw_data, 2}, {"_tesseract_ocr_file_data", (DL_FUNC) &_tesseract_ocr_file_data, 2}, {NULL, NULL, 0} }; RcppExport void R_init_tesseract(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } ================================================ FILE: src/tesseract.cpp ================================================ #include "tesseract_types.h" #if TESSERACT_MAJOR_VERSION < 5 #include #define getorat get #else #define STRING std::string #define GenericVector std::vector #define getorat at #endif /* libtesseract 4.0 insisted that the engine is initiated in 'C' locale. * We do this as exemplified in the example code in the libc manual: * https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html * Full discussion: https://github.com/tesseract-ocr/tesseract/issues/1670 */ #if TESSERACT_MAJOR_VERSION == 4 && TESSERACT_MINOR_VERSION == 0 #define TESSERACT40 #endif static tesseract::TessBaseAPI *make_analyze_api(){ #ifdef TESSERACT40 char *old_ctype = strdup(setlocale(LC_ALL, NULL)); setlocale(LC_ALL, "C"); #endif tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI(); api->InitForAnalysePage(); #ifdef TESSERACT40 setlocale(LC_ALL, old_ctype); free(old_ctype); #endif return api; } // [[Rcpp::export]] Rcpp::List tesseract_config(){ tesseract::TessBaseAPI *api = make_analyze_api(); Rcpp::List out = Rcpp::List::create( Rcpp::_["version"] = tesseract::TessBaseAPI::Version(), Rcpp::_["path"] = api->GetDatapath() ); api->End(); delete api; return out; } // [[Rcpp::export]] TessPtr tesseract_engine_internal(Rcpp::CharacterVector datapath, Rcpp::CharacterVector language, Rcpp::CharacterVector confpaths, Rcpp::CharacterVector opt_names, Rcpp::CharacterVector opt_values){ GenericVector params, values; const char * path = NULL; const char * lang = NULL; char * configs[1000] = {0}; if(datapath.length()) path = datapath.at(0); if(language.length()) lang = language.at(0); for(int i = 0; i < 
      confpaths.length(); i++)
    configs[i] = confpaths.at(i);
  for(int i = 0; i < opt_names.length(); i++){
    params.push_back(std::string(opt_names.at(i)).c_str());
    values.push_back(std::string(opt_values.at(i)).c_str());
  }
#ifdef TESSERACT40
  char *old_ctype = strdup(setlocale(LC_ALL, NULL));
  setlocale(LC_ALL, "C");
#endif
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  int err = api->Init(path, lang, tesseract::OEM_DEFAULT, configs, confpaths.length(), &params, &values, false);
#ifdef TESSERACT40
  setlocale(LC_ALL, old_ctype);
  free(old_ctype);
#endif
  if(err){
    delete api;
    throw std::runtime_error(std::string("Unable to find training data for: ") + (lang ? lang : "eng") + ". Please consult manual for: ?tesseract_download");
  }
  TessPtr ptr(api);
  ptr.attr("class") = Rcpp::CharacterVector::create("tesseract");
  return ptr;
}

tesseract::TessBaseAPI * get_engine(TessPtr engine){
  tesseract::TessBaseAPI * api = engine.get();
  if(api == NULL)
    throw std::runtime_error("pointer is dead");
  return api;
}

// [[Rcpp::export]]
TessPtr tesseract_engine_set_variable(TessPtr ptr, const char * name, const char * value){
  tesseract::TessBaseAPI * api = get_engine(ptr);
  if(!api->SetVariable(name, value))
    throw std::runtime_error(std::string("Failed to set variable ") + name);
  return ptr;
}

// [[Rcpp::export]]
Rcpp::LogicalVector validate_params(Rcpp::CharacterVector params){
  STRING str;
  tesseract::TessBaseAPI *api = make_analyze_api();
  Rcpp::LogicalVector out(params.length());
  for(int i = 0; i < params.length(); i++)
    out[i] = api->GetVariableAsString(params.at(i), &str);
  api->End();
  delete api;
  return out;
}

// [[Rcpp::export]]
Rcpp::List engine_info_internal(TessPtr ptr){
  tesseract::TessBaseAPI * api = get_engine(ptr);
  GenericVector<STRING> langs;
  api->GetAvailableLanguagesAsVector(&langs);
  Rcpp::CharacterVector available = Rcpp::CharacterVector::create();
  for (size_t i = 0; i < langs.size(); i++)
    available.push_back(langs.getorat(i).c_str());
  langs.clear();
  api->GetLoadedLanguagesAsVector(&langs);
Rcpp::CharacterVector loaded = Rcpp::CharacterVector::create(); for (size_t i = 0; i < langs.size(); i++) loaded.push_back(langs.getorat(i).c_str()); return Rcpp::List::create( Rcpp::_["datapath"] = api->GetDatapath(), Rcpp::_["loaded"] = loaded, Rcpp::_["available"] = available ); } // [[Rcpp::export]] Rcpp::String print_params(std::string filename){ tesseract::TessBaseAPI *api = make_analyze_api(); FILE * fp = fopen(filename.c_str(), "w"); api->PrintVariables(fp); fclose(fp); api->End(); delete api; return filename; } // [[Rcpp::export]] Rcpp::CharacterVector get_param_values(TessPtr ptr, Rcpp::CharacterVector params){ STRING str; tesseract::TessBaseAPI * api = get_engine(ptr); Rcpp::CharacterVector out(params.length()); for(int i = 0; i < params.length(); i++) out[i] = api->GetVariableAsString(params.at(i), &str) ? Rcpp::String(str.c_str()) : NA_STRING; return out; } Rcpp::String ocr_pix(tesseract::TessBaseAPI * api, Pix * image, bool HOCR){ // Get OCR result api->ClearAdaptiveClassifier(); api->SetImage(image); // Workaround for annoying warning, see https://github.com/tesseract-ocr/tesseract/issues/756 if(api->GetSourceYResolution() < 70) api->SetSourceResolution(300); char *outText = HOCR ? 
      api->GetHOCRText(0) : api->GetUTF8Text();

  //cleanup
  pixDestroy(&image);
  api->Clear();

  // Destroy used object and release memory
  Rcpp::String y(outText);
  y.set_encoding(CE_UTF8);
  delete [] outText;
  return y;
}

// [[Rcpp::export]]
Rcpp::String ocr_raw(Rcpp::RawVector input, TessPtr ptr, bool HOCR = false){
  tesseract::TessBaseAPI *api = get_engine(ptr);
  Pix *image = pixReadMem(input.begin(), input.length());
  if(!image)
    throw std::runtime_error("Failed to read image");
  return ocr_pix(api, image, HOCR);
}

// [[Rcpp::export]]
Rcpp::String ocr_file(std::string file, TessPtr ptr, bool HOCR = false){
  tesseract::TessBaseAPI *api = get_engine(ptr);
  Pix *image = pixRead(file.c_str());
  if(!image)
    throw std::runtime_error("Failed to read image");
  return ocr_pix(api, image, HOCR);
}

Rcpp::DataFrame ocr_data_internal(tesseract::TessBaseAPI * api, Pix * image){
  api->ClearAdaptiveClassifier();
  api->SetImage(image);
  if(api->GetSourceYResolution() < 70)
    api->SetSourceResolution(300);
  api->Recognize(0);
  tesseract::ResultIterator* ri = api->GetIterator();
  tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
  size_t n = 0;
  std::list<std::string> words;
  std::list<std::string> bbox;
  std::list<float> conf;
  char buf[100];
  if (ri) {
    do {
      const char * word = ri->GetUTF8Text(level);
      if(!word)
        continue;
      words.push_back(word);
      conf.push_back(ri->Confidence(level));
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      snprintf(buf, 100, "%d,%d,%d,%d", x1, y1, x2, y2);
      bbox.push_back(buf);
      delete[] word;
      n++;
    } while (ri->Next(level));
  }
  Rcpp::CharacterVector rwords(n);
  Rcpp::CharacterVector rbbox(n);
  Rcpp::NumericVector rconf(n);
  for(size_t i = 0; i < n; i++) {
    rwords[i] = words.front();
    words.pop_front();
    rbbox[i] = bbox.front();
    bbox.pop_front();
    rconf[i] = conf.front();
    conf.pop_front();
  }

  //cleanup
  pixDestroy(&image);
  api->Clear();
  delete ri;
  return Rcpp::DataFrame::create(
    Rcpp::_["word"] = rwords,
    Rcpp::_["confidence"] = rconf,
    Rcpp::_["bbox"] = rbbox,
    Rcpp::_["stringsAsFactors"] = false
  );
}

// 
[[Rcpp::export]] Rcpp::DataFrame ocr_raw_data(Rcpp::RawVector input, TessPtr ptr){ tesseract::TessBaseAPI *api = get_engine(ptr); Pix *image = pixReadMem(input.begin(), input.length()); if(!image) throw std::runtime_error("Failed to read image"); return ocr_data_internal(api, image); } // [[Rcpp::export]] Rcpp::DataFrame ocr_file_data(std::string file, TessPtr ptr){ tesseract::TessBaseAPI *api = get_engine(ptr); Pix *image = pixRead(file.c_str()); if(!image) throw std::runtime_error("Failed to read image"); return ocr_data_internal(api, image); } ================================================ FILE: src/tesseract_types.h ================================================ #include #include #define R_NO_REMAP #define STRICT_R_HEADERS #include inline void tess_finalizer(tesseract::TessBaseAPI *engine) { engine->End(); delete engine; } typedef Rcpp::XPtr TessPtr; ================================================ FILE: tesseract.Rproj ================================================ Version: 1.0 ProjectId: 953b2ed1-ac9d-4be8-984d-c26d5c642f38 RestoreWorkspace: Default SaveWorkspace: Default AlwaysSaveHistory: Default EnableCodeIndexing: Yes UseSpacesForTab: Yes NumSpacesForTab: 2 Encoding: UTF-8 RnwWeave: Sweave LaTeX: pdfLaTeX AutoAppendNewline: Yes StripTrailingWhitespace: Yes BuildType: Package PackageUseDevtools: Yes PackageInstallArgs: --no-multiarch --with-keep.source PackageRoxygenize: rd,namespace ================================================ FILE: tests/spelling.R ================================================ spelling::spell_check_test(vignettes = TRUE, error = FALSE) ================================================ FILE: tools/test.cpp ================================================ #include #include ================================================ FILE: tools/winlibs.R ================================================ if(!file.exists('tesseract.o') && !file.exists("../.deps/tesseract/include/tesseract/baseapi.h")){ unlink("../.deps", recursive = TRUE) url 
<- if(grepl("aarch", R.version$platform)){ "https://github.com/r-windows/bundles/releases/download/tesseract-5.3.2/tesseract-ocr-5.3.2-clang-aarch64.tar.xz" } else if(grepl("clang", Sys.getenv('R_COMPILED_BY'))){ "https://github.com/r-windows/bundles/releases/download/tesseract-5.3.2/tesseract-ocr-5.3.2-clang-x86_64.tar.xz" } else if(getRversion() >= "4.3") { "https://github.com/r-windows/bundles/releases/download/tesseract-5.3.2/tesseract-ocr-5.3.2-ucrt-x86_64.tar.xz" } else { "https://github.com/rwinlib/tesseract/archive/v5.3.2.tar.gz" } download.file(url, basename(url), quiet = TRUE) dir.create("../.deps", showWarnings = FALSE) untar(basename(url), exdir = "../.deps", tar = 'internal') unlink(basename(url)) setwd("../.deps") file.rename(list.files(), 'tesseract') # Copy training data file.copy('tesseract/share/tessdata', '../inst/', recursive = TRUE) download.file("https://github.com/tesseract-ocr/tessdata_fast/raw/4.1.0/eng.traineddata", "../inst/tessdata/eng.traineddata", mode = "wb", quiet = TRUE) download.file("https://github.com/tesseract-ocr/tessdata_fast/raw/4.1.0/osd.traineddata", "../inst/tessdata/osd.traineddata", mode = "wb", quiet = TRUE) invisible() } ================================================ FILE: vignettes/intro.Rmd ================================================ --- title: "Using the Tesseract OCR engine in R" date: "`r Sys.Date()`" output: html_document: toc: true toc_depth: 2 toc_float: true fig_caption: false vignette: > %\VignetteIndexEntry{Using the Tesseract OCR engine in R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE} library(tibble) #knitr::opts_chunk$set(comment = "") has_nld <- "nld" %in% tesseract::tesseract_info()$available if(identical(Sys.info()[['user']], 'jeroen')) stopifnot(has_nld) if(grepl('tesseract.Rcheck', getwd())){ Sys.sleep(10) #workaround for CPU time check } ``` The tesseract package provides R bindings 
to [Tesseract](https://github.com/tesseract-ocr/tesseract): a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results.

Keep in mind that OCR (pattern recognition in general) is a very difficult problem for computers. Results will rarely be perfect and accuracy drops rapidly as the quality of the input image decreases. But if you can get your input images to reasonable quality, Tesseract can often help to extract most of the text from the image.

## Extract Text from Images

OCR is the process of finding and recognizing text inside images, for example from a screenshot or scanned paper. The image below has some example text:

![test](https://jeroen.github.io/images/testocr.png){data-external=1}

```{r}
library(tesseract)
eng <- tesseract("eng")
text <- tesseract::ocr("http://jeroen.github.io/images/testocr.png", engine = eng)
cat(text)
```

Not bad! The `ocr_data()` function returns all words in the image along with a bounding box and confidence rate.

```{r}
results <- tesseract::ocr_data("http://jeroen.github.io/images/testocr.png", engine = eng)
results
```

## Language Data

The tesseract OCR engine uses language-specific training data to recognize words. The OCR algorithms bias towards words and sentences that frequently appear together in a given language, just like the human brain does. Therefore the most accurate results will be obtained when using training data in the correct language.

Use `tesseract_info()` to list the languages that you currently have installed.

```{r}
tesseract_info()
```

By default the R package only includes English training data. Windows and Mac users can install additional training data using `tesseract_download()`.
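
To avoid downloading the same training data repeatedly, one option is to compare the languages you need against `tesseract_info()$available` first, much like the example in `?tesseract_download` does. The following is a minimal sketch rather than part of the package: `missing_languages()` and `ensure_languages()` are hypothetical helper names introduced here for illustration.

```r
# Return the language codes in `wanted` that are not installed yet.
# `available` is expected to look like tesseract_info()$available.
missing_languages <- function(wanted, available) {
  setdiff(wanted, available)
}

# Hypothetical wrapper: download only what is missing.
# Calling it requires the tesseract package and network access.
ensure_languages <- function(wanted) {
  todo <- missing_languages(wanted, tesseract::tesseract_info()$available)
  for (lang in todo) {
    tesseract::tesseract_download(lang)
  }
  invisible(todo)
}

# Example call (not run here):
# ensure_languages(c("eng", "nld"))
```

Keeping the comparison in its own small function makes it easy to check without touching the network.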
Let's OCR a screenshot from Wikipedia in Dutch (Nederlands)

[![utrecht](https://jeroen.github.io/images/utrecht2.png)](https://nl.wikipedia.org/wiki/Geschiedenis_van_de_stad_Utrecht)

```{r, eval=FALSE}
# Only need to download once:
tesseract_download("nld")
```

```{r eval = has_nld}
# Now load the dictionary
(dutch <- tesseract("nld"))
text <- ocr("https://jeroen.github.io/images/utrecht2.png", engine = dutch)
cat(text)
```

As you can see immediately: almost perfect! (OK just take my word).

## Preprocessing with Magick

The accuracy of the OCR process depends on the quality of the input image. You can often improve results by properly scaling the image, removing noise and artifacts or cropping the area where the text exists. See [tesseract wiki: improve quality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) for important tips to improve the quality of your input image.

The awesome [magick](https://cran.r-project.org/package=magick/vignettes/intro.html) R package has many useful functions that can be used for enhancing the quality of the image. Some things to try:

- If your image is skewed, use `image_deskew()` and `image_rotate()` to make the text horizontal.
- `image_trim()` crops out whitespace in the margins. Increase the `fuzz` parameter to make it work for noisy whitespace.
- Use `image_convert()` to turn the image into greyscale, which can reduce artifacts and enhance actual text.
- If your image is very large or small, resizing with `image_resize()` can help tesseract determine text size.
- Use `image_modulate()` or `image_contrast()` to tweak brightness / contrast if this is an issue.
- Try `image_reducenoise()` for automated noise removal. Your mileage may vary.
- With `image_quantize()` you can reduce the number of colors in the image. This can sometimes help with increasing contrast and reducing artifacts.
- True imaging ninjas can use `image_convolve()` to use custom [convolution methods](https://ropensci.org/technotes/2017/11/02/image-convolve/).

Below is an example OCR scan. The code resizes the image, converts it to greyscale and trims the margins before feeding it to tesseract to get more accurate OCR results.

![bowers](https://jeroen.github.io/images/bowers.jpg){data-external=1}

```{r}
library(magick)
input <- image_read("https://jeroen.github.io/images/bowers.jpg")

text <- input %>%
  image_resize("2000x") %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr()

cat(text)
```

## Read from PDF files

If your images are stored in PDF files they first need to be converted to a proper image format. We can do this in R using the `pdf_convert()` function from the pdftools package. Use a high DPI to preserve image quality.

```{r, eval=require(pdftools)}
pngfile <- pdftools::pdf_convert('https://jeroen.github.io/images/ocrscan.pdf', dpi = 600)
text <- tesseract::ocr(pngfile)
cat(text)
```

## Tesseract Control Parameters

Tesseract supports hundreds of "control parameters" which alter the OCR engine. Use `tesseract_params()` to list all parameters with their default value and a brief description. It also has a handy `filter` argument to quickly find parameters that match a particular string.

```{r}
# List all parameters with *colour* in name or description
tesseract_params('colour')
```

Do note that some of the control parameters have changed between Tesseract engine 3 and 4.

```{r}
tesseract::tesseract_info()['version']
```

### Whitelist / Blacklist characters

One powerful parameter is `tessedit_char_whitelist` which restricts the output to a limited set of characters. This may be useful for reading, for example, numbers such as a bank account, zip code, or gas meter.
The whitelist parameter works for all versions of Tesseract engine 3 and also engine versions 4.1 and higher, but unfortunately it did not work in Tesseract 4.0.

![receipt](https://jeroen.github.io/images/receipt.png){data-external=1}

```{r}
numbers <- tesseract(options = list(tessedit_char_whitelist = "$.0123456789"))
cat(ocr("https://jeroen.github.io/images/receipt.png", engine = numbers))
```

To test if this actually works, look what happens if we remove the `$` from `tessedit_char_whitelist`:

```{r}
# Do not allow any dollar sign
numbers2 <- tesseract(options = list(tessedit_char_whitelist = ".0123456789"))
cat(ocr("https://jeroen.github.io/images/receipt.png", engine = numbers2))
```
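
The heading also mentions blacklisting, which the examples above do not cover. The following is a minimal sketch of the reverse approach, assuming your build of libtesseract exposes the `tessedit_char_blacklist` parameter (you can check with `tesseract_params('blacklist')`). `ocr_without()` is a hypothetical helper introduced here for illustration, not a function of the package, and the results may differ between Tesseract versions.

```r
# Sketch: OCR an image while forbidding a set of characters.
# Assumption: the linked libtesseract supports `tessedit_char_blacklist`.
# `ocr_without()` is a hypothetical helper, not part of the tesseract package.
ocr_without <- function(image, forbidden) {
  # Collapse a character vector such as c("$", "%") into a single string
  chars <- paste(forbidden, collapse = "")
  engine <- tesseract::tesseract(options = list(tessedit_char_blacklist = chars))
  tesseract::ocr(image, engine = engine)
}

# Example call (needs the tesseract package and network access):
# cat(ocr_without("https://jeroen.github.io/images/receipt.png", "$"))
```

A blacklist can be more convenient than a whitelist when only a few characters cause trouble and the rest of the alphabet should remain allowed.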