Date last run: 15Sep2019
Since I posted about this subject, new versions of the magick and tesseract packages have become available, so I will redo the previous analysis. That is: I want to see if I can use the Tesseract 4 engine with a whitelist.
Prepare for scanning (OCR)
filename = 'uitslag1.png'
img = magick::image_read(filename)
magick::image_info(img)
#> # A tibble: 1 x 7
#> format width height colorspace matte filesize density
#> <chr> <int> <int> <chr> <lgl> <int> <chr>
#> 1 PNG 1145 374 sRGB TRUE 27658 38x38
Do the scan (OCR)
The magick::image_ocr function uses the tesseract package to do the actual scan. According to the vignette: “The package provides R bindings for Tesseract: a powerful optical character recognition (OCR) engine that supports over 100 languages.” In this post I only used the standard language English (‘eng’).
Use Tesseract 4 engine
Triggered by the underscore, I tried to find a way to tell the OCR engine which characters it should recognize. This ‘whitelist’ functionality is available again, so it is no longer necessary to specify tessedit_ocr_engine_mode='0'.
In the previous post we started with magick::image_ocr(img,language='nld') and magick::image_ocr(img,language='eng'), which call the tesseract::ocr function with the default engine (Tesseract 4) and the given language. Now we will use the tesseract functions explicitly. So we define the engine engine4 with the tesseract::tesseract function. We also specify (redundantly) the datapath to the version 4 files.
tesseract4 = "C:\\Users\\Han\\AppData\\Local\\tesseract4\\tesseract4\\tessdata"
whitelist = "abcdefghijklmnopqrtsuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -()',.</"
tess_opts = list(tessedit_char_whitelist = whitelist)
engine4 = tesseract::tesseract(language='eng', datapath = tesseract4, options = tess_opts)
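A whitelist typed by hand can easily miss a character or list one twice. The following is a minimal base R sketch of a sanity check, not part of the original analysis. The whitelist has the same content as the one above (written in pieces only for readability), and the vector `expected` is my own assumption about which characters the document needs.

```r
# Same characters as the whitelist above, pasted together from pieces.
whitelist <- paste0(
  "abcdefghijklmnopqrtsuvwxyz",
  "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
  "0123456789",
  " -()',.<", "/"
)

# Split the string into single characters.
wl_chars <- strsplit(whitelist, "", fixed = TRUE)[[1]]

# Characters we expect to need (an assumption for this sketch).
expected <- c(letters, LETTERS, as.character(0:9), " ", "-", ",")

missing_chars    <- setdiff(expected, wl_chars)       # expected but absent
duplicated_chars <- wl_chars[duplicated(wl_chars)]    # listed more than once

length(missing_chars)      # 0 when nothing is missing
length(duplicated_chars)   # 0 when nothing is repeated
```

The swapped "ts" in the lowercase part is harmless: the order of the characters does not matter for this check, only their presence.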
`%>%` = magrittr::`%>%`
txt = tesseract::ocr(img, engine = engine4) %>%
  stringr::str_split_fixed(., '\n', Inf) %>%
  as.character(.)
stringr::str_sub(txt[5],61,-1) # (part of) line 5 scanned with eng engine
#> [1] "8 25-7-18 12-11-18 243-19 27-5-19"
stringr::str_sub(txt[6],51,-1) # (part of) line 6 scanned with eng engine
#> [1] " 1,25 114 113 0,87"
We see that this is still not good enough (see Figure 1). Just as in the earlier post, we need to improve the results by preprocessing the image before doing the actual scan.
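The post-processing in the pipeline above splits the single string returned by the scan into lines and then takes the tail of one line. A minimal base R sketch of the same idea follows. The text in `ocr_text` is made up for illustration and is not output of the scan.

```r
# Hypothetical OCR result: one string with embedded newlines.
ocr_text <- paste0(
  paste(c("Header line",
          sprintf("%-10s%s", "Name", "25-7-18"),   # label padded to 10 columns
          sprintf("%-10s%s", "Value", "1,25")),
        collapse = "\n"),
  "\n"
)

# Split into lines; strsplit drops the empty piece after the final newline.
lines <- strsplit(ocr_text, "\n", fixed = TRUE)[[1]]

# Everything from column 11 to the end of line 2.
# (stringr::str_sub uses end = -1 for this; base R substring() defaults to the end.)
substring(lines[2], 11)
#> [1] "25-7-18"
```

Selecting by fixed column positions, as done here and in the post, only works as long as the layout of the scanned document is stable.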
Preprocess the image
The package magick has a lot of functions to handle images. I combined some of these in a function to improve the readability of the image. Because I did not know beforehand which of them I would use, I parametrised the function with a specification list.
clean_up <- function (img,myoptions) {
  force(myoptions)
  # optional: trim the borders (fuzz = colour tolerance)
  if (!is.null(myoptions$trim)) {
    img = magick::image_trim(img,fuzz = myoptions$trim)
  }
  # optional: resize to the given geometry, e.g. "4000x"
  if (!is.null(myoptions$resize)) {
    img = magick::image_resize(img,myoptions$resize)
  }
  # brightness, saturation and hue default to 100 (no change)
  if (!is.null(myoptions$brightness)) {
    brightness = myoptions$brightness
  } else {
    brightness = 100
  }
  if (!is.null(myoptions$saturation)) {
    saturation = myoptions$saturation
  } else {
    saturation = 100
  }
  if (!is.null(myoptions$hue)) {
    hue = myoptions$hue
  } else {
    hue = 100
  }
  img = magick::image_modulate(img,
    brightness=brightness, saturation=saturation, hue=hue)
  # optional: increase the contrast
  if (!is.null(myoptions$sharpen)) {
    img = magick::image_contrast(img,sharpen=myoptions$sharpen)
  }
  # replace near-white pixels (fuzz 25) by pure white
  img = magick::image_background(
    magick::image_transparent(img, 'white', fuzz = 25), 'white')
  # convert to gray (always done, regardless of the options)
  img = magick::image_quantize(img,colorspace ="gray")
  # replace near-black pixels (fuzz 75) by pure black
  img = magick::image_background(
    magick::image_transparent(img, 'black', fuzz =75), 'black')
  # optional: reduce noise
  if ( (!is.null(myoptions$enhance)) && myoptions$enhance == TRUE) {
    img = magick::image_enhance(img)
  }
  img
}
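The if/else blocks that fill in the defaults for brightness, saturation and hue could also be written more compactly. The following sketch shows one possible alternative using utils::modifyList. The helper name `with_defaults` is hypothetical and it is not used elsewhere in this post.

```r
# Hypothetical helper: merge the user options into a list of defaults.
with_defaults <- function(myoptions) {
  defaults <- list(brightness = 100, saturation = 100, hue = 100)
  # Entries in myoptions overwrite the defaults; other entries are appended.
  utils::modifyList(defaults, myoptions)
}

opts <- with_defaults(list(resize = "4000x", trim = 10, brightness = 120))
opts$brightness  # 120, taken from the user-supplied list
opts$hue         # 100, filled in from the defaults
opts$resize      # "4000x", passed through unchanged
```

With such a helper the function body could read opts$brightness directly, while the optional steps (trim, resize, sharpen, enhance) would still be tested with is.null as before.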
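Clean the image with the following parameters (convert_type is not used by clean_up as defined above; the conversion to gray is always done by image_quantize):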
clean_options = list(resize="4000x",convert_type='Grayscale',
trim=10,enhance=TRUE,sharpen=1)
img2 = clean_up(img, clean_options)
The relevant part of the image then looks like Figure 2:
Scan the cleansed image with Tesseract 4
txt = tesseract::ocr(img2, engine = engine4) %>%
  stringr::str_split_fixed(., '\n', Inf) %>%
  as.character(.)
stringr::str_sub(txt[5],63,-1) # line 5 scanned with eng engine
#> [1] " 25-17-18 12-11-18 24-3-19 27-5-19"
stringr::str_sub(txt[6],51,-1) # line 6 scanned with eng engine
#> [1] " 1,25 1,14 1,13 0,87"
All characters are now correctly converted. For privacy reasons only about 25% of the document is shown, but all dates and numbers were converted correctly for this document (and for 16 others with the same characteristics).
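To use the scanned numbers in further calculations, the decimal commas have to be converted. This is a minimal base R sketch, assuming the fields are separated by whitespace as in the part of line 6 shown above. The string is copied from that output.

```r
# Part of line 6 as returned by the scan of the cleaned image.
line6 <- " 1,25 1,14 1,13 0,87"

# Split on whitespace after removing leading and trailing blanks.
fields <- strsplit(trimws(line6), "\\s+")[[1]]

# Replace the decimal comma by a point and convert to numeric.
values <- as.numeric(sub(",", ".", fields, fixed = TRUE))
values
#> [1] 1.25 1.14 1.13 0.87
```

Any field that is not a valid number becomes NA (with a warning), which is a cheap way to spot remaining scan errors.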
Conclusion
It was necessary to clean the image to get a good scan.
However, there is no need to fall back to an older version of Tesseract to use a whitelist.
Session Info
sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] HOQCutil_0.1.10 jsonlite_1.6 glue_1.3.1 purrr_0.3.2
#> [5] xml2_1.2.2 ggspatial_1.0.3 ggplot2_3.2.1 sf_0.7-7
#> [9] dplyr_0.8.3 stringr_1.4.0 osmdata_0.1.1
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.2 lubridate_1.7.4 lattice_0.20-38
#> [4] tidyr_1.0.0 png_0.1-7 class_7.3-15
#> [7] assertthat_0.2.1 zeallot_0.1.0 digest_0.6.20
#> [10] utf8_1.1.4 R6_2.4.0 cellranger_1.1.0
#> [13] plyr_1.8.4 backports_1.1.4 evaluate_0.14
#> [16] e1071_1.7-0 httr_1.4.1 highr_0.8
#> [19] blogdown_0.15 pillar_1.4.2 rlang_0.4.0
#> [22] lazyeval_0.2.1 curl_4.0 readxl_1.3.1
#> [25] magick_2.2 rmarkdown_1.15 rgdal_1.4-4
#> [28] munsell_0.5.0 rosm_0.2.5 compiler_3.6.0
#> [31] xfun_0.8 pkgconfig_2.0.2 prettymapr_0.2.2
#> [34] htmltools_0.3.6 tidyselect_0.2.5 tibble_2.1.3
#> [37] fansi_0.4.0 crayon_1.3.4 withr_2.1.2
#> [40] rappdirs_0.3.1 grid_3.6.0 gtable_0.3.0
#> [43] lifecycle_0.1.0 DBI_1.0.0 magrittr_1.5
#> [46] units_0.6-2 scales_1.0.0 KernSmooth_2.23-15
#> [49] cli_1.1.0 stringi_1.4.3 fs_1.3.1
#> [52] sp_1.3-1 vctrs_0.2.0 captioner_2.2.3
#> [55] tools_3.6.0 tesseract_4.1 colorspace_1.4-1
#> [58] classInt_0.3-3 rvest_0.3.4 knitr_1.24