Date last run: 07Dec2019
Introduction
I posted before about using magick and tesseract in postings here and here. Looking again at tesseract because the inputs were changed, I noticed two possibilities to improve the workflow:
- using the
tesseract::ocr_data
function that returns a data.frame with the scanned text and a confidence rate and a bounding box for this text - the
HOCR
argument of thetesseract::ocr
function that returns an XHTML document with the same elements astesseract::ocr_data
but with an additional line number
When I used the tesseract::ocr_data
, I needed to derive the line number from the bounding box, so I decided to go for the tesseract::ocr(HOCR=T)
option. A part of the XHTML document is listed here:
<span class='ocr_line' id='line_1_5' title="bbox 19 227 1087 251; baseline 0 -8; x_size 14.113145; x_descenders 3.1131446; x_ascenders 3.6521802">
<span class='ocrx_word' id='word_1_28' title='bbox 19 227 147 251; x_wconf 90'>Standaardbereik</span>
<span class='ocrx_word' id='word_1_29' title='bbox 198 232 242 243; x_wconf 86'>1418</span>
On the internet I found a snippet by Jeroen Ooms that converts the XHTML document to a data.frame. I created a function from the snippet in which I also extracted the line number and made the extraction of the confidence rate conditional. In the previous blogs I used scans to extract the results of medical (blood) tests.
Including cleanup_bw
, scan_with_hocr
and extract_table
in package
The functions used to do the extraction of the medical test results, were generalized for extracting a table from a scan. And being more general they were included in the R-package HOQCutil . Using among others these functions the total workflow then is:
- read an image from file with
magick::image_read
e.g.img1 = magick::image_read('example1.png')
- define the list with cleanup options
e.g.cln_options1 = list(resize="4000x",trim=10,enhance=TRUE,sharpen=1)
- use the HOQCutil::cleanup_bw function with this list
e.g.img2 = HOQCutil::cleanup_bw (img1,cln_options1)
- scan (OCR) the cleansed image with HOQCutil::scan_with_hocr
e.g.df1 = HOQCutil::scan_with_hocr(img2,add_header_cols=F)
- indicate in the columns of
df1
which fields belong to the table headers (or alternatively define aheaders
list) - extract the table with the HOQCutil::extract_table function
e.g.df2= HOQCutil::extract_table(df1, headers=NULL,lastline = Inf, desc_above=T)
or alternatively
df2= HOQCutil::extract_table(df1, headers=hdr_desc,lastline = Inf, desc_above=T)
The functions are (of course) described in the help
of the HOQCutil package under the name of the function or under scanner_functions
. The remainder of this blog shows how this works on an example file.
Example
We will try to scan the table in Figure 1 . We created the .PNG file ourselves (from an MS Excel sheet) so we could give it some ‘features’:
- headers are placed over two lines and contain two words (except last header)
- the last header contains three words
- two descriptions are placed over two lines
- we used the ‘Verdana’ font. With other fonts the scans were far worse!
Read the image from file
filename = 'example.PNG'
img1 = magick::image_read(filename)
Cleanup the file
The result can be seen in Figure 2 .
cln_options1 = list(resize="4000x",trim=10,enhance=TRUE,sharpen=1)
img2 = HOQCutil::cleanup_bw(img1,cln_options1)
img2
OCR the cleansed image
In this step the actual scanning takes place. The results are not too bad but the first and last (is this a coincidence?) row of the table are not correctly scanned. We will show the result (inTable 1) after the next step.
df1 = HOQCutil::scan_with_hocr(img2,add_header_cols = T)
Prepare for extracting the table
In the previous step we added two columns (header_col
and header_col_seq
) to the scan results. We use these to indicate which elements will be the header of the table. This could be done interactively by using edit
but here we do it by coding:
df1$header_col[c(9,10,11,12,13,14,15,16,17)]= c(1,2,3,4,4,1,2,3,4)
The resulting data.frame can be seen in Table 1:.
line | fld_nr | word | x1 | y1 | y2 | x2 | header_col | header_col_seq |
---|---|---|---|---|---|---|---|---|
1 | 1 | Tnis | 9 | 0 | 101 | 251 | 0 | 1 |
1 | 2 | is | 303 | 0 | 101 | 389 | 0 | 1 |
1 | 3 | an | 442 | 27 | 101 | 577 | 0 | 1 |
1 | 4 | example | 630 | 0 | 120 | 1143 | 0 | 1 |
1 | 5 | of | 1192 | 0 | 101 | 1311 | 0 | 1 |
1 | 6 | a | 1350 | 27 | 101 | 1415 | 0 | 1 |
1 | 7 | table | 1461 | 0 | 101 | 1765 | 0 | 1 |
1 | 8 | (Verdana) | 1815 | 0 | 120 | 2412 | 0 | 1 |
2 | 1 | jan | 1299 | 314 | 437 | 1481 | 1 | 1 |
2 | 2 | feb | 2005 | 310 | 441 | 2176 | 2 | 1 |
2 | 3 | mar | 2693 | 310 | 441 | 2908 | 3 | 1 |
2 | 4 | first | 3259 | 314 | 417 | 3497 | 4 | 1 |
2 | 5 | quarter | 3543 | 325 | 436 | 3992 | 4 | 1 |
3 | 1 | 2019 | 1253 | 478 | 609 | 1533 | 1 | 1 |
3 | 2 | 2019 | 1954 | 478 | 609 | 2223 | 2 | 1 |
3 | 3 | 2019 | 2665 | 478 | 609 | 2945 | 3 | 1 |
3 | 4 | 2019 | 3475 | 492 | 585 | 3767 | 4 | 1 |
4 | 1 | city | 575 | 646 | 777 | 789 | 0 | 1 |
4 | 2 | 101 | 1302 | 660 | 753 | 1495 | 0 | 1 |
4 | 3 | 102 | 2004 | 660 | 753 | 2216 | 0 | 1 |
4 | 4 | 103 | 2705 | 660 | 753 | 2917 | 0 | 1 |
4 | 5 | 306 | 3515 | 660 | 753 | 3728 | 0 | 1 |
5 | 1 | 1 | 591 | 819 | 911 | 643 | 0 | 1 |
6 | 1 | city | 575 | 957 | 1078 | 789 | 0 | 1 |
6 | 2 | 2 | 838 | 966 | 1059 | 902 | 0 | 1 |
6 | 3 | 203 | 1292 | 966 | 1059 | 1505 | 0 | 1 |
6 | 4 | 205 | 2695 | 966 | 1059 | 2916 | 0 | 1 |
6 | 5 | 408 | 3513 | 966 | 1059 | 3727 | 0 | 1 |
7 | 1 | provinc | 580 | 1115 | 1236 | 1018 | 0 | 1 |
7 | 2 | complete | 1118 | 1115 | 1236 | 1676 | 0 | 1 |
7 | 3 | incomplete | 1775 | 1115 | 1236 | 2436 | 0 | 1 |
7 | 4 | complete | 2531 | 1115 | 1236 | 3078 | 0 | 1 |
7 | 5 | incomplete | 3296 | 1115 | 1236 | 3947 | 0 | 1 |
8 | 1 | el | 581 | 1283 | 1373 | 762 | 0 | 1 |
Extract the table
The last step is done with the HOQCutil::extract_table
function. The resulting table is listed in Table 2. We use desc_above=F
because we see in Figure 1 (and Figure 2) that the descriptions are not above the data line. See e.g. the ‘1’ for the first city line and the ‘e 1’ for the last data line.
df2 = HOQCutil::extract_table(df1, headers=NULL, lastline = Inf, desc_above=F)
We also see in Table 2 that the description ‘provinc el’ is not correct. With a perfect scan this would have been ‘provinc e 1’, because descriptions are pasted with a space as separator. So in this case a small correction will have to be made. Also the NA
(because no data was present in that cell) should be replaced by whatever the user thinks feasible.
Lesson: always check the results of the scan.
desc | jan 2019 | feb 2019 | mar 2019 | first quarter 2019 |
---|---|---|---|---|
city 1 | 101 | 102 | 103 | 306 |
city 2 | 203 | NA | 205 | 408 |
provinc el | complete | incomplete | complete | incomplete |
Alternative for the extract
In the previous step we could use the column header_col
that we filled in an earlier step with the information concerning headers. However we can also specify this information in list
-form:
hdr_desc = list(
list(c(2,1),c(3,1)),
list(c(2,2),c(3,2)),
list(c(2,3),c(3,3)),
list(c(2,4),c(2,5),c(3,4))
)
df3 = HOQCutil::extract_table(df1,headers=hdr_desc, lastline = Inf, desc_above=T)
In Table 3 we see that this leads to the same headers.
We also show the effect of specifying (wrongly) desc_above=T
.
desc | jan 2019 | feb 2019 | mar 2019 | first quarter 2019 |
---|---|---|---|---|
city | 101 | 102 | 103 | 306 |
1 city 2 | 203 | NA | 205 | 408 |
provinc | complete | incomplete | complete | incomplete |
Session Info
This document was produced on 07Dec2019 with the following R environment:
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.3 knitr_1.26 xml2_1.2.2 magrittr_1.5
#> [5] rappdirs_0.3.1 tidyselect_0.2.5 tesseract_4.1 HOQCutil_0.1.15
#> [9] R6_2.4.1 rlang_0.4.2 stringr_1.4.0 highr_0.8
#> [13] dplyr_0.8.3 tools_3.6.0 xfun_0.10 ellipsis_0.3.0
#> [17] htmltools_0.4.0 digest_0.6.23 assertthat_0.2.1 lifecycle_0.1.0
#> [21] tibble_2.1.3 crayon_1.3.4 tidyr_1.0.0 purrr_0.3.3
#> [25] captioner_2.2.3 vctrs_0.2.0 zeallot_0.1.0 glue_1.3.1
#> [29] evaluate_0.14 rmarkdown_1.18 stringi_1.4.3 pillar_1.4.2
#> [33] compiler_3.6.0 backports_1.1.5 magick_2.2 pkgconfig_2.0.3