Optical Character Recognition with imagerExtra
Shota Ochi
2018-07-21
We can do optical character recognition (OCR) in R with the tesseract package.
The ocr function of tesseract works best on images with high contrast, little noise, and horizontal text.
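As a point of reference, calling tesseract directly on an image file might look like the sketch below. The file name scan.png is a placeholder I made up, not a file from this post, and the calls are skipped if no such file exists.

```r
# Placeholder path (an assumption for illustration); point it at a real image.
image_path <- "scan.png"

if (file.exists(image_path)) {
  # Load the English language model.
  eng <- tesseract::tesseract("eng")

  # ocr() returns the recognized text as a character string.
  text <- tesseract::ocr(image_path, engine = eng)
  cat(text)

  # ocr_data() returns one row per word, with confidence and bounding box.
  words <- tesseract::ocr_data(image_path, engine = eng)
  print(words)
}
```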
So how should we extract text from a degraded image?
One way is to enhance the contrast and denoise the image with imagerExtra before calling the ocr function.
With this approach, it is convenient to have a function that acts as a shortcut to the ocr function.
That is why I implemented OCR and OCR_data, which are shortcuts to the ocr and ocr_data functions of tesseract.
You can use them as shown below.
library(devtools)
install_github("ShotaOchi/imagerExtra")
library(imagerExtra)
layout(matrix(1:2, 1, 2, byrow=TRUE))
plot(papers, main = "Original")
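# Denoise, then binarize with adaptive thresholding; plot separately so that
# hello holds the processed image rather than the return value of plot().
hello <- DenoiseDCT(papers, 0.01) %>% ThresholdAdaptive(., 0.1, range = c(0,1))
plot(hello, main = "Cleaned")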
OCR(hello) %>% cat
Hello
OCR_data(hello)
# A tibble: 1 x 3
  word  confidence bbox
  <chr>      <dbl> <chr>
1 Hello       69.1 8,9,118,54
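To give a feel for what a shortcut like OCR has to do, here is a minimal sketch of one possible approach: write the image to a temporary PNG file and pass that file to tesseract. The helper name ocr_via_tempfile is hypothetical, and I have not checked that imagerExtra is implemented this way, so treat it as an illustration under those assumptions rather than the package's actual code.

```r
# Hypothetical helper for illustration only; NOT the imagerExtra source code.
# Assumes img is a grayscale cimg or a pixset (as returned by ThresholdAdaptive).
ocr_via_tempfile <- function(img, engine = tesseract::tesseract("eng")) {
  # A pixset is logical; convert it to a numeric cimg before saving.
  if (inherits(img, "pixset")) {
    img <- imager::as.cimg(img)
  }
  stopifnot(inherits(img, "cimg"))

  # Write to a temporary PNG and make sure it is removed afterwards.
  tmp <- tempfile(fileext = ".png")
  on.exit(unlink(tmp), add = TRUE)
  imager::save.image(img, tmp)

  # Hand the file to tesseract and return the recognized text.
  tesseract::ocr(tmp, engine = engine)
}

# Usage, given the hello image created earlier:
# cat(ocr_via_tempfile(hello))
```

Going through a temporary file keeps the sketch simple, because tesseract's ocr function accepts a file path. The same pattern would apply to ocr_data.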