Tesseract and Magick: High Quality OCR in R
Last week we released an update of the tesseract package to CRAN. This package provides R bindings to Google’s OCR library Tesseract.
The new version ships with the latest libtesseract 3.05.01 on Windows and MacOS. Furthermore it includes enhancements for managing language data and using tesseract together with the magick package.
Installing Language Data
The new version has several improvements for installing additional language data. On Windows and MacOS you use the
tesseract_download() function to install additional languages:
Language data are now stored in
rappdirs::user_data_dir('tesseract') which makes it persist across updates of the package. To OCR french text:
french <- tesseract("fra") text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french) cat(text)
Tesseract and Magick
The tesseract developers recommend to clean up the image before OCR’ing it to improve the quality of the output. This involves things like cropping out the text area, rescaling, increasing contrast, etc.
The rOpenSci magick package is perfectly suitable for this task. The latest version contains a convenient wrapper
image_ocr() that works with pipes.
Let’s give it a try on some example scans:
# Requires devel version of magick # devtools::install_github("ropensci/magick") # Test it library(magick) library(magrittr) text <- image_read("https://courses.cs.vt.edu/csonline/AI/Lessons/VisualProcessing/OCRscans_files/bowers.jpg") %>% image_resize("2000") %>% image_convert(colorspace = 'gray') %>% image_trim() %>% image_ocr() cat(text)
The Llfe and Work of Fredson Bowers by G. THOMAS TANSELLE N EVERY FIELD OF ENDEAVOR THERE ARE A FEW FIGURES WHOSE ACCOM- plishment and inﬂuence cause them to be the symbols of their age; their careers and oeuvres become the touchstones by which the ﬁeld is measured and its history told. In the related pursuits of analytical and descriptive bibliography, textual criticism, and scholarly editing, Fredson Bowers was such a ﬁgure, dominating the four decades after 1949, when his Principles of Bibliographical Description was pub- lished. By 1973 the period was already being called “the age of Bowers”: in that year Norman Sanders, writing the chapter on textual scholarship for Stanley Wells's Shakespeare: Select Bibliographies, gave this title to a section of his essay. For most people, it would be achievement enough to rise to such a position in a ﬁeld as complex as Shakespearean textual studies; but Bowers played an equally important role in other areas. Editors of ninetcemh-cemury American authors, for example, would also have to call the recent past “the age of Bowers," as would the writers of descriptive bibliographies of authors and presses. His ubiquity in the broad ﬁeld of bibliographical and textual study, his seemingly com- plete possession of it, distinguished him from his illustrious predeces- sors and made him the personiﬁcation of bibliographical scholarship in his time. \Vhen in 1969 Bowers was awarded the Gold Medal of the Biblio- graphical Society in London, John Carter’s citation referred to the Principles as “majestic," called Bowers's current projects “formidable," said that he had “imposed critical discipline" on the texts of several authors, described Studies in Bibliography as a “great and continuing achievement," and included among his characteristics "uncompromising seriousness of purpose” and “professional intensity." Bowers was not unaccustomed to such encomia, but he had also experienced his share of attacks: his scholarly positions were not universally popular, and he expressed them with an aggressiveness that almost seemed calculated to
Not bad but not perfect. Can you do a better job?