At reinteractive we have recently completed a project calling for us to use OCR (Optical Character Recognition) technology to recognise printed text from photographs. It’s a fun problem to solve, and so here is a brief post on how you can also set up your Rails app with OCR capabilities.
Tesseract
Tesseract is one of the most popular OCR libraries. It’s free and open source, runs on multiple platforms, supports a lot of languages, and its ongoing development is sponsored by Google. It is primarily a command line tool (although there are third-party projects that supply a GUI), and, luckily for us, there are a couple of Ruby gems out there allowing us to interact with it from a Ruby/Rails app. For this post, we will use https://github.com/meh/ruby-tesseract-ocr.
Set Up
First, you will need to install Tesseract. Tesseract is now at version 4.0.0, but this gem is only compatible up to version 3.02.02, so you will need to install 3.02.02 or earlier. You can do this with your favourite package manager, such as Homebrew (brew install tesseract), but check which version you end up with, since a package manager may default to a newer release.
Next, add the Ruby gem to your app using Bundler. Add gem 'tesseract-ocr' to your Gemfile, and then run bundle install.
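Because the version constraint is easy to trip over, it can help to check it in code. The following is a minimal sketch, not part of the gem: the helper names parse_tesseract_version and tesseract_compatible? are made up for this post, and it assumes the version output contains text of the form "tesseract 3.02.02".

```ruby
require 'rubygems' # provides Gem::Version (already loaded in most Ruby setups)

# Newest Tesseract release this post says the gem works with.
MAX_SUPPORTED_TESSERACT = Gem::Version.new('3.02.02')

# Hypothetical helper: pulls the version number out of text such as
# "tesseract 3.02.02". Returns nil when no version can be found.
def parse_tesseract_version(output)
  match = output.match(/tesseract\s+v?(\d+(?:\.\d+)+)/i)
  match && Gem::Version.new(match[1])
end

# Hypothetical helper: true when the reported version is at or below
# the supported maximum.
def tesseract_compatible?(output)
  version = parse_tesseract_version(output)
  !version.nil? && version <= MAX_SUPPORTED_TESSERACT
end
```

To feed it real output, you could capture the result of running tesseract --version with Open3.capture2e('tesseract', '--version'), which merges stdout and stderr so it does not matter which stream a given release writes to.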
Code
At its most basic, this is all you need to do to OCR an image:
ruby
tesseract = Tesseract::Engine.new do |config|
  config.language = :eng
end
# You can also pass an IO object, or even an ImageMagick image.
# Tesseract allows any image format supported by the Leptonica library.
tesseract.text_for('path/to/image.jpg')
The text_for method is the simplest way of using the gem; it returns all the text it can find as a single string. However, you can also work at finer levels of granularity (i.e., blocks, paragraphs, lines, words, and symbols). Accessor methods are supplied for each level (each_paragraph, each_line, etc.) and they all work the same way. Once you have decided which level to use (the examples below use lines), there are two ways to get the results:
You can execute a block for each paragraph/line/etc:
ruby
tesseract.image = 'path/to/image.jpg'
tesseract.each_line do |line|
  puts line.text
end
Or you can get an array of each paragraph/line/etc:
ruby
tesseract.image = 'path/to/image.jpg'
tesseract.lines.each do |line|
  puts line.text
end
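In a Rails app you may prefer to keep these calls behind a small object of your own, so that controllers and jobs do not talk to the engine directly. Here is a rough sketch of that idea. The class name LineReader is made up for illustration, and it assumes only that the engine responds to image= and lines, as in the examples above.

```ruby
# Hypothetical wrapper (not part of the gem). `engine` is anything that
# responds to `image=` and `lines`, such as the engine built earlier.
class LineReader
  def initialize(engine)
    @engine = engine
  end

  # Returns the stripped, non-blank text of every line found in the image.
  def call(image_path)
    @engine.image = image_path
    @engine.lines.map { |line| line.text.to_s.strip }.reject(&:empty?)
  end
end
```

Usage would look like LineReader.new(tesseract).call('path/to/image.jpg'). Passing the engine in also makes the class easy to test with a stand-in object, without Tesseract installed.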
Once you have the results, whether yielded or returned, you can inspect them to see how accurate the OCR was (there are more methods than just these three, but these are the most important ones):
ruby
# The OCR'd text.
> line.text
=> "Lorem ipsum dolor sit amet..."
# The coordinates of the element on the image. You can get the position and size with methods such as left, width, etc.
> line.bounding_box
=> #<BoundingBox(20, 62): 1421x558>
# How confident Tesseract is that the text is correct.
> line.confidence
=> 47.571746826171875
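One way to put the confidence value to work is to separate the lines you are willing to trust from the ones that need a second look. This is only a sketch: OcrLine is a stand-in for the objects the gem returns, the sample data is invented, and the cut-off of 60.0 is an arbitrary number that assumes the 0 to 100 scale suggested by the value shown above.

```ruby
# Stand-in for the objects the gem returns; only the two readers used below.
OcrLine = Struct.new(:text, :confidence)

# Hypothetical cut-off; tune it against your own images.
MIN_CONFIDENCE = 60.0

# Splits lines into [accepted, needs_review] based on reported confidence.
def partition_by_confidence(lines, threshold = MIN_CONFIDENCE)
  lines.partition { |line| line.confidence >= threshold }
end

# Invented sample data for illustration only.
lines = [
  OcrLine.new('Patient name', 91.2),
  OcrLine.new('L0rem ipsurn', 47.6)
]
accepted, needs_review = partition_by_confidence(lines)
```

With real results, you would pass tesseract.lines in place of the sample array and decide what to do with the second group, for example flagging it for manual review.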
Accuracy
The above is all you need to get results from Tesseract. However, the real issue is accuracy. The accuracy of the results will depend on a number of factors, such as the quality of the image (is it a photograph or a scan?), shadows, rotation, etc. You may need to preprocess the image in order to increase the accuracy of the output. You might find it helpful to use RMagick and ImageMagick to crop, rotate, or resize images before running them through Tesseract. For example, in my use case, I needed to OCR labels issued by hospitals with patient information on them, rather than standard documents with lines and paragraphs of text. I found it helpful to crop out only the fragments of the labels that I needed, in order to prevent Tesseract from getting thrown off by barcodes and other odd symbols.
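To make the cropping idea a little more concrete, here is a sketch of working out the crop region for a fragment of interest. It is not the code from my project: crop_geometry is a made-up helper, and the numbers in the usage note are invented. It pads the region by a few pixels, clamps it to the image bounds, and returns an ImageMagick-style geometry string in the form WIDTHxHEIGHT+X+Y.

```ruby
# Hypothetical helper: builds an ImageMagick-style geometry string
# ("WxH+X+Y") for a region of interest, padded by `padding` pixels on
# every side and clamped so it never extends past the image edges.
def crop_geometry(image_width:, image_height:, left:, top:, width:, height:, padding: 0)
  x1 = [left - padding, 0].max
  y1 = [top - padding, 0].max
  x2 = [left + width + padding, image_width].min
  y2 = [top + height + padding, image_height].min
  raise ArgumentError, 'region lies outside the image' if x2 <= x1 || y2 <= y1

  "#{x2 - x1}x#{y2 - y1}+#{x1}+#{y1}"
end
```

For a 1000x800 image and a 400x100 region at (20, 62) with 30 pixels of padding, this returns "450x160+0+32", which could be handed to ImageMagick on the command line as convert label.jpg -crop 450x160+0+32 +repage cropped.jpg before passing cropped.jpg to Tesseract.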
Your experience with Tesseract will thus depend on the quality of your input images, and on how well you are able to clean them up before running them through Tesseract. However, if your inputs are good, then the excellent OCR capabilities provided by Tesseract and the simple API provided by this gem should make recognising text from your images a breeze.