RTesseract: How to return recognized areas in an array instead of joined
Recently, I was working on a project where we experimented with RTesseract for OCR of information from scanned documents. A document would be uploaded, we had a template of areas, and we’d run through the uploaded document and recognize the text in each area. The rtesseract gem was chosen because it’s simple, and its setup was simpler than the other one, the name of which escapes me at the moment.
One thing we did need to change was the return value: we wanted the gem to return an array with the text recognized in each area instead of returning everything as one string.
Luckily, the gem itself does a .join to return a single string, so all we needed to do was override the return method and remove the join part.
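To make the difference concrete, here is a minimal sketch in plain Ruby, with no OCR involved. The strings in area_texts, the field names in labels, and the variable names are all made up for illustration; real per-area output depends on the scanned document.

```ruby
# Made-up per-area OCR output; real output depends on the scanned document.
area_texts = ["Jane Doe\n\n", " 2016-03-01\n", "INV-0042 \n"]

# Joining gives one string, so the boundaries between areas are lost.
joined = area_texts.join

# Mapping with strip gives one cleaned string per area, in area order.
fields = area_texts.map(&:strip)

# Hypothetical field names, standing in for an areas template.
labels = %i[name date invoice_number]
record = labels.zip(fields).to_h

puts joined.inspect
puts fields.inspect
puts record.inspect
```

With an array, each recognized value can be matched back to the area it came from, which is what we needed for the template.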
# config/initializers/rtesseract_mixed.rb
class RTesseract
  # Class to read an image from specified areas
  class Mixed
    # This is the return method from the gem itself.
    # Original source can be found here: https://github.com/dannnylo/rtesseract/blob/cf07cad5ec2d3ad6011f84bdd6426f917c4833cc/lib/rtesseract/mixed.rb
    def to_s
      # Return the cached result if the areas were already converted.
      return @value if @value != ''
      if @source.file?
        convert
        # The following line replaced @value.join.
        # Assigning back to @value keeps later (cached) calls stripped too.
        @value = @value.map(&:strip)
      else
        fail RTesseract::ImageNotSelectedError.new(@source)
      end
    end
  end
end
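
If you want to see the reopen-and-override mechanism in isolation, here is a small sketch that runs without the gem. AreaReader is a hypothetical stand-in, not part of rtesseract: its convert method just copies canned strings instead of doing any recognition, and its first to_s only mimics the join behaviour described above. In this sketch the override assigns the stripped array back to @value so that repeated calls return the same array.

```ruby
# Hypothetical stand-in for the gem's class; no OCR happens here.
class AreaReader
  def initialize(raw_texts)
    @raw_texts = raw_texts # canned per-area output instead of real recognition
    @value = ''
  end

  # Collects one string per area (assumed shape of the gem's convert step).
  def convert
    @value = @raw_texts.dup
  end

  # Original-style behaviour: everything joined into one string.
  def to_s
    return @value if @value != ''
    convert
    @value = @value.join
  end
end

# Reopening the class later, as an initializer would, replaces only to_s.
class AreaReader
  def to_s
    return @value if @value != ''
    convert
    @value = @value.map(&:strip)
  end
end

reader = AreaReader.new(["Jane Doe\n\n", " 2016-03-01\n"])
fields = reader.to_s
puts fields.inspect
```

Because Ruby classes are open, the second definition wins for every caller, so nothing else in the app has to change. The trade-off is that to_s no longer returns a String, which is worth keeping in mind if other code relies on that.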