MY NAME IS
Ognjen Regoje
BUT YOU CAN CALL ME OGGY


I make things that run on the web (mostly).

RTesseract: How to return recognized areas as an array instead of a joined string

Recently, I worked on a project where we experimented with RTesseract for OCR of information from scanned documents. A document would be uploaded, and we would run through it and recognize the text in each area defined by an areas template. We chose the rtesseract gem because it’s simple, and its setup was simpler than that of the alternative we looked at, the name of which escapes me at the moment.

One thing that did need to change was the return value: we needed the gem to return an array with the text recognized in each area, instead of everything joined into one string.
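To illustrate why the joined string gets in the way, here is a minimal sketch in plain Ruby. The field names and sample texts are made up for illustration and stand in for a real areas template and real OCR output; nothing here calls the gem.

```ruby
# Hypothetical template: one field name per area, in the order the areas are defined.
field_names = %i[name date total]

# Hypothetical per-area OCR output, with made-up trailing newlines.
recognized = ["Jane Doe\n\n", "2016-03-01\n\n", "42.00\n\n"]

# Joined into one string, the boundaries between areas are no longer explicit.
joined = recognized.join

# Kept as an array, each text can be paired with its area by position.
fields = field_names.zip(recognized.map(&:strip)).to_h

puts joined.inspect
puts fields.inspect
```

With the array form, each recognized text stays tied to the area it came from, which is what a template-driven workflow needs.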

Luckily, the gem itself calls .join on the per-area results to return a single string, so all we needed to do was override its to_s method and remove the join.

# config/initializers/rtesseract_mixed.rb

class RTesseract
  # Class to read an image from specified areas
  class Mixed

    # Override of the gem's own to_s, which returns the recognized text.
    # Original source can be found here: https://github.com/dannnylo/rtesseract/blob/cf07cad5ec2d3ad6011f84bdd6426f917c4833cc/lib/rtesseract/mixed.rb
    def to_s
      return @value if @value != ''
      if @source.file?
        convert
        # The following line replaced @value.join. Assigning the result back
        # to @value means repeated calls return the same stripped array.
        @value = @value.map(&:strip)
      else
        fail RTesseract::ImageNotSelectedError.new(@source)
      end
    end
  end
end
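
The reopen-and-override pattern can be tried without the gem or the tesseract binary. The sketch below uses a hypothetical stand-in class, AreaReader, which is not part of rtesseract; it only assumes the gem's class has the same general shape as the to_s shown above (a convert step that fills @value, followed by a join).

```ruby
# Hypothetical stand-in for a gem class that joins its per-area results.
class AreaReader
  def initialize(texts)
    @texts = texts
    @value = ''
  end

  # Pretend OCR: collect one string per area.
  def convert
    @value = @texts.dup
  end

  # Original behaviour: everything joined into a single string.
  def to_s
    return @value if @value != ''
    convert
    @value = @value.join
  end
end

# Reopening the class replaces only to_s; initialize and convert are untouched.
class AreaReader
  def to_s
    return @value if @value != ''
    convert
    # Assigning back to @value means a second call returns the same stripped array.
    @value = @value.map(&:strip)
  end
end

reader = AreaReader.new(["Jane Doe\n", " 2016-03-01\n"])
first = reader.to_s
second = reader.to_s
puts first.inspect
```

Because the override lives in an initializer-style file that reopens the class, the gem's source stays untouched and the change is easy to drop if the gem ever offers an array return itself.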

#rtesseract #ruby #technical #tesseract