Comment by angersock - Hacker Neue

angersock Jul 18, 2013

I come bearing gifts, if anyone would like to host some of this themselves.

This follows the API documented by Stampin (minus the throttling errors)--it does not currently do the OCR, but as mentioned elsewhere by zdw you can probably get tesseract to get you like 80% of the way there. If you wanted to use that, you'd likely just replace the hacky `pdftotext` callout with your preferred toolchain.
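For anyone attempting that swap, here is a rough sketch of an OCR path, not a tested drop-in. It assumes `pdftoppm` and `tesseract` are on the PATH and that your tesseract build can read PPM images. The helper names and the `TESSERACT_LANGS` mapping are made up for illustration.

```ruby
require 'tmpdir'

# Hypothetical mapping from the API's "lang" values to Tesseract language
# codes; extend it for whatever language packs you have installed.
TESSERACT_LANGS = { "en" => "eng" }

# Command that renders each PDF page to an image named "<prefix>-<n>.ppm".
def rasterize_command(pdf_path, prefix, dpi = 300)
  ["pdftoppm", "-r", dpi.to_s, pdf_path, prefix]
end

# Command that OCRs one image and writes "<out_base>.txt".
def tesseract_command(image_path, out_base, lang = "en")
  ["tesseract", image_path, out_base, "-l", TESSERACT_LANGS.fetch(lang, "eng")]
end

# Returns the recognized text of the whole PDF as an array of lines.
def ocr_lines(pdf_path, lang = "en")
  Dir.mktmpdir do |dir|
    prefix = File.join(dir, "page")
    system(*rasterize_command(pdf_path, prefix)) or raise "pdftoppm failed"

    # Page images are zero-padded, so a plain sort keeps them in page order.
    Dir.glob("#{prefix}-*.ppm").sort.flat_map do |image|
      out_base = image.sub(/\.ppm\z/, "")
      system(*tesseract_command(image, out_base, lang)) or raise "tesseract failed"
      File.read("#{out_base}.txt").split("\n")
    end
  end
end
```

In the route, the `pdftotext` call and the read of its output file would then collapse into something like `lines = ocr_lines(tmpfilename, lang)`.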

You'll need Ruby, Sinatra, and the Xpdf tools, I believe.

Dual-licensed under the AGPL, BSD, and WTFPL licenses. idklol.

The code:

  require 'sinatra'
  require 'json'

  use Rack::Logger

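  post '/extracttext' do

      begin
          # No upload, nothing to convert.
          halt 204 if params["file"].nil?

          # Accepted for API compatibility; unused until OCR is wired in.
          type = params["type"] || "text"
          lang = params["lang"] || "en"

          tmpfilename = params["file"][:tempfile].path
          txtfilename = "#{tmpfilename}.txt"

          # Name the output file explicitly (pdftotext otherwise derives it
          # from the input name) and skip the shell entirely.
          converted = system("pdftotext", tmpfilename, txtfilename)
          File.delete(tmpfilename)
          raise "pdftotext failed" unless converted

          lines = File.read(txtfilename).split("\n")
          File.delete(txtfilename)

          content_type "application/json"
          {"text"=>lines}.to_json

      rescue
          status 500 and return
      end
  end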
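To poke at it from another Ruby process, something like the following should work. It is a minimal client sketch using only the standard library. The base URL is an assumption (4567 is Sinatra's default port), and the helper names are hypothetical.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Builds the multipart POST for the /extracttext route. Pass an open File:
# its path supplies the filename that makes the server treat it as an upload.
def build_extract_request(uri, pdf_io, lang = "en")
  req = Net::HTTP::Post.new(uri.request_uri)
  req.set_form([["file", pdf_io], ["type", "text"], ["lang", lang]],
               "multipart/form-data")
  req
end

# Pulls the array of lines out of the JSON body the route returns.
def lines_from_response(body)
  JSON.parse(body).fetch("text")
end

# Uploads the PDF and returns its text as an array of lines.
def extract_text(pdf_path, base_url = "http://localhost:4567")
  uri = URI.join(base_url, "/extracttext")
  File.open(pdf_path, "rb") do |pdf|
    req = build_extract_request(uri, pdf)
    res = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
    raise "server returned #{res.code}" unless res.is_a?(Net::HTTPOK)
    lines_from_response(res.body)
  end
end

# Example (needs the server running and a real PDF on disk):
#   puts extract_text("some.pdf").first(5)
```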

EDIT:

For God's sake run this in a jail and only on an internal network!
