I come bearing gifts, if anyone would like to host some of this themselves.
This follows the API documented by Stampin (minus the throttling errors). It doesn't currently do OCR, but as zdw mentioned elsewhere, tesseract can probably get you about 80% of the way there. If you want to use it, you'd likely just replace the hacky `pdftotext` callout with your preferred toolchain.
You'll need Ruby, Sinatra, and the Xpdf tools, I believe.
Tri-licensed under the AGPL, BSD, and WTFPL licenses. idklol.
The code:
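    require 'sinatra'
    require 'json'

    use Rack::Logger

    post '/extracttext' do
      begin
        # No upload means nothing to extract.
        halt 204 if params["file"].nil?

        # Accepted to match the API; neither is used yet.
        type = params["type"] || "text"
        lang = params["lang"] || "en"

        tmpfilename = params["file"][:tempfile].path
        txtfilename = "#{tmpfilename}.txt"

        # Skip the shell and name the output file explicitly, so we don't
        # depend on how pdftotext derives it from the upload's extension.
        system("pdftotext", tmpfilename, txtfilename) or raise "pdftotext failed"
        lines = File.read(txtfilename).split("\n")

        content_type "application/json"
        {"text" => lines}.to_json
      rescue
        halt 500
      ensure
        # Clean up both files, even when the conversion blows up.
        [tmpfilename, txtfilename].compact.each { |f| File.delete(f) if File.exist?(f) }
      end
    end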
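If you do want to try the tesseract route, here's a rough, untested-in-anger sketch of what could replace the `pdftotext` callout. The assumptions are mine, not anything from Stampin's docs: tesseract won't read a PDF directly, so this rasterizes pages with `pdftoppm` first; the page images are assumed to come out as `<root>-<page>.ppm`; and `TESSERACT_LANGS` is a made-up mapping from the API's two-letter `lang` values to tesseract's three-letter codes. The method names are hypothetical too.

```ruby
require 'open3'
require 'tmpdir'

# Hypothetical mapping from the API's "lang" values to tesseract language codes.
TESSERACT_LANGS = { "en" => "eng", "de" => "deu", "fr" => "fra" }.freeze

# Command that rasterizes every page of the PDF into <image_root>-<page>.ppm.
def rasterize_command(pdf_path, image_root)
  ["pdftoppm", "-r", "300", pdf_path, image_root]
end

# Command that OCRs one page image and prints the recognized text to stdout.
# Unknown languages fall back to English.
def ocr_command(image_path, lang = "en")
  ["tesseract", image_path, "stdout", "-l", TESSERACT_LANGS.fetch(lang, "eng")]
end

# Runs both steps and returns the recognized text as an array of lines.
def ocr_pdf(pdf_path, lang = "en")
  Dir.mktmpdir do |dir|
    root = File.join(dir, "page")
    _out, err, status = Open3.capture3(*rasterize_command(pdf_path, root))
    raise "pdftoppm failed: #{err}" unless status.success?

    # Page numbers are zero-padded, so a plain sort keeps page order.
    Dir.glob("#{root}-*.ppm").sort.flat_map do |image|
      text, err, status = Open3.capture3(*ocr_command(image, lang))
      raise "tesseract failed: #{err}" unless status.success?
      text.split("\n")
    end
  end
end
```

In the route you'd then swap the `pdftotext` call and the file read for something like `lines = ocr_pdf(tmpfilename, lang)`. Building the commands as arrays keeps the shell out of it, same as you'd want for the `pdftotext` call.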
EDIT:
For God's sake run this in a jail and only on an internal network!
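For the "internal network only" part, one cheap extra layer is to have the app itself refuse callers from outside loopback and private ranges. This is only a sketch: `internal_address?` and `INTERNAL_RANGES` are names I made up, the range list is my guess at what "internal" means for you, and it is no substitute for the jail or a firewall.

```ruby
require 'ipaddr'

# Ranges treated as "internal": loopback plus the RFC 1918 private blocks.
INTERNAL_RANGES = %w[127.0.0.0/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 ::1/128]
  .map { |cidr| IPAddr.new(cidr) }

# True if the given address string falls inside one of the internal ranges.
# Anything that doesn't parse as an IP address counts as not internal.
def internal_address?(ip)
  addr = IPAddr.new(ip)
  # Only compare against ranges of the same address family (IPv4 vs IPv6).
  INTERNAL_RANGES.any? { |range| range.family == addr.family && range.include?(addr) }
rescue IPAddr::Error
  false
end
```

In the Sinatra app this could be hooked up with a `before` filter along the lines of `halt 403 unless internal_address?(request.ip)`. Keep in mind `request.ip` can be influenced by proxy headers, so treat this as a seatbelt, not the wall.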