A question on OCR for a poor-quality old document

mike cloaked mike.cloaked at gmail.com
Sun Jun 6 21:01:32 UTC 2010


I have a scanned PDF of an old document that was typewritten about
half a century ago. The scan is noisy and the letters are far from
clear. Most of the text can be made out by eye, but the document is
19 pages long, and I would like to OCR it into digital text to save
the eye strain and a lot of typing.

I have tried various routes, including converting the PDF to JPG,
TIFF and other formats after fiddling with it in GIMP to turn it
(not very well) from greyscale to monochrome as an indexed image
before running OCR. I have tried GOCR, OCRAD and gscan2pdf, but all
give pretty awful results with a very low success rate.
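
For anyone unsure what the greyscale-to-monochrome step involves, here
is a rough sketch of one common approach, a single global threshold
chosen by Otsu's method. This is only an illustration under stated
assumptions: it is not what GIMP does internally, the function names
otsu_threshold and binarize are made up for the example, and the page
is assumed to be already loaded as a flat list of grey levels (0-255).

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that best separates dark ink from light paper.

    Pixels at or below the returned level are treated as ink.
    """
    # Histogram of grey levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate level
    sum_dark = 0    # sum of their grey levels

    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break               # every pixel is already in the dark class

        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor); larger is better.
        score = w_dark * w_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score, best_t = score, t

    return best_t


def binarize(pixels, threshold):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Hypothetical sample: a few dark "ink" pixels and a few light "paper" pixels.
    sample = [30, 40, 35, 200, 210, 190]
    t = otsu_threshold(sample)
    print(t, binarize(sample, t))
```

A single global threshold like this tends to struggle on unevenly lit
or stained pages, which may be part of why the monochrome conversion
comes out badly on a noisy scan.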

Does anyone have any guidance, or a URL to point me to, that may help
with turning this scanned old document into a sensible plain-text
file on Fedora?

Thanks in advance for any tips.

-- 
mike c

