A question on OCR for bad old document?

Joachim Backes joachim.backes at rhrk.uni-kl.de
Sun Jun 13 08:28:25 UTC 2010


On 06/13/10 09:29, Joel Rees wrote:
> On Mon, Jun 7, 2010 at 8:25 AM, Jim <mickeyboa at sbcglobal.net> wrote:
>> On 06/06/2010 05:19 PM, Frank Cox wrote:
>>> On Sun, 2010-06-06 at 22:01 +0100, mike cloaked wrote:
>>>
>>>> I have a scanned pdf of a very old document which was typewritten
>>>> about half a century ago. The scanned copy is noisy and the letters
>>>> are far from clear. The text can be made out (mostly) by eye, but it
>>>> is 19 pages long and I would like to OCR it to get a digitised text to
>>>> save the eye strain and lots of typing.
>>>>
>>> You can't make a silk purse out of a sow's ear.
>>>
>>> If you are having difficulty reading the scan yourself, then you're
>>> probably out of luck getting the computer to OCR it for you.
>>>
>>> Your best bet is to retype it.  It's only 19 pages so it shouldn't take
>>> too long to type it again.  You'll spend far more time fiddling around
>>> (unsuccessfully) with OCR stuff than it will take to retype it anyway.
>>>
>> A scanned text document is not going to save properly in Xsane on
>> Linux, even if you use "gocr". Scanning and "Saving Text" is broken.
>>
>> As for how the text looks on your terminal after scanning, it always
>> looks bad. You have to "Save As" to get a good finished product, and
>> again, "Save As" Text is broken in Xsane. Only images turn out after
>> "Saving".
> 
> Can you use the "copy/paste" (Select the text and Edit->Copy) pipe?
> 
> (I suppose I should grab the current ocr downloads and give them a
> try. I have to say, it seems like about four years ago, all the open
> source ocr projects just stopped moving.)

I'm using *tesseract* to extract text from TIFF files (scanned with
xsane and saved as .tif), and I get good results. To install it:

yum install tesseract
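
Once it is installed, the OCR step itself might look roughly like the
sketch below. This is only an illustration under some assumptions: the
scans/ directory, the page file names, and the two helper functions are
hypothetical, and it assumes the English language data is available.
tesseract takes the input image and an output base name, and appends
.txt to that base name itself.

```shell
#!/bin/sh
# Sketch: run tesseract over every .tif saved from xsane.
# Assumption: the scans live in ./scans as page-01.tif, page-02.tif, ...
# (hypothetical names; adjust to wherever xsane saved the files).

# Hypothetical helper: derive the output base name from the image name.
# tesseract adds the .txt suffix on its own, so only .tif is stripped.
ocr_output_base() {
    printf '%s\n' "${1%.tif}"
}

# Hypothetical helper: OCR one page, writing <base>.txt next to the image.
ocr_page() {
    tesseract "$1" "$(ocr_output_base "$1")" -l eng
}

# Only attempt the run if tesseract is actually installed.
if command -v tesseract >/dev/null 2>&1; then
    for page in scans/*.tif; do
        # Skip the unexpanded pattern when no .tif files exist.
        [ -e "$page" ] || continue
        ocr_page "$page"
    done
fi
```

Each page then ends up as its own text file (for example
scans/page-01.txt), and the files can be joined with cat afterwards.
For a noisy typewritten scan, the result will probably still need
proofreading by hand.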

-- 
Joachim Backes <joachim.backes at rhrk.uni-kl.de>

http://www.rhrk.uni-kl.de/~backes
