A question on OCR for bad old document?

Sun Jun 13 15:28:41 UTC 2010

On 06/13/2010 04:28 AM, Joachim Backes wrote:
> On 06/13/10 09:29, Joel Rees wrote:
>    
>> On Mon, Jun 7, 2010 at 8:25 AM, Jim <mickeyboa at sbcglobal.net> wrote:
>>      
>>> On 06/06/2010 05:19 PM, Frank Cox wrote:
>>>        
>>>> On Sun, 2010-06-06 at 22:01 +0100, mike cloaked wrote:
>>>>
>>>>          
>>>>> I have a scanned pdf of a very old document which was typewritten
>>>>> about half a century ago. The scanned copy is noisy and the letters
>>>>> are far from clear. The text can be made out (mostly) by eye, but it
>>>>> is 19 pages long and I would like to OCR it to get a digitised text to
>>>>> save the eye strain and lots of typing.
>>>>>
>>>>>            
>>>> You can't make a silk purse out of a sow's ear.
>>>>
>>>> If you are having difficulty reading the scan yourself, then you're
>>>> probably out of luck getting the computer to OCR it for you.
>>>>
>>>> Your best bet is to retype it.  It's only 19 pages so it shouldn't take
>>>> too long to type it again.  You'll spend far more time fiddling around
>>>> (unsuccessfully) with OCR stuff than it will take to retype it anyway.
>>>>
>>>>          
>>> Scanning a text document is not going to save properly in Xsane on Linux,
>>> even if you use "gocr".
>>> Scanning and "Saving Text" is broken.
>>>
>>> As for how the text looks on your terminal after scanning, it always
>>> looks bad. You have to "Save As" to get a good finished product, and again,
>>> "Save As" text is broken in Xsane. Only images turn out after saving.
>>>        
>> Can you use the "copy/paste" (Select the text and Edit->Copy) pipe?
>>
>> (I suppose I should grab the current ocr downloads and give them a
>> try. I have to say, it seems like about four years ago, all the open
>> source ocr projects just stopped moving.)
>>      
> I'm using *tesseract* to extract text from a TIFF file (scanned with xsane
> into .tif) and I get good results:
>
> yum install tesseract
>
>    
What tesseract command do you use to extract the text from the tiff file?
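
For what it's worth, as far as I know the basic command-line form is
"tesseract <image> <output base> -l <language>", and tesseract appends
".txt" to the output base name itself. Below is a minimal sketch of running
it over a set of scanned pages. It assumes the pages were saved from xsane as
page01.tif, page02.tif, and so on (hypothetical file names) and that the
English language data ("eng") is installed. The helper function names are
made up for illustration.

```shell
#!/bin/sh
# Sketch: OCR each scanned TIFF page with tesseract.
# Assumptions: pages are named page*.tif (hypothetical names) and the
# English language data for "-l eng" is installed.

# Derive the output base name: strip a trailing .tiff or .tif.
# tesseract appends ".txt" to this base name on its own.
outbase_for() {
    base=${1%.tiff}
    printf '%s\n' "${base%.tif}"
}

# Run tesseract on a single image: tesseract <image> <outbase> -l <lang>
ocr_tiff() {
    tesseract "$1" "$(outbase_for "$1")" -l eng
}

# Only process pages when tesseract is actually installed.
if command -v tesseract >/dev/null 2>&1; then
    for f in page*.tif; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        ocr_tiff "$f"
    done
fi
```

With the names assumed above, page01.tif would produce page01.txt next to
it. On a noisy scan like the one described, the result will likely still need
proofreading by hand.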