Craig White wrote:
On Mon, 2008-06-30 at 16:26 +0100, Paul Smith wrote:
> On Sat, Jun 28, 2008 at 5:32 PM, Bob Goodwin USA
> <bobgoodwin(a)wildblue.net> wrote:
>> fred smith wrote:
>>>>>> Is there an F8 application that will convert a .png copy of a text list
>>>>>> to a text file?
>>>>> ----
>>>>> png is a picture file and there is no text.
>>>>>
>>>>> If you want OCR (optical character recognition - software that scans a
>>>>> picture for recognizable text and saves the recognized text to a file),
>>>>> I would suggest tesseract.
>>>> Thanks, I will look at that.
>>>>
>>> I believe that Tesseract only understands TIF files, so you will need
>>> to convert the png before you can OCR them.
>>>
>>>
>> Yes, I discovered that requirement but now I am stumped by -
>>
>> The command line is:
>> tesseract <image.tif> <output> [-l langid]
>>
>> I thought "-l enUS" might work but no go there.
>>
>> There's no man page, only a README and that doesn't tell me about the langid
>> other than it wants it. Without it I get very strange looking text.
> Unfortunately, the OCR programs working in Linux are not very good
> yet. In case you have access to Acrobat Professional, use it instead;
> the results are usually excellent.
----
I've never used Acrobat Professional for OCR but I have gotten excellent
results from tesseract on Linux.
OP should check out...
http://www.groklaw.net/article.php?story=20061210115516438&query=tess...
http://www.linuxjournal.com/article/9676
I do something similar, non-OCR but working with scanned text, and I
use the netpbm package. First I convert the original format to a
greyscale image (aka pgm), then convert that to a bilevel image (aka
black and white) with "pgmtopbm -thr", setting the transition value
as needed (-val option). Those images are then easily converted to
tif or whatever you need, in my case jbig images for best
compression.
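
For anyone wondering what the threshold step actually does to the pixels,
here is a minimal Python sketch of the idea. It is not netpbm's code: the
function name threshold_pgm is made up for illustration, it only reads the
plain-text (P2) flavour of pgm, and the rule for a pixel sitting exactly on
the cutoff is my assumption, so check the pgmtopbm documentation before
relying on edge cases.

```python
def threshold_pgm(pgm_text, value=0.5):
    """Turn plain (P2) PGM text into plain (P1) PBM text by thresholding.

    `value` is a fraction of the image's maxval, loosely modelled on the
    transition value described above. Hypothetical helper, not netpbm's API.
    """
    # Strip '#' comments, then split everything into whitespace tokens.
    tokens = []
    for line in pgm_text.splitlines():
        tokens.extend(line.split("#", 1)[0].split())

    if not tokens or tokens[0] != "P2":
        raise ValueError("expected a plain PGM (P2) image")

    width, height, maxval = (int(t) for t in tokens[1:4])
    pixels = [int(t) for t in tokens[4:4 + width * height]]
    if len(pixels) != width * height:
        raise ValueError("image data is shorter than the header claims")

    # In PBM, 1 means black and 0 means white, so dark pixels map to 1.
    # Assumption: a pixel strictly below the cutoff counts as black.
    cutoff = value * maxval
    rows = []
    for y in range(height):
        row = pixels[y * width:(y + 1) * width]
        rows.append(" ".join("1" if p < cutoff else "0" for p in row))

    return "P1\n%d %d\n%s\n" % (width, height, "\n".join(rows))


if __name__ == "__main__":
    sample = "P2\n# tiny test image\n3 2\n255\n0 100 200\n255 127 128\n"
    print(threshold_pgm(sample, value=0.5), end="")
```

Raising the value turns more of the grey pixels black, lowering it turns
more of them white, which is the knob to play with when a faint scan loses
strokes or a dirty one picks up speckle.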
--
Bill Davidsen <davidsen(a)tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot