PDF to text?

Bob Goodwin bobgoodwin at wildblue.net
Fri Aug 12 23:02:53 UTC 2011


On 12/08/11 18:25, Cameron Simpson wrote:
> On 12Aug2011 12:09, Bob Goodwin<bobgoodwin at wildblue.net>  wrote:
> | On 12/08/11 12:04, Genes MailLists wrote:
> |>  On 08/12/2011 11:58 AM, Bob Goodwin wrote:
> |>>  On 12/08/11 11:22, Genes MailLists wrote:
> |>>>  On 08/12/2011 11:16 AM, Madhav Ancha wrote:
> |>>>      You could try this Fedora app:  pdftotext
> |>>>
> |>>           As can be seen, I tried several combinations. I thought perhaps it
> |>>           couldn't handle the file name in quotes ("Courier etc."), but nothing
> |>>           seems to do it.
> |>>
> |>     Is it possible the PDF contains an image of the text rather than text
> |>  itself ?
> |
> |         I'm not sure; how would I tell? It's an attachment to an HTML
> |         cover letter. The Fedora default app displays it with no
> |         complaints.
>
> Is it ridiculously large for the amount of text? Does it seem to have
> scanner artifacts in the text - "graininess" if you peer closely, fuzzy
> text instead of perfectly formed letters (i.e. a picture of text instead
> of text rendered by your computer from a font)?
>
> Personally I use pdftohtml to convert PDFs (then an HTML-to-text
> pipeline on the end of that). Possibly pdftotext does exactly that
> anyway. Of course it achieves nothing for me if the PDF is a scan.
>
> Cheers,

        It's a scan.
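
        A rough way to check this without peering at the pixels is to see
        whether pdftotext extracts any visible text at all. The sketch below
        assumes Python and that pdftotext is on the PATH; the helper names
        and the 20-character threshold are made up for illustration, not part
        of any tool.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext on pdf_path; the '-' argument sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def looks_like_scan(text, min_chars=20):
    """Guess that a PDF is a scan if almost no visible text came out.

    min_chars is an arbitrary, hypothetical threshold. Whitespace and the
    form feeds that separate pages are not counted as visible text.
    """
    visible = "".join(text.split())
    return len(visible) < min_chars
```

        Usage would be something like looks_like_scan(extract_text("Courier.pdf")),
        where the file name is only a guess at what the attachment is called.
        A scanned PDF that carries a hidden OCR text layer would not be
        flagged by this check.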

        pdftohtml seems to have produced jpeg as well as html files.

            -rw-rw-r--. 1 bobg bobg  321444 Aug 12 18:37 Courier-1_1.jpg
            -rw-rw-r--. 1 bobg bobg  309493 Aug 12 18:37 Courier-2_1.jpg
            -rw-rw-r--. 1 bobg bobg     461 Aug 12 18:37 Courier.html
            -rw-rw-r--. 1 bobg bobg     244 Aug 12 18:37 Courier_ind.html

        The HTML files display as a couple of boxes. The JPEGs are sharp
        reproductions of the text and can be converted to text with gocr,
        but the quality of that text leaves much to be desired. I might
        be able to work it over with a dictionary to fill in the missing
        words, "missing" meaning words that come out as gibberish.
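
        A minimal sketch of what such a dictionary pass might look like,
        assuming Python and a word list such as /usr/share/dict/words (that
        path, the function names, and the 0.8 similarity cutoff are all
        assumptions for illustration). It replaces each unknown word with its
        closest dictionary entry and leaves the word alone when nothing is
        close enough.

```python
import difflib
import re


def load_words(path="/usr/share/dict/words"):
    """Load a word list, one word per line (the path is an assumption)."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def repair_word(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or the word itself if none is close."""
    lower = word.lower()
    # Leave known words and plain numbers (dates, amounts) untouched.
    if lower in vocabulary or word.isdigit():
        return word
    matches = difflib.get_close_matches(lower, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def repair_text(text, vocabulary, cutoff=0.8):
    """Repair every alphanumeric token in text, keeping punctuation and spacing."""
    return re.sub(
        r"[A-Za-z0-9]+",
        lambda m: repair_word(m.group(0), vocabulary, cutoff),
        text,
    )
```

        For example, repair_text("c0urier lettcr", {"courier", "letter"})
        gives "courier letter". Replacements come back in lower case, and a
        word that is pure gibberish will usually have no close match and stay
        as it is, so this only helps with words that are nearly right.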

        Thanks, I'll have a go at that later.

        Bob




