Convert PDF to Text?

Keith G. Robertson-Turner fedora-gmane.00003 at genesis-x.nildram.co.uk
Sun Apr 22 00:33:32 UTC 2007


Verily I say unto thee, that bdk at unb.ca spake thusly:
> I think pdftohtml is part of
> 
> poppler-utils

Got it, thanks.

However, now there's another problem - it doesn't really work.

All it produces is "empty" HTML files: they are well-formed HTML
(head, body, etc.), but the actual content is missing.

IOW, it looks like pdftohtml can only work if the content of the PDF
really is text, and not a scanned image of text.

Selecting and copying the text definitely works with Evince; I just wish
there were a way to automate it with a batch script, rather than having
to copy and paste the text out of 2000 documents by hand.

Here's the original PDF file:

http://antitrust.slated.org/www.iowaconsumercase.org/011607/0000/PX00111.pdf

And here's a video of Evince "OCRing" the text from the image:

http://media.slated.org/albums/userpics/Evince_podit.mp4 (H264 MP4)

Download the PDF and try it yourself.

It's bizarre. Surely there's a way to automate this?
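
For what it's worth, here is the sort of batch job I have in mind. It is
only a minimal sketch in Python, and it rests on two assumptions: that
these PDFs carry a hidden text layer behind the scanned image (which
would explain why selecting text in Evince works), and that pdftotext
from poppler-utils is on the PATH. The function names and the two
directory arguments are made up for illustration. It performs no OCR of
its own; it only pulls out text that is already stored in the PDF, and
lists the files where nothing came out.

```python
import subprocess
import sys
from pathlib import Path


def output_path(pdf: Path, out_dir: Path) -> Path:
    """Map e.g. PX00111.pdf to <out_dir>/PX00111.txt (hypothetical naming)."""
    return out_dir / (pdf.stem + ".txt")


def convert_all(src_dir: Path, out_dir: Path, tool: str = "pdftotext") -> list:
    """Run the tool once per PDF; return the PDFs that yielded no text."""
    out_dir.mkdir(parents=True, exist_ok=True)
    problems = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        txt = output_path(pdf, out_dir)
        # Invoked as: pdftotext <input.pdf> <output.txt>
        result = subprocess.run([tool, str(pdf), str(txt)])
        if result.returncode != 0 or not txt.exists():
            problems.append(pdf)
            continue
        # Output that is only whitespace (page breaks included) suggests
        # the PDF has no text layer, i.e. it really is just an image.
        if txt.read_text(errors="ignore").strip() == "":
            problems.append(pdf)
    return problems


# Assumed usage: python batch_pdf.py <dir-with-pdfs> <dir-for-text>
if __name__ == "__main__" and len(sys.argv) == 3:
    for pdf in convert_all(Path(sys.argv[1]), Path(sys.argv[2])):
        print("no text extracted:", pdf)
```

Anything the script reports as "no text extracted" would presumably
still need a real OCR pass.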

TIA.

-- 
K.
http://slated.org

.----
| I found [Vista] to be a dangerously unstable operating system,
| which has caused me to lose data ... unfortunately this product
| is unfit for any user. - [H]ardOCP, <http://tinyurl.com/3bpfs2>
`----

Fedora Core release 5 (Bordeaux) on sky, running kernel 2.6.20-1.2312.fc5
 01:31:48 up 4 days, 23:03,  3 users,  load average: 0.57, 0.52, 0.54



