Copying text from a protected pdf file

Paul Smith phhs80 at gmail.com
Thu Sep 15 22:45:11 UTC 2005


On 9/15/05, Deron Meranda <deron.meranda at gmail.com> wrote:
> > > > > > I have got a pdf file, whose text I would like to copy to a word
> > > > > > processor. However, it seems to be protected, as when I copy and paste
> > > > > > a piece of text from there into a word processor, I only see garbage.
> ...
> > Thanks, Leonard. I have just checked: the pdf file is not copy
> > protected, but, even so, what I can copy into a word processor is
> > garbage. It may be something related to encodings.
> 
> It could be encodings.  Text in PDF is really only in terms of glyphs,
> not characters, which makes text extraction particularly difficult
> and font-specific.  Fortunately there are a few standard PDF encodings
> defined by Adobe (these map "characters" to glyphs, and are not
> quite the same things as you'd think of an "encoding" being), but
> each PDF file can create its own custom encodings as well and
> visually you'd see nothing different.  There's also nothing to keep
> the "text" in a PDF file from being written weird (such as writing
> from right-to-left) since it's just graphics instructions; but most PDF
> generating programs do it in the obvious way.
> 
> You might want to look at the "pdftotext" program (which is part of
> the xpdf package, obsoleted in FC4).  It generally can do a good job
> of extracting text.
> 
> Just some more information... are your documents generally
> written in English (or use the English alphabet)?  And are they more
> like plain prose (paragraphs of text), or fanciful like marketing materials
> with lots of interspersed graphics, panels, and so forth?

Thanks, Deron. My documents are not written in English, and they only
have text and tables, apparently created with MS Windows. pdftotext
and pdftohtml do not produce good or reasonable results.

Paul
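
To illustrate the glyph-versus-character point quoted above, here is a
minimal Python sketch. It is not how pdftotext or any particular PDF
viewer works; the tables CUSTOM_ENCODING and TO_UNICODE and the functions
naive_copy and mapped_copy are hypothetical names invented for this
example. The assumption is a font whose custom encoding assigns arbitrary
codes to glyphs: the page looks correct, but copying the raw codes gives
garbage unless a mapping back to Unicode (in a real PDF, a /ToUnicode
CMap) is available.

```python
from typing import Dict

# Hypothetical custom encoding chosen by an imaginary PDF generator.
# Each code in the page content selects a glyph by name; the codes are
# arbitrary and are NOT character codes.
CUSTOM_ENCODING: Dict[int, str] = {0x01: "O", 0x02: "l", 0x03: "aacute"}

# Hypothetical map from the same codes back to Unicode text, playing the
# role that a /ToUnicode CMap plays in a real PDF (assumed present here).
TO_UNICODE: Dict[int, str] = {0x01: "O", 0x02: "l", 0x03: "\u00e1"}


def naive_copy(codes: bytes) -> str:
    """Treat the raw glyph codes as Latin-1 characters (a naive copy)."""
    return codes.decode("latin-1")


def mapped_copy(codes: bytes, to_unicode: Dict[int, str]) -> str:
    """Translate glyph codes to text; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# The codes as they would appear in the page; the viewer draws the glyphs
# O, l, aacute, so the reader sees "Olá" on screen.
shown_text = bytes([0x01, 0x02, 0x03])

if __name__ == "__main__":
    print("glyphs drawn:", [CUSTOM_ENCODING[c] for c in shown_text])
    print("naive copy:  ", repr(naive_copy(shown_text)))
    print("mapped copy: ", mapped_copy(shown_text, TO_UNICODE))
```

In this sketch the naive copy yields control characters, while the mapped
copy recovers the text. If a file has no usable mapping back to Unicode,
an extraction tool has little to go on, which may be one reason the
results look poor for non-English documents.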



