Copying text from a protected pdf file

Paul Smith phhs80 at gmail.com
Fri Sep 16 12:33:24 UTC 2005


On 9/16/05, Antonio Olivares <olivares14031 at yahoo.com> wrote:
> > > > > > > > I have got a pdf file, whose text I would like to copy to
> > > > > > > > a word processor. However, it seems to be protected, as
> > > > > > > > when I copy and paste a piece of text from there into a
> > > > > > > > word processor, I only see garbage.
> > > ...
> > > > Thanks, Leonard. I have just checked: the pdf file is not
> > > > copy-protected, but, even so, what I can copy into a word
> > > > processor is garbage. It may be something related to encodings.
> > >
> > > It could be encodings.  Text in PDF is really only in terms of glyphs,
> > > not characters, which makes text extraction particularly difficult
> > > and font-specific.  Fortunately there are a few standard PDF encodings
> > > defined by Adobe (these map "characters" to glyphs, and are not
> > > quite the same thing as what you'd think of as an "encoding"), but
> > > each PDF file can create its own custom encodings as well, and
> > > visually you'd see nothing different.  There's also nothing to keep
> > > the "text" in a PDF file from being written in a weird order (such
> > > as from right to left), since it's just graphics instructions; but
> > > most PDF-generating programs do it in the obvious way.
> > >
> > > You might want to look at the "pdftotext" program (which is part of
> > > the xpdf package, obsoleted in FC4).  It generally can do a good job
> > > of extracting text.
> > >
> > > Just some more information... are your documents generally
> > > written in English (or use the English alphabet)?  And are they more
> > > like plain prose (paragraphs of text), or fanciful like marketing
> > > materials with lots of interspersed graphics, panels, and so forth?
> >
> > Thanks, Deron. My documents are not written in English, and they only
> > have text and tables, apparently created with MS Windows. pdftotext
> > and pdftohtml do not produce good or reasonable results.
> 
> Have you tried converting your file to postscript and
> then using ps2ascii or something similar?

Yes, Antonio, I tried that, but with no better results.
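
For reference, the two routes discussed in this thread can be chained
roughly as follows. This is only a minimal sketch: it assumes that
pdftotext, pdf2ps (from Ghostscript) and ps2ascii are installed and on
the PATH, and the file name input.pdf and the helper names
build_commands and run_route are placeholders made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path):
    """Return the command lines for both extraction routes.

    The output file names are arbitrary choices for this sketch.
    """
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")
    direct_txt = pdf.with_name(pdf.stem + "-pdftotext.txt")
    ps_txt = pdf.with_name(pdf.stem + "-ps2ascii.txt")
    return {
        # Route 1: extract text straight from the PDF.
        "direct": [
            ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf), str(direct_txt)],
        ],
        # Route 2: convert to PostScript first, then extract from that.
        "via_postscript": [
            ["pdf2ps", str(pdf), str(ps)],
            ["ps2ascii", str(ps), str(ps_txt)],
        ],
    }


def run_route(commands):
    """Run each command in order; return False if a tool is missing or fails."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            print(f"{cmd[0]} not found on PATH, skipping this route")
            return False
        if subprocess.run(cmd).returncode != 0:
            return False
    return True
```

Calling run_route(build_commands("input.pdf")["direct"]) and then the
same with "via_postscript" leaves two text files that can be compared
side by side. If both contain the same garbage, the problem is more
likely in the fonts' encodings than in either tool.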

Paul
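
P.S. The glyph-versus-character point quoted above can be illustrated
with a toy model. This is not how any real PDF library works; it only
sketches, under simplifying assumptions, why copied text turns into
garbage when a font uses a custom encoding and the extractor assumes a
standard one. All function names and the sample string are made up for
the example.

```python
def build_custom_encoding(text):
    """Toy stand-in for an embedded subset font: codes are handed out in
    order of first use, so they bear no relation to ASCII or Latin-1."""
    encoding = {}
    for ch in text:
        if ch not in encoding:
            encoding[ch] = 0x21 + len(encoding)
    return encoding


def encode_with_font(text, encoding):
    """What ends up in the file: a sequence of glyph codes."""
    return [encoding[ch] for ch in text]


def standard_decode(codes):
    """What a naive extractor does: treat the codes as Latin-1 bytes."""
    return bytes(codes).decode("latin-1")


def decode_with_map(codes, encoding):
    """What is possible when the code-to-character map is available."""
    reverse = {code: ch for ch, code in encoding.items()}
    return "".join(reverse[c] for c in codes)


text = "Olá, mundo"  # arbitrary non-English sample
enc = build_custom_encoding(text)
codes = encode_with_font(text, enc)

garbage = standard_decode(codes)         # punctuation soup, not the text
recovered = decode_with_map(codes, enc)  # the original string again

print(repr(garbage))
print(repr(recovered))
```

On screen both versions would render identically, because the viewer
only draws glyphs. Real PDF fonts can carry a ToUnicode map that plays
the role of decode_with_map here; when it is missing or wrong, the
extraction tools have little to go on.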



