Copying text from a protected pdf file

Wed Sep 21 12:01:14 UTC 2005

On 9/21/05, George White <aa056 at chebucto.ns.ca> wrote:
> > > > > I have got a pdf file, whose text I would like to copy to a word
> > > > > processor. However, it seems to be protected, as when I copy and paste
> > > > > a piece of text from there into a word processor, I only see garbage.
> > > > > Is there some way of getting clean text from the pdf file?
> > > >
> > > > The PDF format has many ways to display text.  To be able to extract
> > > > text you need a file that stores strings and uses font information to
> > > > render them in the viewer.  You may be seeing images that were
> > > > rasterized long ago.  You should provide the output of the "pdffonts"
> > > > command, preferably for a minimal document (a big document could
> > > > combine sections that use fonts with images).
> > > >
> > > > For example, the simplest case is a document that uses the PostScript
> > > > Type 1 fonts provided by the viewer:
> > > >
> > > > $ pdffonts /usr/share/doc/cups-1.1.20/ssr.pdf
> > > > name                                 type         emb sub uni object ID
> > > > ------------------------------------ ------------ --- --- --- ---------
> > > > Times-Roman                          Type 1       no  no  no       4  0
> > > > Helvetica                            Type 1       no  no  no       7  0
> > > > Helvetica-Bold                       Type 1       no  no  no       8  0
> > > > Times-Bold                           Type 1       no  no  no       5  0
> > > > Courier                              Type 1       no  no  no       3  0
> > > > Symbol                               Type 1       no  no  no       9  0
> > > > Times-Italic                         Type 1       no  no  no       6  0
> > >
> > > Thanks, George. In my case,
> > >
> > > $ pdffonts myfile.pdf
> > > name                                 type         emb sub uni object ID
> > > ------------------------------------ ------------ --- --- --- ---------
> > > DTUUBE+TTBC19E318t00                 TrueType     yes yes no      13  0
> > > URMVBE+TTBC18C910t00                 TrueType     yes yes no      16  0
> > > TOYVBE+Symbol                        Type 1C      yes yes no      19  0
> > > Helvetica                            Type 1C      yes no  no      22  0
> > > CLLUBE+TTBC1802E0t00                 TrueType     yes yes no      34  0
> > > Helvetica-Bold                       Type 1C      yes no  no      43  0
> > > Helvetica-Oblique                    Type 1C      yes no  no      58  0
> > > $
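A quick way to pick out the fonts likely to cause trouble is to filter the
pdffonts listing for subsets that have no ToUnicode map (the "uni" column).
The Python sketch below assumes the column layout shown in this thread and
font names without spaces. The names parse_pdffonts and risky_fonts are
illustrative only and not part of xpdf, and other pdffonts versions may print
extra columns that this simple split would not handle.

```python
def parse_pdffonts(output):
    """Parse pdffonts output (layout as shown in this thread) into dicts."""
    fonts = []
    for line in output.splitlines():
        tokens = line.split()
        # A data row ends with: emb sub uni object-number generation.
        # Anything else (header, dashed ruler, shell prompt) is skipped.
        if len(tokens) < 7 or not set(tokens[-5:-2]) <= {"yes", "no"}:
            continue
        fonts.append({
            "name": tokens[0],                 # assumes no spaces in the name
            "type": " ".join(tokens[1:-5]),    # e.g. "Type 1C" or "TrueType"
            "emb": tokens[-5] == "yes",
            "sub": tokens[-4] == "yes",
            "uni": tokens[-3] == "yes",
            "object": (int(tokens[-2]), int(tokens[-1])),
        })
    return fonts


def risky_fonts(fonts):
    """Names of subsetted fonts that carry no ToUnicode map."""
    return [f["name"] for f in fonts if f["sub"] and not f["uni"]]


if __name__ == "__main__":
    sample = """\
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
DTUUBE+TTBC19E318t00                 TrueType     yes yes no      13  0
Helvetica                            Type 1C      yes no  no      22  0
"""
    print(risky_fonts(parse_pdffonts(sample)))
```

Run against the listing for myfile.pdf above, this would report the four
subsetted fonts.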
> >
> > Is it possible to find the missing fonts to install them?
>
> Do you have a friend at the No Such Agency?
>
> The four embedded subsets will be a problem.  When you extract text from a PDF
> file you don't get encoding or font information, so even if the fonts are
> installed you would have to manually assign the font to each fragment.  A
> subsetted font may not use any recognizable encoding.   I have some where it
> appears that the subsets are encoded starting with ASCII control-character
> codes (e.g., 0x01, 0x02, ...).  If the document is ordinary text, the
> encoding may amount to a simple substitution code.  Try constructing a
> table by working with short strings from text that seems to be in the same
> font.
>
> I'm looking at a document where "off" becomes "<ACK><BEL><BEL>", so my table
> would have:
>
>   o -> 6  (0x06, ACK)
>   f -> 7  (0x07, BEL)
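
The table-building approach described above can be sketched in Python. This
is only an illustration under the assumption that the font really uses a
one-to-one substitution; build_table and decode are hypothetical helper
names, and the table is stored in the decoding direction (code -> character),
the inverse of the listing above.

```python
def build_table(pairs):
    """Build a code -> character table from (garbled, plaintext) samples.

    Each pair holds a pasted garbled string and the text it is known to
    represent, both taken from the same font.
    """
    table = {}
    for garbled, plain in pairs:
        if len(garbled) != len(plain):
            raise ValueError("sample lengths differ: %r vs %r" % (garbled, plain))
        for code, char in zip(garbled, plain):
            # A conflicting entry means this is not a simple substitution.
            if table.setdefault(code, char) != char:
                raise ValueError("code %r maps to more than one character" % code)
    return table


def decode(text, table, unknown="?"):
    """Replace each code with its character; unmapped codes become `unknown`."""
    return "".join(table.get(code, unknown) for code in text)


if __name__ == "__main__":
    # "off" pasted as <ACK><BEL><BEL>, i.e. 0x06 0x07 0x07.
    table = build_table([("\x06\x07\x07", "off")])
    print(decode("\x07\x06\x06", table))
```

A separate table would be needed for each subsetted font, since each subset
may assign its codes differently.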

Thanks, George. Now I understand how complicated it is to achieve my
goal, so it is better to give up!

Paul