Copying text from a protected pdf file

Paul Smith phhs80 at gmail.com
Sat Sep 17 21:53:35 UTC 2005


On 9/16/05, George White <aa056 at chebucto.ns.ca> wrote:
> > I have got a pdf file, whose text I would like to copy to a word
> > processor. However, it seems to be protected, as when I copy and paste
> > a piece of text from there into a word processor, I only see garbage.
> > Is there some way of getting clean text from the pdf file?
> 
> The PDF format has many ways to display text.  To be able to extract text
> you need a file that stores strings and uses font information to render them
> in the viewer.  You may be seeing images that were rasterized long ago.
> You should provide the output of the "pdffonts" command, preferably for a
> minimal document (a big document could combine sections that use fonts with
> images).
> 
> For example, the simplest case is a document that uses the PostScript Type 1
> fonts provided by the viewer:
> 
> $ pdffonts /usr/share/doc/cups-1.1.20/ssr.pdf
> name                                 type         emb sub uni object ID
> ------------------------------------ ------------ --- --- --- ---------
> Times-Roman                          Type 1       no  no  no       4  0
> Helvetica                            Type 1       no  no  no       7  0
> Helvetica-Bold                       Type 1       no  no  no       8  0
> Times-Bold                           Type 1       no  no  no       5  0
> Courier                              Type 1       no  no  no       3  0
> Symbol                               Type 1       no  no  no       9  0
> Times-Italic                         Type 1       no  no  no       6  0
> 
> 
> --
> George N. White III
> Head of St. Margarets Bay, Nova Scotia

Thanks, George. In my case,

$ pdffonts myfile.pdf
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
DTUUBE+TTBC19E318t00                 TrueType     yes yes no      13  0
URMVBE+TTBC18C910t00                 TrueType     yes yes no      16  0
TOYVBE+Symbol                        Type 1C      yes yes no      19  0
Helvetica                            Type 1C      yes no  no      22  0
CLLUBE+TTBC1802E0t00                 TrueType     yes yes no      34  0
Helvetica-Bold                       Type 1C      yes no  no      43  0
Helvetica-Oblique                    Type 1C      yes no  no      58  0
$    
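
The "uni" column reports whether a font carries a ToUnicode map, and
subset fonts without one are a common reason for pasted text coming out
as garbage. As a rough sketch only (Python; the helper names
parse_pdffonts and risky_fonts are hypothetical and not part of pdffonts
or xpdf), the listing above could be screened like this, assuming font
names contain no spaces:

```python
# Sketch: flag fonts in pdffonts output that are subsetted and lack a
# ToUnicode map. Helper names are illustrative, not a real pdffonts API.

SAMPLE = """\
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
DTUUBE+TTBC19E318t00                 TrueType     yes yes no      13  0
URMVBE+TTBC18C910t00                 TrueType     yes yes no      16  0
TOYVBE+Symbol                        Type 1C      yes yes no      19  0
Helvetica                            Type 1C      yes no  no      22  0
CLLUBE+TTBC1802E0t00                 TrueType     yes yes no      34  0
Helvetica-Bold                       Type 1C      yes no  no      43  0
Helvetica-Oblique                    Type 1C      yes no  no      58  0
"""


def parse_pdffonts(output):
    """Parse pdffonts table text into a list of dicts, one per font."""
    fonts = []
    for line in output.splitlines():
        tokens = line.split()
        # Skip blank lines, the header row, the dashed rule and stray prompts.
        if len(tokens) < 6 or tokens[0] == "name" or tokens[0].startswith("-"):
            continue
        # Read the flags from the right-hand side, because the type column
        # may itself contain spaces ("Type 1C", "CID TrueType", ...).
        emb, sub, uni = tokens[-5], tokens[-4], tokens[-3]
        if not all(flag in ("yes", "no") for flag in (emb, sub, uni)):
            continue
        fonts.append({
            "name": tokens[0],
            "emb": emb == "yes",
            "sub": sub == "yes",
            "uni": uni == "yes",
        })
    return fonts


def risky_fonts(fonts):
    """Return names of subset fonts that have no ToUnicode map."""
    return [f["name"] for f in fonts if f["sub"] and not f["uni"]]


for name in risky_fonts(parse_pdffonts(SAMPLE)):
    print(name)
```

For the listing above this prints the four subset fonts (the three
TrueType ones and TOYVBE+Symbol). That is only a hint about where the
garbage may come from, not proof; the Helvetica fonts are embedded in
full, so they are less likely to be the culprits.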

Paul



