pdftohtml encoding question

Andras Simon szajmi at gmail.com
Tue Mar 11 12:40:21 UTC 2008


On 3/10/08, François Patte <francois.patte at math-info.univ-paris5.fr> wrote:
> Good evening,
>
> I am trying to convert a pdf file into html using pdftohtml provided by f8.
>
> I get an html file with "nice" characters like: â€™ instead of an apostrophe,
> or Ã© instead of é...
>
> So I think that there is some encoding problem.
>
> Using man pdftohtml, I got this info:
> -enc <string>
>     output text encoding name
>
>
> but I am unable to work out the syntax needed to get correct
> UTF-8 output, because:
>
> Error: Couldn't find unicodeMap file for the 'utf8' encoding
>
> is the only answer I get if I try:
>
> pdftohtml -enc utf8 myfile.pdf
>
>
> I tried utf-8, latin1, latin-1, ISO_8859-1, ... without any success.
>
>
> If somebody knows... many thanks in advance.

I don't, but

man pdftohtml

 ->  Pdftohtml was developed by Gueorgui Ovtcharov and Rainer Dorsch. It  is
     based and benefits a lot from Derek Noonburg's xpdf package.

man xpdf

 ->  -enc encoding-name
          Sets the encoding to use for  text  output.   The  encoding-name
          must  be  defined  with  the unicodeMap command (see xpdfrc(5)).
          This defaults to "Latin1" (which is a built-in encoding).
          [config file: textEncoding]

man xpdfrc

 ->  unicodeMap encoding-name map-file
          [...]
          The Latin1, ASCII7, Symbol, ZapfDingbats,  UTF-8,  and
          UCS-2 encodings are predefined.

I'm afraid you'll have to read the elided part if you need an encoding
other than these six.
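
The xpdfrc excerpt spells the predefined names in a particular way ("UTF-8", "Latin1"), and none of the spellings tried in the original message matches exactly, so passing the listed spelling may be worth a try. The Python sketch below only illustrates that idea. PREDEFINED copies the six names from the excerpt. canonical_name and build_command are hypothetical helpers. Whether this pdftohtml build accepts these names, or matches them case-sensitively, is an assumption that the quoted man pages do not confirm.

```python
import re

# The six encoding names that the xpdfrc man page excerpt lists as predefined.
PREDEFINED = ["Latin1", "ASCII7", "Symbol", "ZapfDingbats", "UTF-8", "UCS-2"]


def _key(name):
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def canonical_name(guess):
    """Return the listed spelling for a loosely written name, or None.

    Hypothetical helper: it only compares against the six listed names.
    """
    wanted = _key(guess)
    for name in PREDEFINED:
        if _key(name) == wanted:
            return name
    return None


def build_command(pdf_path, guess):
    """Build an argument list for pdftohtml using the listed spelling.

    Assumption: the tool accepts the listed names via -enc as spelled.
    """
    name = canonical_name(guess)
    if name is None:
        raise ValueError("not one of the predefined encodings: %r" % guess)
    return ["pdftohtml", "-enc", name, pdf_path]


if __name__ == "__main__":
    # Prints the command line instead of running it.
    print(" ".join(build_command("myfile.pdf", "utf8")))
```

Of the spellings tried in the original message, utf8 and utf-8 map to UTF-8, latin1 and latin-1 map to Latin1, and ISO_8859-1 is not among the six.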

Hope this helps,

Andras
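
A side note on the symptoms in the quoted message: a sequence like Ã© in place of é is what UTF-8 bytes look like when they are displayed as Latin-1 or Windows-1252. The short Python sketch below reproduces that pattern. The function name misread is made up for illustration, and the sketch makes no claim about what pdftohtml or the viewer actually did in this case.

```python
def misread(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


if __name__ == "__main__":
    # e-acute and the right single quotation mark (typographic apostrophe)
    for ch in ["\u00e9", "\u2019"]:
        print(repr(ch), "->", repr(misread(ch)))
```

If the output file shows this pattern, the bytes may already be valid UTF-8 and only the declared or assumed charset of the viewer is off.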



