pdftohtml encoding question

Tue Mar 11 12:40:21 UTC 2008

On 3/10/08, François Patte <francois.patte at math-info.univ-paris5.fr> wrote:
> Good evening,
>
> I am trying to convert a pdf file into html using pdftohtml provided by f8.
>
> I get an html file with "nice" characters like: â€™ instead of an apostrophe,
> or Ã© instead of é...
>
> So I think there is some encoding problem.
>
> Using man pdftohtml, I got this info:
> -enc <string>
>        output text encoding name
>
>
> but I am unable to guess what syntax to use in order to get
> correct output in utf8, because
>
> Error: Couldn't find unicodeMap file for the 'utf8' encoding
>
> is the only answer I get if I try:
>
> pdftohtml -enc utf8 myfile.pdf
>
>
> I tried utf-8, latin1, latin-1, ISO_8859-1, ... without any success.
>
>
> If somebody knows... many thanks in advance.

I don't, but

man pdftohtml

 ->  Pdftohtml was developed by Gueorgui Ovtcharov and Rainer Dorsch. It  is
     based and benefits a lot from Derek Noonburg's xpdf package.

man xpdf

 ->  -enc encoding-name
          Sets the encoding to use for  text  output.   The  encoding-name
          must  be  defined  with  the unicodeMap command (see xpdfrc(5)).
          This defaults to "Latin1" (which is a built-in encoding).
          [config file: textEncoding]

man xpdfrc

 ->  unicodeMap encoding-name map-file
          [...]
          The Latin1, ASCII7, Symbol, ZapfDingbats,  UTF-8,  and
          UCS-2 encodings are predefined.
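
Going by that list, the name to try would be "UTF-8", spelled exactly as
the man page spells it. Here is a small Python sketch that builds the
command line and rejects anything that is not one of the six listed names.
Two things are my own assumptions, not something the man pages say: that
the names are matched case-sensitively (I only infer this from "utf8" and
"utf-8" being rejected), and that pdftohtml accepts the same names as xpdf.
The helper name pdftohtml_command is made up for the example.

```python
import shlex

# The six names the xpdfrc man page lists as predefined.
PREDEFINED_ENCODINGS = ("Latin1", "ASCII7", "Symbol", "ZapfDingbats", "UTF-8", "UCS-2")


def pdftohtml_command(pdf_path, encoding="UTF-8"):
    """Return the pdftohtml argument list for a predefined encoding name.

    Assumption: the name must match the man page spelling exactly,
    since "utf8" and "utf-8" were both reported as failing.
    """
    if encoding not in PREDEFINED_ENCODINGS:
        raise ValueError(
            f"{encoding!r} is not predefined; expected one of {PREDEFINED_ENCODINGS}"
        )
    return ["pdftohtml", "-enc", encoding, pdf_path]


if __name__ == "__main__":
    # Only print the command line; nothing is executed, so pdftohtml
    # does not need to be installed to try this.
    print(shlex.join(pdftohtml_command("myfile.pdf")))
```

Run as a script, this prints the command to try rather than running it.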

I'm afraid you'll have to read the elided part if you need an encoding
other than these six.
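
A side note on the symptoms: the garbled sequences you quote look like
what you get when UTF-8 bytes are displayed as Windows-1252 (or Latin-1).
If so, the html file may already be UTF-8 and the viewer is guessing the
wrong charset. That is only my guess from the characters shown, not
something I have checked against your file. The Python sketch below
reproduces the effect; the function names are made up for the example.

```python
def misread_utf8(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode those bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="cp1252"):
    """Undo the misreading: recover the original bytes, then decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    print(misread_utf8("é"))        # Ã©
    print(misread_utf8("\u2019"))   # â€™  (U+2019 is the typographic apostrophe)
    print(repair("â€™"))            # ’
```

If the output matches what you see in the browser, checking which charset
the html file declares (or forcing the browser to UTF-8) may be enough.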

Hope this helps,

Andras