[Fedora-packaging] file-not-utf8 complaints

Toshio Kuratomi a.badger at gmail.com
Sat May 31 23:09:25 UTC 2008


Patrice Dumas wrote:
> On Fri, May 30, 2008 at 06:56:33PM -0700, Toshio Kuratomi wrote:
>> Reencoding the xml files that specify an encoding isn't strictly  
>> necessary.  We should probably ask upstream whether they are amenable to  
> 
> I think that reencoding files that carry over the encoding information
> (info, texinfo, tex and xml for example) is wrong. It is better to let
> upstream do whatever they want. Same for examples of code, better leave
> the encoding preferred by upstream.
> 
> For NEWS/Changelog, other text files in %doc and also man pages that are
> not installed in a non utf8 locale, I agree that converting to UTF-8 is 
> better.
> 
I'm almost in complete agreement with you.  The one extra piece that I 
think should be considered is how the text is normally viewed/edited.

For instance, if a program has a plain text data file and the program 
expects the data to be encoded in utf-16, that should stay utf-16.  Since 
the end user never views the file and the program has an expectation of 
what's in it, this should be perfectly acceptable.

However, the flipside is a program with an xml config file that the 
user is expected to edit manually in a text editor, where the program 
will adapt to multiple encodings (for instance, by using libxml2 to 
parse the file[1]_).  Having that file exist in utf-8 is much better 
than having it exist in SOME_EXOTIC_ENCODING.  In this case the program 
doesn't care whether the config file is in utf-8 or SHIFT-JIS.  But the 
user who opens the file in a text editor will be presented with garbage 
if the text does not match the system default encoding.  Yes, the user 
can manually change the encoding that is displayed and saved in some 
editors but:

1) Not every editor supports this.

2) The user has to learn how to select the right encoding in their 
editor.  This affects reading, editing, and saving.  Some editors will 
display garbage unless you set the correct encoding on startup; others 
can change while running.  Some convert on open with a best guess at 
what the bytes mean, but you have to specify which encoding to save the 
result in, otherwise you get the default (utf-8 or dependent on your 
locale settings).

3) If the user wants to use characters that are not present in the 
encoding the file is written in (for instance, the file is encoded in 
KOI8-R but the user wants to use kanji), they'll have to convert the 
file to one of the unicode family of encodings and edit the header that 
declares the character set before making their changes.
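
To make points 2 and 3 concrete, here is a rough Python sketch.  The 
config content and the reencode_xml_to_utf8() helper are made up for 
illustration and are not taken from any real package.  It assumes the 
source encoding is already known (KOI8-R here) and that the file starts 
with an xml declaration.

```python
import re

# Hypothetical config content, as upstream might ship it in KOI8-R.
original_text = (
    '<?xml version="1.0" encoding="KOI8-R"?>\n'
    "<config><name>Привет</name></config>\n"
)
koi8_bytes = original_text.encode("koi8_r")

# Roughly what an editor that assumes utf-8 has to work with: the
# Cyrillic bytes are not valid utf-8, so they turn into U+FFFD.
as_seen_in_utf8_editor = koi8_bytes.decode("utf-8", errors="replace")


def reencode_xml_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode with the known source encoding, fix the header, emit utf-8."""
    text = data.decode(source_encoding)
    # Rewrite only the encoding value inside the leading xml declaration.
    text = re.sub(
        r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2',
        r"\g<1>\g<2>UTF-8\g<2>",
        text,
        count=1,
    )
    return text.encode("utf-8")


utf8_bytes = reencode_xml_to_utf8(koi8_bytes, "koi8_r")

# After conversion the user can add kanji, which KOI8-R cannot represent.
edited = utf8_bytes.decode("utf-8").replace("Привет", "Привет 漢字")
edited_bytes = edited.encode("utf-8")
```

Trying edited.encode("koi8_r") instead would raise UnicodeEncodeError, 
which is the situation point 3 describes.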

So really, the dividing line should be whether the user is intended to 
edit/view the file directly instead of through a program that can 
change the encoding appropriately, rather than whether the format 
specifies the encoding or not.

.. _[1]: http://xmlsoft.org/encoding.html#Default

Whether this is something we should do in our packages even if upstream 
doesn't accept the changes involves other factors.  In the case of 
documentation files that do not declare an encoding we should convert 
whether or not upstream agrees.  In the case of documentation that does 
specify the encoding I lean towards converting [2]_.  In the case of a 
file that is used by a program we should definitely have a conversation 
with upstream about it, although we could convert locally with 
upstream's blessing (i.e. upstream says: "I'm going to continue writing 
my xml config file in latin-1.  If you want to convert them to utf-8 
for your users that's fine -- I'm going to continue to use a library 
for xml parsing that understands encodings.")

.. _[2]: Note that this is only for documentation which is supposed to 
be viewed directly.  xhtml, for instance, is normally going to be 
viewed in a browser so this would not apply.
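
For the documentation case, a local conversion could look roughly like 
the Python sketch below.  convert_doc_to_utf8() is a made-up helper, 
not an existing tool, and it assumes the packager already knows the 
source encoding (latin-1 in the demo); nothing here guesses the 
encoding.  Files that already decode as utf-8 are left untouched and 
the timestamps are preserved.

```python
import os
import tempfile


def convert_doc_to_utf8(path: str, source_encoding: str) -> bool:
    """Rewrite path as utf-8; return True if the file was changed."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode("utf-8")
        return False  # already valid utf-8, leave it alone
    except UnicodeDecodeError:
        pass
    converted = raw.decode(source_encoding).encode("utf-8")
    stat = os.stat(path)
    with open(path, "wb") as handle:
        handle.write(converted)
    # Restore the original timestamps after rewriting the file.
    os.utime(path, ns=(stat.st_atime_ns, stat.st_mtime_ns))
    return True


# Demo on a throwaway file standing in for a latin-1 ChangeLog.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "ChangeLog")
    with open(sample, "wb") as handle:
        handle.write("René fixed the café example\n".encode("latin-1"))
    first_pass = convert_doc_to_utf8(sample, "latin-1")
    second_pass = convert_doc_to_utf8(sample, "latin-1")
    with open(sample, "rb") as handle:
        result_text = handle.read().decode("utf-8")
```

Running it twice is harmless: the second pass sees valid utf-8 and 
returns False without touching the file.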

-Toshio
