I'm getting the warnings below from rpmlint about "file-not-utf8".
cduce-devel.x86_64: W: file-not-utf8 /usr/share/doc/cduce-devel-0.5.0/tutorial_getting_started.html
cduce-devel.x86_64: W: file-not-utf8 /usr/share/doc/cduce-devel-0.5.0/manual_expressions.html
cduce-devel.x86_64: W: file-not-utf8 /usr/share/doc/cduce-devel-0.5.0/AUTHORS
The AUTHORS file is fair enough - I can use iconv to convert that to UTF-8. However I'm concerned about the HTML files. These are ISO-8859-1 files, and moreover they contain correct Content-Type metadata to mark them as such so I can't see there is a problem with these two files not being UTF-8.
Do I need to worry about it?
Rich.
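For a plain-text file such as AUTHORS, the iconv step mentioned above can be approximated in Python. This is only a sketch: the function name `ensure_utf8` is hypothetical, and it assumes that any input which is not already valid UTF-8 is ISO-8859-1, which nothing in the file itself can confirm.

```python
def ensure_utf8(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Return data unchanged if it is valid UTF-8, else re-encode it.

    Assumption: input that fails to decode as UTF-8 is in source_encoding.
    """
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8 (plain ASCII included)
    except UnicodeDecodeError:
        return data.decode(source_encoding).encode("utf-8")
```

Checking for valid UTF-8 first keeps the conversion from being applied twice to the same file, which would otherwise double-encode the non-ASCII characters.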
On Thu, 2007-09-13 at 17:33 +0100, Richard W.M. Jones wrote:
The AUTHORS file is fair enough - I can use iconv to convert that to UTF-8. However I'm concerned about the HTML files. These are ISO-8859-1 files, and moreover they contain correct Content-Type metadata to mark them as such so I can't see there is a problem with these two files not being UTF-8.
I don't think there's a _problem_ per se; but it's probably better to convert them anyway.
On Thursday 13 September 2007, David Woodhouse wrote:
On Thu, 2007-09-13 at 17:33 +0100, Richard W.M. Jones wrote:
The AUTHORS file is fair enough - I can use iconv to convert that to UTF-8. However I'm concerned about the HTML files. These are ISO-8859-1 files, and moreover they contain correct Content-Type metadata to mark them as such so I can't see there is a problem with these two files not being UTF-8.
I don't think there's a _problem_ per se; but it's probably better to convert them anyway.
If you feel like it, why not. Also be sure to modify the meta http-equiv stuff to say UTF-8 if you do it (or use HTML entities to represent non-ASCII characters, in which case I suppose you could also remove the meta tag).
But quite honestly, I don't think it's a problem at all to leave them as is if the meta charset declaration is correct. In fact, I'm going to suppress this message (as well as the end-of-line char one) for HTML files in upstream rpmlint right now.
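If the HTML files were converted after all, the re-encoding and the meta change would have to happen together. A minimal sketch, assuming the declaration has the usual `<meta ... charset=...>` form; the function name `latin1_html_to_utf8` and the regular expression are illustrative and not taken from rpmlint or any packaging tool:

```python
import re

# Captures everything up to the charset value inside the first <meta> tag
# that carries one. Assumption: the value contains only letters, digits
# and the characters "_ . : -".
_META_CHARSET = re.compile(
    r'(<meta\b[^>]*?charset\s*=\s*["\']?)[\w.:-]+',
    re.IGNORECASE,
)

def latin1_html_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 HTML as UTF-8 and update its meta charset."""
    text = data.decode("iso-8859-1")
    # Rewrite only the first charset declaration found.
    text = _META_CHARSET.sub(r"\g<1>UTF-8", text, count=1)
    return text.encode("utf-8")
```

This sketch does not touch an `encoding="..."` attribute in an XML declaration, so XHTML files that carry one would need that updated separately.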
On Thursday 13 September 2007 at 20:18 +0300, Ville Skyttä wrote:
But quite honestly, I don't think it's a problem at all to leave them as is if the meta charset declaration is correct. In fact, I'm going to suppress this message (as well as the end-of-line char one) for HTML files in upstream rpmlint right now.
If you were feeling evil, you'd have rpmlint run tidy on (x)html files so problems are reported upstream.
On Thursday 13 September 2007 18:31:06 Nicolas Mailhot wrote:
If you were feeling evil, you'd have rpmlint run tidy on (x)html files so problems are reported upstream.
Not only that, but I remember seeing HTML pages written in latin1 without a charset declaration in their metadata. So the warning has its uses. :-)
FWIW tidy will complain as well in this case. :-)
-- Nicolas Mailhot
On Thursday 13 September 2007, José Matos wrote:
On Thursday 13 September 2007 18:31:06 Nicolas Mailhot wrote:
If you were feeling evil, you'd have rpmlint run tidy on (x)html files so problems are reported upstream.
Heh, actually doing a "tidy &>/dev/null" and whining if the exit status is not 0 could be nice :)
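The "run it quietly and look at the exit status" idea could be sketched as follows. This is not how rpmlint does anything; the function name `markup_check_fails` is hypothetical, and the default command assumes a `tidy` binary on the PATH that accepts `-quiet` and `-errors`.

```python
import subprocess

def markup_check_fails(path, command=("tidy", "-quiet", "-errors")):
    """Return True if the checker exits with a non-zero status for path.

    All output is discarded, mirroring "tidy &>/dev/null".
    """
    result = subprocess.run(
        [*command, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # a non-zero exit is the signal, not an exception
    )
    return result.returncode != 0
```

Note that tidy exits with 1 for warnings and 2 for errors, so treating every non-zero status as a failure would also flag files that only produce warnings.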
Not only that, but I remember seeing HTML pages written in latin1 without a charset declaration in their metadata. So the warning has its uses. :-)
Well, maybe, but a UTF-8 encoded HTML doc that lacks an encoding declaration will currently pass through rpmlint without warnings, and that's at least as bad as an ISO-8859-1 encoded HTML doc without one as far as the HTML specs are concerned...
On Thu, 2007-09-13 at 18:35 +0100, José Matos wrote:
Not only that, but I remember seeing HTML pages written in latin1 without a charset declaration in their metadata. So the warning has its uses. :-)
Well... doesn't HTTP default to ISO8859-1 unless the charset is otherwise specified?
On Friday 14 September 2007, David Woodhouse wrote:
On Thu, 2007-09-13 at 18:35 +0100, José Matos wrote:
Not only that, but I remember seeing HTML pages written in latin1 without a charset declaration in their metadata. So the warning has its uses. :-)
Well... doesn't HTTP default to ISO8859-1 unless the charset is otherwise specified?
In the vast majority of cases, HTTP isn't involved with HTML files packaged as %doc. And even for HTML over HTTP, it's not quite that simple, see e.g. http://www.w3.org/TR/html4/charset.html section 5.2.2.
On Fri 14 September 2007 00:26, David Woodhouse wrote:
On Thu, 2007-09-13 at 18:35 +0100, José Matos wrote:
Not only that, but I remember seeing HTML pages written in latin1 without a charset declaration in their metadata. So the warning has its uses. :-)
Well... doesn't HTTP default to ISO8859-1 unless the charset is otherwise specified?
HTTP yes but HTML no
-> see http://www.w3.org/TR/html4/charset.html
« The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter. »
Also:
1. A lot of pages are not ISO8859-1 but ISO8859-15 or the windows latin variant, so *never* assume just because there is no charset declaration it's valid ISO8859-1
2. Default encoding is user-settable at the browser level and users do change the US-friendly ISO8859-1 default so any page without charset declaration will render wrongly on some systems
3. Local HTML pages are read without passing through HTTP so HTTP defaults do not apply
So any HTML page without a charset declaration should be treated as a bug (unless it's in a webapp whose Apache config file forces a particular encoding, or it's an XHTML page with the encoding specified at the XML level)
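A check along those lines might look like the sketch below. It is an assumption-laden illustration, not rpmlint code: the names `declared_charset` and `charset_problem` are hypothetical, only the first 4096 bytes are scanned, and the Apache-forced-encoding exception cannot be detected from the file alone.

```python
import re

# Encoding given at the XML level, e.g. <?xml version="1.0" encoding="UTF-8"?>
_XML_DECL = re.compile(
    rb'^\s*<\?xml\b[^>]*\bencoding\s*=\s*["\']([\w.:-]+)["\']',
    re.IGNORECASE,
)
# Encoding given in a <meta> tag (http-equiv form or bare charset attribute)
_META = re.compile(
    rb'<meta\b[^>]*?\bcharset\s*=\s*["\']?([\w.:-]+)',
    re.IGNORECASE,
)

def declared_charset(data: bytes):
    """Return the charset named in the document itself, or None."""
    head = data[:4096]  # assumption: declarations appear near the top
    for pattern in (_XML_DECL, _META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii")
    return None

def charset_problem(data: bytes):
    """Return a short description of the problem, or None if none is found."""
    charset = declared_charset(data)
    if charset is None:
        return "no charset declaration"
    try:
        data.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return f"content does not decode as {charset}"
    return None
```

Because every byte sequence decodes as ISO-8859-1, the second check can only catch a wrong UTF-8 (or similar) label; a page labelled ISO-8859-1 that is really ISO-8859-15 or a Windows variant would pass unnoticed.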
Nicolas Mailhot wrote:
On Thursday 13 September 2007 at 20:18 +0300, Ville Skyttä wrote:
But quite honestly, I don't think it's a problem at all to leave them as is if the meta charset declaration is correct. In fact, I'm going to suppress this message (as well as the end-of-line char one) for HTML files in upstream rpmlint right now.
If you were feeling evil, you'd have rpmlint run tidy on (x)html files so problems are reported upstream.
These files are actually generated from XML sources using cduce[1], so they are well-formed XHTML 1.0 strict already.
Rich.
[1] cduce is like XSLT, but without the crack.