On Tue, Aug 31, 2010 at 09:54:07AM +0200, Florian La Roche wrote:
> koji already has a generic routine called fixEncoding() and I am currently
> running koji with the patch from https://bugzilla.redhat.com/show_bug.cgi?id=614767
> Maybe your below code to use "errors='replace'" should go into the
> generic koji function? (Haven't looked into this myself if it makes sense...)
Took a look -- fixEncoding() is almost as robust as my code_. I think I
prefer my code, but the result of either transformation will be mangled byte
strings. Basically, we're being handed byte strings without a guarantee that
the strings are utf8. So we first assume that they are utf8. If they are not,
both pieces of code try some way to coerce the output to utf8. My code
substitutes a replacement character for every illegal byte. So the string
'café', handed to us in some non-utf8 encoding, could end up looking like
this: 'caf�'.
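A minimal sketch of that replacement behaviour, written for Python 3 (where
byte strings are ``bytes`` and unicode strings are ``str``) rather than the
Python 2 that koji runs on. The latin-1 input is only an assumed example of
"some non-utf8 encoding":

```python
# 'café' encoded as latin-1: the final byte 0xe9 is not valid utf8 by itself.
raw = b'caf\xe9'

# Strict decoding refuses the input outright.
try:
    raw.decode('utf8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' swaps the offending byte for U+FFFD instead of raising.
replaced = raw.decode('utf8', 'replace')
print(strict_failed, repr(replaced))
```

The original 0xe9 byte is gone after this step, so the real character cannot
be recovered later.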
The fixEncoding() code falls back to an encoding that covers the whole
spectrum of possible bytes. So the output will contain an arbitrary,
unrelated character (or characters, if the invalid bytes are a multi-byte
sequence) in place of the unknown bytes.
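The fallback approach can be sketched in the same way (Python 3 again). Here
latin-1 stands in for the fallback encoding and koi8-r for the unknown real
encoding; both are assumptions picked for illustration, not values taken from
koji:

```python
# One Cyrillic letter (U+0416) encoded in koi8-r, our "unknown" encoding.
original = '\u0416'
raw = original.encode('koi8_r')

# latin-1 assigns a character to every byte value 0-255, so this never raises...
guessed = raw.decode('latin-1')

# ...but the character it produces is not the one that was written.
print(repr(original), repr(guessed))

# The wrong character then gets encoded to perfectly valid utf8.
fixed = guessed.encode('utf8')
```

Nothing fails here and the output is valid utf8, which is why the mangling is
easy to miss.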
It probably makes sense to use fixEncoding() everywhere as the patch in your
bug report does. It could make sense to change fixEncoding to incorporate
ideas from my code. The result would look like this:
    import warnings

    def fixEncoding(value, from_encoding=None, fallback=None):
        # fallback is used for backwards compatibility
        if not from_encoding:
            if fallback:
                warnings.warn('fixEncoding() no longer takes a fallback'
                        ' keyword arg. Use from_encoding instead.',
                        DeprecationWarning)
                from_encoding = fallback
            else:
                from_encoding = 'utf8'

        if isinstance(value, unicode):
            # value is already unicode, so just convert it
            # to a utf8-encoded str
            # Note: with python3, this can fail unless you use an error
            # argument because a unicode string could have been created using
            # the surrogateescape error handler.
            return value.encode('utf8', 'replace')

        # value is a str but may not be valid utf8 (encoded in latin1, for
        # instance). Note that the string is almost certain to be mangled
        # in these instances unless you know what encoding the string is in
        # and have set from_encoding to that encoding.
        return value.decode(from_encoding, 'replace').encode('utf8',
                'replace')
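The python3 note in the comments above can be checked with a short Python 3
sketch. The input bytes are an assumed example; the point is only that text
produced by the surrogateescape handler cannot be encoded strictly:

```python
raw = b'caf\xe9'  # latin-1 bytes, not valid utf8

# surrogateescape smuggles the undecodable byte into the text as a lone
# surrogate code point (0xe9 becomes U+DCE9).
text = raw.decode('utf8', 'surrogateescape')

# A strict encode of that text raises, because lone surrogates are not
# allowed in utf8.
try:
    text.encode('utf8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With an error argument the encode succeeds; 'replace' emits '?' on encode.
safe = text.encode('utf8', 'replace')
print(strict_ok, safe)
```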
.. _code: fixEncoding's ``if not value: return value`` statement will lead to
unicode errors in circumstances like this::
    monkeywrench = fixEncoding(u'')
    name = fixEncoding(u'café')
    combined = monkeywrench + name
The empty string in monkeywrench won't be converted into a byte string.
When you then try to combine the unicode string in monkeywrench with the
byte string in name, Python will try to convert name into a unicode string
with the ascii codec and fail.
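A rough Python 3 illustration of that failure. Python 3 refuses to mix text
and bytes at all, so the implicit ascii decode that Python 2 performs is
written out explicitly here; the variable values are assumed to mirror the
example above:

```python
monkeywrench = ''                  # empty value returned unchanged, still text
name = 'caf\xe9'.encode('utf8')    # non-empty value converted to utf8 bytes

# Python 2 would silently decode name with the ascii codec before
# concatenating; that decode is the step that blows up.
try:
    combined = monkeywrench + name.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding explicitly with the correct codec keeps both operands the same type.
consistent = monkeywrench + name.decode('utf8')
print(failed, repr(consistent))
```

Returning the same type for empty and non-empty input avoids the problem at
the source.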
> Florian La Roche
> dgilmore says that we haven't applied any changes to the koji running in
> Fedora Infrastructure but that he needs to pull in a patch to hotfix our
> instance with. Since I didn't see a patch in the upstream git repository,
> I'm attaching one here.
> From a3c0248a10c5ddbe9a34f76b995fbf10c34e87f5 Mon Sep 17 00:00:00 2001
> From: Toshio Kuratomi <toshio(a)fedoraproject.org>
> Date: Fri, 20 Aug 2010 11:17:43 -0400
> Subject: [PATCH] Fix Unicode traceback with notification email.
> builder/kojid | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
> diff --git a/builder/kojid b/builder/kojid
> index 2f7f480..77ded36 100755
> --- a/builder/kojid
> +++ b/builder/kojid
> @@ -3455,7 +3455,10 @@ Status: %(status)s\r
> message = self.message_templ % locals()
> # ensure message is in UTF-8
> - message = message.encode('utf-8')
> + if isinstance(message, unicode):
> + message = message.encode('utf-8')
> + else:
> + message = unicode(message, encoding='utf8', errors='replace').encode('utf8')
> server = smtplib.SMTP(options.smtphost)
buildsys mailing list