Multirelease effort: Moving to Python 3

Mon Jul 22 05:25:35 UTC 2013

On Mon, Jul 22, 2013 at 10:15:31AM +1000, Nick Coghlan wrote:
> On 07/20/2013 06:11 AM, Toshio Kuratomi wrote:
> > pythonic is a very vague statement and I wouldn't consider most of
> > your list to be examples of those.  Yes, python3 may be a *better*
> > language (and I would include most of your list as "features of
> > python3 that python2 does not have") but a more pythonic language...
> > that's not something that you can readily measure.  For instance, I
> > can make the case that python3's unicode handling is less pythonic
> > than python2 as it violates three rules of the zen of python:
> > 
> > Explicit is better than implicit. Errors should never pass
> > silently. In the face of ambiguity, refuse the temptation to
> > guess.
> > 
> > (To be fair, python2 violated some of these rules in its unicode
> > handling as well, although errors should never pass silently would
> > probably take some work to convince most people :-)
> 
> The *only* reason Python 3 allows any Unicode errors to pass silently
> is because Python 2 tolerated broken system configurations (like
> non-UTF-8 filesystem metadata on nominally UTF-8 systems) by treating
> them as opaque 8-bit strings that were retrieved from OS interfaces
> and then passed back unmodified (see PEP 383 for details). If Python 3
> didn't work on those systems, people would blame Python 3, not the
> already broken system configuration ("But Python 2 works, why is
> Python 3 so broken?"). os.listdir() -> open() is the canonical example
> of the kind of "round trip" activity that we felt we needed to support
> even for systems with improperly encoded metadata (including file names).
> 

Actually, surrogateescape is a *great* improvement over the previous python3
behaviour of silently dropping data that it did not understand.

If python3 could just fix outputting text that contains surrogateescaped
bytes, it would clean up the last portion of this, and I would be able to
stop pointing out the various ways that python3's unicode handling is just
as broken as python2's -- just in different ways. :-)
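
A minimal sketch of the round trip and of the output problem mentioned above.
The byte string and the helper name try_strict_output are made up for
illustration, and the strict encode only approximates what printing to a
UTF-8 stdout typically does:

```python
# Hypothetical file name as the OS might return it: Latin-1 bytes on a
# nominally UTF-8 system.
raw_name = b"caf\xe9"

# PEP 383: the undecodable byte 0xE9 is kept as the lone surrogate U+DCE9.
name = raw_name.decode("utf-8", errors="surrogateescape")
assert name == "caf\udce9"

# Passing the name back to the OS works, because the same error handler
# turns the surrogate back into the original byte.
assert name.encode("utf-8", errors="surrogateescape") == raw_name


def try_strict_output(text):
    """Encode the way a strict UTF-8 output stream would (illustrative helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return "failed"
    return "ok"


# The round trip is fine, but showing the name to the user is not.
strict_result = try_strict_output(name)
plain_result = try_strict_output("cafe")
```

The same string therefore behaves differently depending on whether it goes
back to the OS or out to the user.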

> 
> Tainting would involve having the surrogateescape codec set an
> attribute on a string recording the encoding assumption if it had to
> embed any surrogates in the Private Use Area, as well as a keyword
> only "taint" argument to decode operations (e.g. to force tainting
> when using "latin-1" as a universal text codec). Various string
> operations would then be modified to use the following rules:
> 
> * Both input strings untainted? Output is untainted.
> * One input tainted, one untainted? Output is tainted with the same
> assumption as the tainted input
> * Both inputs tainted with the same assumption? Output is also tainted
> with that assumption.
>
This sounds like it might be nice.  The one thing I'm a little unsure about
is that it sounds like code is going to have to handle this explicitly.
Judging from the way all but a select few people handle text vs encoded
bytes right now, that seems like it won't achieve very much.  OTOH, I could
see this as an additional bit of information that people can use or ignore
as they choose.  I think that could be helpful in some debugging cases.
(Then again, encoding vs text issues often arise precisely because the coder
and the program have no way to know the correct encoding.  When that
happens, the extra information might not be that useful for the majority of
cases anyway.)

> * Inputs tainted with different assumptions? Immediate ValueError
> complaining about the taint mismatch
> 
> String encoding would be updated to trigger a ValueError when asked to
> encode tainted strings to an encoding other than the tainted one.
> 
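
The quoted tainting rules can be sketched roughly as follows.  Nothing like
this exists in Python: TaintedText and its methods are hypothetical names, the
taint lives on a wrapper class rather than on str itself, the proposed "taint"
keyword argument is left out, and encoding names are compared as plain
strings.

```python
class TaintedText:
    """Hypothetical wrapper: text plus the encoding assumed while decoding."""

    def __init__(self, text, taint=None):
        self.text = text
        self.taint = taint  # None means untainted

    @classmethod
    def decode(cls, raw, encoding):
        text = raw.decode(encoding, errors="surrogateescape")
        # Taint only if surrogateescape had to smuggle bytes through as
        # lone surrogates (U+DC80..U+DCFF).
        escaped = any("\udc80" <= ch <= "\udcff" for ch in text)
        return cls(text, encoding if escaped else None)

    def __add__(self, other):
        # Different assumptions on both sides: immediate ValueError.
        if self.taint and other.taint and self.taint != other.taint:
            raise ValueError(
                "taint mismatch: %s vs %s" % (self.taint, other.taint))
        # Otherwise the result carries whichever taint is present, if any.
        return TaintedText(self.text + other.text, self.taint or other.taint)

    def encode(self, encoding):
        # A tainted string may only be encoded with its own assumption.
        if self.taint and encoding != self.taint:
            raise ValueError(
                "tainted as %s, cannot encode as %s" % (self.taint, encoding))
        return self.text.encode(encoding, errors="surrogateescape")


clean = TaintedText.decode(b"name: ", "utf-8")       # untainted
bad_utf8 = TaintedText.decode(b"caf\xe9", "utf-8")   # tainted as utf-8
bad_ascii = TaintedText.decode(b"caf\xe9", "ascii")  # tainted as ascii
combined = clean + bad_utf8                          # tainted as utf-8
```

Here combined.encode("utf-8") gives back the original bytes, while
bad_utf8 + bad_ascii and combined.encode("latin-1") both raise ValueError.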

I'm a little leery of these two rules, the ones that raise ValueError.
After using both python2 and the early versions of python3, I became a firm
believer that the problem with python2's unicode handling wasn't that it
threw exceptions.  Rather, the problem was that the same bit of code would
pass certain data through without error and throw an error on other data.
Programmers who tested their code with only ascii data, or only data encoded
in their locale's encoding, or only when their locale used a utf-8 encoding
were unable to replicate or understand the errors that their users got when
they ran the code in the crazy real-world environments that users inevitably
have.  These rules that throw an exception suffer from the same reliance on
the specific data and environment, and will lead to similar tracebacks that
programmers won't be able to easily replicate.
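
A small illustration of that data dependence, written for python3.  The
explicit decode("ascii") stands in for the implicit ASCII decode that python2
performed when unicode and byte strings were mixed; the function names and
file names are invented for the example.

```python
def py2_style_join(label, raw):
    # Emulates python2's implicit ASCII decode of a byte string that is
    # concatenated with a unicode string.
    return label + raw.decode("ascii")


def run(raw):
    try:
        return py2_style_join("file: ", raw)
    except UnicodeDecodeError:
        return "UnicodeDecodeError"


# The developer's ASCII-only test data passes.
developer_result = run(b"report.txt")

# A user's real file name, encoded as UTF-8, fails in the very same code.
user_result = run("r\u00e9sum\u00e9.txt".encode("utf-8"))
```

Nothing in the code changes between the two calls; only the data does, which
is why the failure never shows up on the developer's machine.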

-Toshio