Multirelease effort: Moving to Python 3

Mon Jul 22 00:15:31 UTC 2013

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 07/20/2013 06:11 AM, Toshio Kuratomi wrote:
> pythonic is a very vague statement and I wouldn't consider most of
> your list to be examples of those.  Yes, python3 may be a *better*
> language (and I would include most of your list as "features of
> python3 that python2 does not have") but a more pythonic language...
> that's not something that you can readily measure.  For instance, I
> can make the case that python3's unicode handling is less pythonic
> than python2 as it violates three rules of the zen of python:
> 
> Explicit is better than implicit. Errors should never pass
> silently. In the face of ambiguity, refuse the temptation to
> guess.
> 
> (To be fair, python2 violated some of these rules in its unicode
> handling as well, although errors should never pass silently would
> probably take some work to convince most people :-)

The *only* reason Python 3 allows any Unicode errors to pass silently
is because Python 2 tolerated broken system configurations (like
non-UTF-8 filesystem metadata on nominally UTF-8 systems) by treating
them as opaque 8-bit strings that were retrieved from OS interfaces
and then passed back unmodified (see PEP 383 for details). If Python 3
didn't work on those systems, people would blame Python 3, not the
already broken system configuration ("But Python 2 works, why is
Python 3 so broken?"). os.listdir() -> open() is the canonical example
of the kind of "round trip" activity that we felt we needed to support
even for systems with improperly encoded metadata (including file names).
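
As a rough illustration of that round trip (a sketch using plain
bytes.decode()/str.encode() rather than real OS calls, with a made-up
file name), the surrogateescape handler from PEP 383 carries an
undecodable byte through a str and restores it on the way back out:

```python
# Hypothetical file name as an OS interface might return it on a nominally
# UTF-8 system: 0xE9 is "e-acute" in Latin-1 but is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# Decoding with surrogateescape smuggles the bad byte through as a lone
# surrogate instead of raising UnicodeDecodeError.
name = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly,
# which is what makes os.listdir() -> open() style round trips possible.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the smuggled byte rather than guessing.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

On POSIX systems, os.fsdecode() and os.fsencode() apply the same
handler with the filesystem encoding.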

You can already force Python 3 into completely strict mode by doing:

    import codecs; codecs.register_error("surrogateescape", codecs.strict_errors)

>>> b"\xe9".decode("ascii", errors="surrogateescape")
'\udce9'
>>> import codecs; codecs.register_error("surrogateescape", codecs.strict_errors)
>>> b"\xe9".decode("ascii", errors="surrogateescape")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

I'd eventually like Python 3 to support a string tainting model, but
who knows when I'll find time to actually work on that :(

To clarify what I mean by "string tainting" (since it's more than the
simple tainted-or-not model used in other languages), currently Python
3 unicode strings may exist in a state similar to 8-bit strings in
Python 2: they're tainted by particular encoding assumptions, so
combining them with arbitrary other pieces of "text" or encoding them
to a different output format isn't a valid operation. Unfortunately,
Python 3, like Python 2, currently allows you to combine strings
tainted by such assumptions without complaint, unless/until you try to
encode them again, and then you *might* get an error, or you might just
get invalid data. It's significantly less common in Python 3 than it
was in Python 2 (as it requires that you decode data using an encoding
that doesn't actually match the data contents, whereas in Python 2 it
just required combining 8-bit data that used two different encodings),
but the problem definitely isn't solved at this point, just mitigated.
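
A small sketch of both outcomes, using made-up sample text: a wrong
decoding assumption either produces silently invalid data, or (when
surrogateescape was involved) an error that only shows up at encode
time.

```python
# Sample data, chosen only for illustration.
utf8_bytes = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

# Wrong assumption: treat UTF-8 data as Latin-1. Every byte maps to some
# code point, so decoding succeeds and nothing complains.
mistaken = utf8_bytes.decode("latin-1")

# Combining with correctly decoded text also succeeds ...
combined = "na\u00efve " + mistaken

# ... and so does encoding, but the result is not the intended text.
silently_wrong = combined.encode("utf-8")
intended = "na\u00efve caf\u00e9".encode("utf-8")
print(silently_wrong == intended)

# With surrogateescape the bad assumption leaves a lone surrogate behind,
# so the failure is at least detected, but only when encoding again.
escaped = b"caf\xe9".decode("ascii", errors="surrogateescape")
try:
    ("na\u00efve " + escaped).encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True
```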

Tainting would involve having the surrogateescape error handler set an
attribute on a string recording the encoding assumption if it had to
embed any lone surrogates (U+DC80 to U+DCFF), as well as a
keyword-only "taint" argument to decode operations (e.g. to force
tainting when using "latin-1" as a universal text codec). Various
string operations would then be modified to use the following rules:

* Both input strings untainted? Output is untainted.
* One input tainted, one untainted? Output is tainted with the same
  assumption as the tainted input.
* Both inputs tainted with the same assumption? Output is also tainted
  with that assumption.
* Inputs tainted with different assumptions? Immediate ValueError
  complaining about the taint mismatch.
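
The four rules above can be sketched as a single function. This is
only an illustration of the proposal, not an existing API: the name
combine_taint and the use of None for "untainted" are assumptions made
for the example.

```python
from typing import Optional


def combine_taint(left: Optional[str], right: Optional[str]) -> Optional[str]:
    """Return the taint of a string built from two inputs.

    Hypothetical helper: a taint is the name of the assumed encoding,
    and None means the input is untainted.
    """
    if left is None:
        # Covers "both untainted" and "only the right input is tainted".
        return right
    if right is None or left == right:
        # Only the left input is tainted, or both share one assumption.
        return left
    # Different assumptions cannot be meaningfully combined.
    raise ValueError(f"taint mismatch: {left!r} vs {right!r}")


print(combine_taint(None, None))
print(combine_taint("latin-1", None))
```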

String encoding would be updated to trigger a ValueError when asked to
encode tainted strings to an encoding other than the tainted one.

Strings would likely gain a "remove_taint" method (name TBD) that
performed the "encode using tainted encoding, redecode using correct
encoding" round trip.
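
To make the encoding rule and "remove_taint" concrete, here is a
minimal sketch using a wrapper class. Nothing like this exists in
Python today: TaintedText, its "taint" field and the method signatures
are all hypothetical, and a real implementation would live on str
itself rather than in a wrapper.

```python
import codecs
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaintedText:
    """Hypothetical stand-in for a str carrying an encoding assumption."""

    text: str
    taint: Optional[str] = None  # assumed encoding, or None if untainted

    def encode(self, encoding: str) -> bytes:
        if self.taint is None:
            return self.text.encode(encoding)
        # Compare normalised codec names so aliases such as "latin-1"
        # and "iso-8859-1" count as the same assumption.
        if codecs.lookup(encoding).name != codecs.lookup(self.taint).name:
            raise ValueError(
                f"cannot encode text tainted as {self.taint!r} to {encoding!r}"
            )
        return self.text.encode(self.taint, errors="surrogateescape")

    def remove_taint(self, correct_encoding: str) -> "TaintedText":
        if self.taint is None:
            return self
        # Encode using the tainted encoding, redecode using the correct one.
        raw = self.text.encode(self.taint, errors="surrogateescape")
        return TaintedText(raw.decode(correct_encoding))


# UTF-8 data read with "latin-1" as a universal text codec, taint forced.
raw = "caf\u00e9".encode("utf-8")
suspect = TaintedText(raw.decode("latin-1"), taint="latin-1")
clean = suspect.remove_taint("utf-8")
print(clean)
```

Under this sketch, suspect.encode("utf-8") raises ValueError, while
suspect.encode("latin-1") returns the original bytes unchanged.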

And yes, this could be used for traditional tainting as well - setting
the taint assumption to something like "<untrusted>" (e.g. for user
input) or "<sensitive>" (e.g. for not-yet-hashed user credentials)
would be enough to prevent serialisation under this scheme.

However, I have my hands full with packaging issues at the moment (see
PEP 426), and then I still want to fix the model for embedding CPython
(see PEP 432). So if it's left to me, there's no way this idea could
become reality before Python 3.5. I may at least lob it in the
direction of python-ideas, though, to see if someone else is prepared
to run with it...

Cheers,
Nick.

- -- 
Nick Coghlan
Red Hat Infrastructure Engineering & Development, Brisbane

Testing Solutions Team Lead
Beaker Development Lead (http://beaker-project.org/)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJR7HmjAAoJEHEkJo9fMO/LwggH/i2eMcsgjLZ1BhuaAJdhxB4+
wDmwKs3wGrmDjCh0JJoo49TiOlBhP5szN4YEqnKWTfmJdbg/pW2bLBaxeAxApVu1
EjXjo9jQR+fxj38D1yyLa8QHJxWbr70CmA7/K9aTDt+rvK83a8a2eIw1GjcnGrv4
6qOJ7E/vbkyCa31wRIXrb/OysizJbJRQd1+luEVWaI2yo9kM/ogcpfB4DyJVW7Sm
/DMvXSpkASPhqXgN9DhHcbLk5eY27rKqr9gZ2UuTt67XWXh1btKj/zc0WPWF7rX+
Twr2dckl0x0iHh637MWxGRjiMBqIBCTUOnQZ+WTjBhLsUUTyAH0yvHFUfuJoffo=
=85mI
-----END PGP SIGNATURE-----