Multirelease effort: Moving to Python 3

Tue Jul 23 05:30:49 UTC 2013

On 07/23/2013 12:42 AM, Toshio Kuratomi wrote:
> On Mon, Jul 22, 2013 at 05:15:50PM +1000, Nick Coghlan wrote:
>> On 07/22/2013 03:25 PM, Toshio Kuratomi wrote:
>> 
>>> If python3 could just finally fix outputting text with 
>>> surrogateescaped bytes then it would finally clean up the last 
>>> portion of this and I would be able to stop pointing out the 
>>> various ways that python3's unicode handling is just as broken
>>> as python2's -- just in different ways. :-)
>> 
>> Attempting to encode data containing surrogate escapes without
>> setting "errors=surrogateescape" is a sign that tainted data has
>> escaped somewhere. So it's late notification of an error, but
>> still indicative of an error somewhere. We'll never silence it by
>> default.
>> 
> That's a bit simplified from what python3's direction on this is
> unless Victor Stinner's work is only intended to be temporary.
> 
> $ export LC_ALL=en_US.utf-8
> $ mkdir abc$'\xff'
> $ python3.3
> >>> import os
> >>> se_dirlisting = os.listdir('.')
> >>> # surrogateescape in a text string:
> >>> repr(se_dirlisting[0])
> "'abc\\udcff'"
> >>> # This doesn't traceback and it has to encode se_dirlisting when passing
> >>> # it out of python:
> >>> os.listdir(se_dirlisting[0])
> []
> >>> # Works with other modules as well:
> >>> import subprocess
> >>> subprocess.call(['ls', se_dirlisting[0]])
> 0

For some APIs we set surrogateescape on the user's behalf. It's still
getting set, though :)
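
A minimal sketch of the difference, using explicit codec calls rather than
the OS-facing APIs from the quoted session. The byte string and variable
names are illustrative only; the point is that a strict encode rejects the
escaped byte, while passing errors="surrogateescape" restores the original
bytes.

```python
# An undecodable byte (0xff is never valid in UTF-8) smuggled into a str.
raw = b"abc\xff"
text = raw.decode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate that represents the bad byte.
strict_failed = False
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True

# Setting the error handler explicitly round-trips the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```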

> AFAIK, the justification is that the surrogateescape'd strings are
> both coming from and going to the OS.  They're crossing outside of
> the line that python3 draws around itself and there's an implicit
> encoding and decoding there.  This seems fine to me as a strategy.
> The problems are just that there are places where python3 doesn't
> yet use surrogateescape when crossing this boundary.

Strictly speaking, there are a bunch of interfaces that we declare as
operating based on "os.fsencode" and "os.fsdecode". The fact we're not
especially clear on which encoding/decoding strategy we use for
particular APIs is the docs gap Armin was talking about in
http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/

The following (if I recall correctly) all use os.fsencode/decode:

  os.environ
  os.listdir
  sys.argv
  os.exec* functions
  subprocess environ
  subprocess arguments

We *should* use os.fsdecode to ensure file name attributes are always
unicode, but currently do not (I just filed a bug for that:
http://bugs.python.org/issue18534).
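
As a rough illustration of what those interfaces rely on, here is a sketch
of the os.fsdecode/os.fsencode round trip. It assumes a POSIX system, where
the filesystem error handler is surrogateescape; the file name is made up
and is never touched on disk.

```python
import os

# A file name as raw bytes, containing a byte that may not be decodable
# in the filesystem encoding.
raw_name = b"abc\xff"

# os.fsdecode turns bytes into str using the filesystem encoding; on
# POSIX, undecodable bytes are preserved via surrogateescape.
text_name = os.fsdecode(raw_name)

# os.fsencode reverses the process, so the original bytes come back.
restored = os.fsencode(text_name)
```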

> The one I was specifically thinking of when I wrote this was the
> print() function:
> 
> >>> print(se_dirlisting[0])
> Traceback (most recent call last):
>   File "<stdin>", line 3, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 3: surrogates not allowed

Unfortunately, the standard streams *don't* currently use the same
scheme as we assume for other operating system APIs. We hold back
because we're reluctant to assume that all data received over those
streams will be in an ASCII or UTF-8 compatible encoding (e.g. our
feedback from Japan is that there are still plenty of systems there
using non-ASCII compatible encodings).

> (When I mentioned this at pycon you brought up: 
> http://bugs.python.org/issue15216 which looked promising but seems
> to have stalled ;-)

Alas, there's currently no champion to drive it. I'm interested, but
don't have enough time (improving the Unicode handling is third on the
list of "big problems in Python" that I currently care about, after
packaging and the initialisation code). Several of the other core devs
are sufficiently dubious of the notion of allowing mojibake to be
created silently that they're conflicted on offering the feature at
all, and thus not motivated to work on it :(

Since absolutely nobody in the world cares enough about the CPython
upstream to pay *anyone* to work on it full time (not even the Linux
distros or the members of the OpenStack foundation), this situation is
unlikely to change any time soon :(

>>>> * Inputs tainted with different assumptions? Immediate 
>>>> ValueError complaining about the taint mismatch
>>>> 
>>>> String encoding would be updated to trigger a ValueError
>>>> when asked to encode tainted strings to an encoding other
>>>> than the tainted one.
>>>> 
>>> 
>>> I'm a little leery of these.  The reason is that after using
>>> both python2 and the early versions of python3 I became a firm
>>> believer that the problem with python2's unicode handling
>>> wasn't that it threw exceptions, rather the problem was that
>>> the same bit of code was too prone to passing through certain
>>> data without error and throwing an error with other data.
>>> Programmers who tested their code with only ascii data or only
>>> data encoded in their locale's encoding, or only when their
>>> locale was a utf-8 encoding were unable to replicate or
>>> understand the errors that their users got when they ran them
>>> in the crazy real-world environments that users inevitably
>>> have.  These rules that throw an Exception suffer from the same
>>> reliance on the specific data and environment and will lead to
>>> similar tracebacks that programmers won't be able to easily 
>>> replicate.
>> 
>> You can't get away from this problem: decoding with the wrong
>> encoding corrupts your data.
>> 
> Incorrect.  Decoding with the wrong encoding and then attempting to
> operate on the decoded data in certain ways will corrupt your data.
> Operating on it in certain other ways will not.  surrogateescape
> was designed because people realized that "round tripping" the data
> was a desirable feature.  Your thoughts on tainting still allow for
> manipulating strings with other strings that are tainted in the
> same manner.

Agreed, there should have been a "potentially" in there, since it
depends on what you do with it.

> MvL then produced the surrogateescape handler which was supposed to
> address these things by allowing people to round trip undecodable
> bytes into text strings and back out again.  This was good in that
> we no longer *silently* threw data away just because we couldn't
> decode it.  But it was bad as it reintroduced the python2 problem
> of having a valid text string that portions of the standard python3
> framework (mostly stdlib functions) could not handle.  This meant
> that the programmer could test print(os.listdir()) on their system
> where all filenames were utf-8 and things would work.  But a user
> could then run the same code on a system where the locale did not 
> match with the encoding of all the filenames and would get a
> traceback. This was a portion of the python2 wart resurfaced.

Right, but the data driven aspect is largely unavoidable at this point
- until we get data that doesn't match the system configuration as we
understand it, then as far as Python knows, the system configuration is
correct.

> Victor Stinner's work since then to integrate surrogateescape into
> more stdlib functions has helped immensely to craft a unified
> strategy for undecodable bytes.  print() still throws a traceback
> but most other places where we take in and send out undecodable
> bytes just work.

Yes, I agree printing to stdout is still a problem. That's why I wrote
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
and am in favour of making it easy to change the encoding and error
settings of an existing stream.
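
One way to approximate that, sketched under assumptions: wrap the
underlying binary stream in a fresh io.TextIOWrapper that uses
errors="surrogateescape". The helper name make_escaping_writer is
hypothetical, and io.BytesIO stands in for sys.stdout.buffer so the result
can be inspected.

```python
import io


def make_escaping_writer(binary_stream, encoding="utf-8"):
    """Wrap a binary stream so escaped bytes are written back out as-is."""
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors="surrogateescape",
        newline="\n",          # keep line endings identical on every platform
        line_buffering=True,
    )


buffer = io.BytesIO()          # stand-in for sys.stdout.buffer
writer = make_escaping_writer(buffer)

# This is the call that raised UnicodeEncodeError on the default stdout.
print("abc\udcff", file=writer)
writer.flush()
captured = buffer.getvalue()
```

The same wrapper applied to the real stdout would pass the undecodable byte
straight through, which is also how mojibake can reach the terminal if the
encoding guess is wrong.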

> Now, I'm not saying that the idea of annotating surrogateescaped
> data with information about the encoding it was created using is
> bad.  I am *leery* (not yet convinced that the proposal is safe but
> also not yet convinced that it is harmful... just knowing that
> there's a potential bad interaction there) of throwing an exception
> as a result of examining that data as it brings back a portion of
> the python2 wart of throwing an exception late in the process
> rather than early, when the data is entering the system.  But, just
> like surrogateescape itself, I can see that the idea is that the
> error cases are supposed to shrink with this iteration -- at some
> point the hope would be that the error conditions are small enough
> that the programmer no longer has to care about them and perhaps
> this is that tipping point.

Right, the situation we genuinely care about is that if a system is
properly configured to use UTF-8 for everything, then it should all be
fine and wonderful. Everything beyond that is a best effort attempt to
better cope with what we consider to be legacy system configurations.

> Some comments though -- if you're going to throw an error, don't
> throw a ValueError.  It's immensely useful in python2 to get
> something (UnicodeError) that is only thrown by an attempt to
> transform the data wrongly for two reasons:  Lazy people can just
> catch UnicodeError and be done with it.  People debugging issues
> can immediately see that the bug falls into a (relatively) narrow
> space of issues.

Yeah, UnicodeError or a new TaintError would be better.

> If someone creates their own surrogateescaped string: se_str =
> 'abc\udcff' the taint-checking machinery should allow that to be
> combined with strings from any other sources.  It likely means that
> the programmer is working around an API they don't control that
> takes text strings but should take a bytes string.

Sure, I think tainting should be a separate mechanism. It's also
purely hypothetical vapourware at this point, unless/until someone
else finds it interesting enough to try to implement it.
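
Purely to illustrate the exception-type point, here is a hypothetical
sketch; nothing like it exists in CPython. TaintError, has_surrogate_escapes
and encode_checked are invented names, and because str carries no taint
information the source encoding has to be passed in by hand. Subclassing
UnicodeError means existing "except UnicodeError" handlers still catch it.

```python
import codecs


class TaintError(UnicodeError):
    """Hypothetical: escaped bytes are being encoded with a mismatched codec."""


def has_surrogate_escapes(text):
    # surrogateescape maps bytes 0x80-0xff to code points U+DC80-U+DCFF.
    return any("\udc80" <= ch <= "\udcff" for ch in text)


def encode_checked(text, encoding, tainted_encoding=None):
    """Encode text, refusing escaped bytes that came from another encoding."""
    if not has_surrogate_escapes(text):
        return text.encode(encoding)
    same_codec = (
        tainted_encoding is not None
        and codecs.lookup(tainted_encoding).name == codecs.lookup(encoding).name
    )
    if not same_codec:
        raise TaintError(
            "escaped bytes from %r cannot be encoded as %r"
            % (tainted_encoding, encoding)
        )
    return text.encode(encoding, errors="surrogateescape")


matching = encode_checked("abc\udcff", "utf-8", tainted_encoding="utf8")

caught_as_unicode_error = False
try:
    encode_checked("abc\udcff", "latin-1", tainted_encoding="utf-8")
except UnicodeError:
    caught_as_unicode_error = True
```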

Cheers,
Nick.

-- 
Nick Coghlan
Red Hat Infrastructure Engineering & Development, Brisbane

Testing Solutions Team Lead
Beaker Development Lead (http://beaker-project.org/)