[sugar] Python Style Guide

Ian Bicking ianb
Tue Nov 14 13:28:40 EST 2006


Dan Williams wrote:
>> Not directly your point, but I do think we need to be very careful about 
>> avoiding encoding strings, doing the decoding at the boundaries whenever 
>> possible (where dbus is a boundary).
> 
> I'd advocate mandating UTF-8 everywhere, but that's just me...  Is there
> a way to make python's constant strings (ie, 'a = "something"') always
> be Unicode objects?

Well, "inside" the Python process there are actual unicode strings, 
which aren't encoded in any form.  And there do exist strings which 
can't be decoded, because they are not textual data.
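
A minimal sketch of "decoding at the boundaries", written for Python 3, where bytes and str play the roles that str and unicode play in Python 2. The helper names and the choice of UTF-8 as the wire encoding are assumptions for illustration, not anything dbus or Sugar defines:

```python
# Sketch only: read_from_boundary / write_to_boundary are hypothetical names,
# and UTF-8 is an assumed wire encoding.

def read_from_boundary(raw, encoding="utf-8"):
    """Decode bytes arriving from outside into text, once, at the edge."""
    return raw.decode(encoding)

def write_to_boundary(text, encoding="utf-8"):
    """Encode text back to bytes only when it leaves the process."""
    return text.encode(encoding)

incoming = b"t\xc9\x9bst"             # UTF-8 bytes as they might arrive
text = read_from_boundary(incoming)   # plain text inside the process
outgoing = write_to_boundary(text)    # bytes again on the way out

# Not every byte string is text: this one is not valid UTF-8.
binary = b"\xff\xfe\x00\x01"
try:
    binary.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
```

Everything between the two helpers works with text only; data like `binary` stays as bytes and is never decoded.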

Anyway, about encoding:

* A constant string is just data, it doesn't hold any encoding 
information.  So there's no way to indicate what encoding it has.  If 
you know its encoding, you should probably just decode it.

* A constant unicode string (u"") doesn't have an encoding either; it's 
just unicode.  The *source* is encoded in some way.  If you include the 
UTF8 marker (the PEP 263 coding declaration) at the beginning of a Python 
source file, it will be assumed to be UTF8 content.  This is what I 
advocate for OLPC (in the style 
guide here: 
http://wiki.laptop.org/go/Python_Style_Guide#Encodings_.28PEP_263.29). 
Mostly we just have to make sure the editing tools included with the 
laptop produce and preserve that UTF8 marker.

* I just tested it, and a non-unicode constant string also gets the 
encoded data.  So if you include something like s = "tɛst" in your 
source (with the UTF8 marker), the result is s == 't\xc9\x9bst'.  Though 
we really should be using unicode strings for our textual data.  But if 
we have some non-unicode docstring and someone puts some unicode data in 
it (e.g., an author name), it at least won't break, and that's good.

* Some textual strings can't be unicode; specifically Python 
identifiers.  So, for instance, "obj.__dict__[u'tɛst'] = 'foo'" is 
illegal, because objects can't have unicode attributes.  In these cases 
the attributes should simply be ASCII.
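
The difference between encoded data and unicode text in the points above can be sketched as follows. This is Python 3 syntax, so a plain literal is already text and `.encode()` produces what a Python 2 non-unicode constant would hold; the variable names are just for illustration:

```python
# -*- coding: utf-8 -*-
# The line above is a PEP 263 declaration; it only takes effect on the first
# or second line of a file. Python 3 already assumes UTF-8 without it.

text = "tɛst"                  # text literal: four code points, no encoding
data = text.encode("utf-8")    # bytes: the encoded form, five bytes

print(repr(data))              # b't\xc9\x9bst'
print(len(text), len(data))    # 4 5

# Bytes carry no record of their encoding; decoding with the wrong one
# succeeds silently here and yields different text.
wrong = data.decode("latin-1")
print(wrong == text)           # False
```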

In Python 3 normal constant strings will all be unicode strings, and 
there will be a different syntax for binary literals (or maybe no syntax 
at all).  But that definitely won't happen in the Python 2 line.

There is a way to set the default encoding to utf8 (by default it is 
ascii), but that introduces lots of weird artifacts and is strongly 
discouraged.  The default encoding matters when you do comparisons 
between strings and a few other situations -- generally either the str 
has to be decoded to a unicode object, or the unicode object encoded 
with the default encoding before they can be usefully compared.  You can 
disable the default encoding entirely, but I suspect it would break far 
too many things to do so.  (Though I suppose we could also make the 
default encoding something that will produce a warning, and at least get 
some indication of possible encoding errors.)
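
As a rough illustration of comparing without relying on any default encoding, here is a sketch in Python 3, which never decodes implicitly. The `same_text` helper is a hypothetical name, and UTF-8 as its default is an assumption:

```python
def same_text(raw, text, encoding="utf-8"):
    """Compare bytes with text by decoding explicitly first."""
    try:
        return raw.decode(encoding) == text
    except UnicodeDecodeError:
        return False   # not valid text in this encoding, so not equal

raw = b"t\xc9\x9bst"
text = "t\u025bst"

print(raw == text)                      # False: no implicit decoding happens
print(same_text(raw, text))             # True
print(same_text(raw, text, "ascii"))    # False
```

Naming the encoding at the call site makes the decision visible, which is the opposite of what a process-wide default encoding does.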

Generally if we test with non-ASCII data we'll see encoding problems 
fairly early.  With ASCII data encoding problems can stay hidden quite 
easily.  English speakers tend not to create non-ASCII test data, which 
is a problem.
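
A small sketch of how ASCII-only test data hides this kind of bug (Python 3; both functions are made up for the example):

```python
def shout_buggy(raw):
    # Bug: assumes every input is ASCII.
    return raw.decode("ascii").upper().encode("ascii")

def shout(raw, encoding="utf-8"):
    # Same operation with an explicit, assumed UTF-8 encoding.
    return raw.decode(encoding).upper().encode(encoding)

ascii_sample = b"hello"
non_ascii_sample = "h\u00e9llo".encode("utf-8")   # "héllo"

assert shout_buggy(ascii_sample) == b"HELLO"      # the bug stays hidden

try:
    shout_buggy(non_ascii_sample)
    bug_found = False
except UnicodeDecodeError:
    bug_found = True

print(bug_found)                                                  # True
print(shout(non_ascii_sample) == "H\u00c9LLO".encode("utf-8"))    # True
```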

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org

