[sugar] Python Style Guide
Ian Bicking
ianb
Tue Nov 14 13:28:40 EST 2006
Dan Williams wrote:
>> Not directly your point, but I do think we need to be very careful about
>> avoiding encoding strings, doing the decoding at the boundaries whenever
>> possible (where dbus is a boundary).
>
> I'd advocate mandating UTF-8 everywhere, but that's just me... Is there
> a way to make python's constant strings (ie, 'a = "something"') always
> be Unicode objects?
Well, "inside" the Python process there are actual unicode strings,
which aren't encoded in any form. And there do exist strings which
can't be decoded, because they are not textual data.
Anyway, about encoding:
* A constant string is just data, it doesn't hold any encoding
information. So there's no way to indicate what encoding it has. If
you know its encoding, you should probably just decode it.
* A constant unicode string (u"") doesn't have encoding either, it's
just unicode. The *source* is encoded in some way. If you include the
UTF8 marker (a PEP 263 coding line such as "# -*- coding: utf-8 -*-") at
the beginning of a Python source file, it will be assumed to be UTF8
content. This is what I advocate for OLPC (in the style
guide here:
http://wiki.laptop.org/go/Python_Style_Guide#Encodings_.28PEP_263.29).
Mostly we just have to make sure the editing tools included with the
laptop produce and preserve that UTF8 marker.
* I just tested it, and a non-unicode constant string also gets the
encoded data. So if you include something like s = "tɛst" in your
source (with the UTF8 marker), the result is s == 't\xc9\x9bst'. Though
we really should be using unicode strings for our textual data. But if
we have some non-unicode docstring and someone puts some unicode data in
it (e.g., an author name), it at least won't break, and that's good.
* Some textual strings can't be unicode; specifically Python
identifiers. So, for instance, "obj.__dict__[u'tɛst'] = 'foo'" is
illegal, because objects can't have unicode attributes. In these cases
the attributes should simply be ASCII.
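As a rough sketch of "decode at the boundary" (written against Python 3 syntax, where byte strings and text are already separate types; from_wire() and to_wire() are made-up helper names, not part of Sugar or dbus):

```python
# -*- coding: utf-8 -*-
# Sketch only: from_wire() and to_wire() are hypothetical helpers.

def from_wire(raw, encoding="utf-8"):
    """Decode bytes arriving from outside the process into text."""
    return raw.decode(encoding)

def to_wire(text, encoding="utf-8"):
    """Encode text just before it leaves the process."""
    return text.encode(encoding)

raw = b"t\xc9\x9bst"      # five bytes of UTF-8, e.g. read from a file
text = from_wire(raw)     # four characters of text, no encoding attached

assert len(raw) == 5
assert len(text) == 4
assert to_wire(text) == raw

# Not every byte string is text: this one is not valid UTF-8,
# so trying to decode it fails.
binary = b"\xff\xfe\x00\x01"
try:
    from_wire(binary)
except UnicodeDecodeError:
    decoded_ok = False
else:
    decoded_ok = True
```

The idea is that everything between from_wire() and to_wire() only ever sees text, and binary data that isn't text never goes through either helper.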
In Python 3 normal constant strings will all be unicode strings, and
there will be a different syntax for binary literals (or maybe no syntax
at all). But that definitely won't happen in the Python 2 line.
There is a way to set the default encoding to utf8 (by default it is
ascii), but that introduces lots of weird artifacts and is strongly
discouraged. The default encoding matters when you do comparisons
between strings and a few other situations -- generally either the str
has to be decoded to a unicode object, or the unicode object encoded
with the default encoding before they can be usefully compared. You can
disable the default encoding entirely, but I suspect it would break far
too many things to do so. (Though I suppose we could also make the
default encoding something that will produce a warning, and at least get
some indication of possible encoding errors.)
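To make the comparison issue concrete, here is a small simulation of what I understand Python 2 to do when a str meets a unicode object. It is only an approximation: py2_style_equal() is a made-up function, and the default encoding is passed in as a parameter rather than read from the interpreter.

```python
def py2_style_equal(byte_str, text, default_encoding="ascii"):
    """Roughly mimic comparing a Python 2 str with a unicode object.

    The byte string is decoded with the default encoding first. If that
    fails, the two are treated as unequal (Python 2.5 and later also
    emit a UnicodeWarning at that point).
    """
    try:
        return byte_str.decode(default_encoding) == text
    except UnicodeDecodeError:
        return False

# ASCII-only data compares equal, so nothing looks wrong.
ascii_case = py2_style_equal(b"test", u"test")

# The same text with one non-ASCII character silently compares unequal.
non_ascii_case = py2_style_equal(b"t\xc9\x9bst", u"t\u025bst")

# With a utf8 default the comparison "works", which is the tempting part.
utf8_default_case = py2_style_equal(b"t\xc9\x9bst", u"t\u025bst", "utf-8")
```

Changing the default makes the second case pass, but only for data that really is UTF8, which is part of why I'd rather decode explicitly.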
Generally if we test with non-ASCII data we'll see encoding problems
fairly early. With ASCII data encoding problems can stay hidden quite
easily. English speakers tend not to create non-ASCII test data, which
is a problem.
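A minimal illustration of how ASCII test data hides this kind of bug (buggy_read() is a hypothetical function with a deliberately wrong encoding, not code from Sugar):

```python
def buggy_read(raw):
    # Deliberate bug: the data is really UTF-8, but we decode it as Latin-1.
    return raw.decode("latin-1")

def roundtrip_ok(text):
    """Encode as UTF-8, read back with buggy_read(), and compare."""
    return buggy_read(text.encode("utf-8")) == text

ascii_passes = roundtrip_ok(u"test")            # the bug stays hidden
non_ascii_passes = roundtrip_ok(u"t\u025bst")   # the bug shows up at once
```

With only the first test case the wrong encoding would never be noticed; a single non-ASCII sample catches it.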
--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org