[Sugar-devel] Unicode strings in translations

Benjamin Berg benzea at sugarlabs.org
Thu Aug 16 09:29:30 EDT 2012


Hello,

Disclaimer: I have not read the entire discussion.

So, in Sugar we are mixing Python unicode strings and UTF-8 encoded
Python byte strings. This causes trouble because, once both object types
are mixed, the str() object is implicitly converted to unicode(). This
conversion fails for any non-ASCII byte because Python uses ASCII as the
default encoding.
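
To make the failure concrete, here is a minimal sketch of the decode step
that Python 2 performs implicitly when a str meets a unicode object. The
ASCII decode is spelled out explicitly, so the snippet behaves the same on
Python 2 and Python 3; the variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: the implicit str -> unicode coercion in Python 2 amounts to an
# explicit .decode('ascii'), which is written out here.

utf8_bytes = u"\xa1Hola".encode("utf-8")   # UTF-8 byte string for "¡Hola"

# Decoding with the ASCII codec fails on the first non-ASCII byte.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with the codec the bytes were actually encoded in works.
decoded = utf8_bytes.decode("utf-8")

# Pure ASCII byte strings decode fine either way, which is why the
# problem stays hidden until a non-ASCII character shows up.
plain = b"Hola".decode("ascii")
```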

Now the situation is:
 * unicode is the better string type in python
 * gettext can return either str or unicode (currently it returns str,
UTF-8 encoded, I hope)
 * GTK+ uses UTF-8 by definition
 * PyGI *accepts* unicode() or UTF-8 encoded str() objects.
 * PyGI *returns* UTF-8 encoded str() objects.

Now, in general I would say using Python unicode() strings is *much*
better than using a Python (byte) str(). The trouble is, no matter what
you set gettext to (unicode vs. UTF-8 encoded byte string), you will get
issues in normal use cases:

In the str() case, this will fail:
>>> _("¡Hola %s!") % u"äöüasdf"
But it will work when using strings from GTK+ widgets:
>>> _("¡Hola %s!") % entry.get_text()

Now, when you swap the gettext setting, you get (note that I added the u""):
Works:
>>> _(u"¡Hola %s!") % u"äöüasdf"
Fails:
>>> _(u"¡Hola %s!") % entry.get_text()

So, there we are in trouble again as soon as the user enters a non-ASCII
character (not nice to debug).

I see the following ways forward:
 * Choose either utf-8 str or unicode and convert where necessary
 * Get python to use utf-8 as its default encoding (It seems there are
good arguments against this?). This would mean that the conversion
between unicode/str works by default.
 * Somehow get PyGI to return unicode() objects and switch gettext.
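
As a rough sketch of the first option (pick one type and convert at the
boundaries), something like the following could work. The helper names
to_unicode and to_utf8 are made up for this mail and are not part of Sugar,
gettext or PyGI; the two byte strings merely stand in for a UTF-8
translation and for text coming back from a GTK+ widget.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers, not part of Sugar, gettext or PyGI.

text_type = type(u"")   # unicode on Python 2, str on Python 3


def to_unicode(value, encoding="utf-8"):
    """Return value as a unicode string, decoding byte strings."""
    if isinstance(value, bytes):      # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


def to_utf8(value):
    """Return value as a UTF-8 encoded byte string."""
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value


# Stand-ins for a UTF-8 translation and for text typed into a widget.
translated = u"\xa1Hola %s!".encode("utf-8")
typed = u"\xe4\xf6\xfcasdf".encode("utf-8")

# Normalize both sides to unicode before formatting, so no implicit
# ASCII decode is ever attempted.
greeting = to_unicode(translated) % to_unicode(typed)
```

Passing a value that already has the target type through either helper
leaves it unchanged, so they can be applied without knowing which type a
given API happens to return.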

I do think that either way, there will likely be quite a few places where
things go wrong, and it is hard to get right.

For example take ticket #3763. I do think that the fix is unfortunately
wrong:
http://git.sugarlabs.org/get-books/mainline/commit/165ce7c28a51ec016f9b4dcbb2dd27203539d186

The patch only moves the failure to a different case. Before, it failed
when the translation contained non-ASCII characters; now it fails when
the title, author or summary contains non-ASCII characters.

In the current situation, a more correct fix would be:
  _('Title:\t\t') + self.selected_title.encode('utf-8')
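
A small sketch of why encoding the unicode side works here, assuming the
translated label is a UTF-8 byte string and the title is a unicode object.
The sample values are made up and only stand in for the real ones.

```python
# -*- coding: utf-8 -*-
label = u"T\xedtulo:\t\t".encode("utf-8")   # stand-in for _('Title:\t\t')
selected_title = u"El cami\xf3n"            # stand-in for self.selected_title

# Both operands are byte strings, so no implicit ASCII decode happens.
line = label + selected_title.encode("utf-8")

# The result is still valid UTF-8 and round-trips to the expected text.
roundtrip = line.decode("utf-8")
```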

Benjamin

On Mon, 2012-08-13 at 13:35 -0300, Manuel Kaufmann wrote:
> Hello,
> 
> I'm working on the Typing Turtle Gtk3 port and I found an error with the
> translations encoding. The thing is we can't combine Unicode strings
> and 8-bit strings:
> 
> 
> >>> '¡Hola %s!' % 'camión'
> '\xc2\xa1Hola cami\xc3\xb3n!'
> 
> >>> u'¡Hola %s!' % u'camión'
> u'\xa1Hola cami\xf3n!'
> 
> >>> '¡Hola %s!' % u'camión'
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 0: ordinal not in range(128)
> 
> >>> u'¡Hola %s!' % 'camión'
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
> 4: ordinal not in range(128)
> 
> 
> In Typing Turtle the _ function (from gettext import gettext as _) is
> returning 8-bit strings, so it crashes when it tries to do something
> like this:
> 
> _('Congratulations!  You earned a %(type)s medal!') % {'type': u'gold'}
> [...]
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
> 4: ordinal not in range(128)
> 
> But, if we don't mix Unicode and 8-bit strings, the substitution works:
> 
> >>> _('Congratulations!  You earned a %(type)s medal!')
> "Felicidades!  Has obtenido una medalla de %(type)s!"
> >>> _('Congratulations!  You earned a %(type)s medal!') % {'type': 'ORO'}
> "Felicidades!  Has obtenido una medalla de ORO!"
> 
> To get Unicode strings from gettext I had to put these lines in my
> lessonscreen.py file:
> 
> import gettext
> gettext.install('po', unicode=True)
> 
> >>> _('Congratulations!  You earned a %(type)s medal!')
> u"Felicidades!  Has obtenido una medalla de %(type)s!"
> >>> _('Congratulations!  You earned a %(type)s medal!') % {'type': u'ORO'}
> u"Felicidades!  Has obtenido una medalla de ORO!"
> 
> So, is there a way to make this available at Sugar level? This issue
> appears in many activities and it would be great to solve it
> "upstream" :)
> 
> Thanks,
> 
> Reference: http://docs.python.org/library/gettext.html
> 



