[sugar] Develop i18n design (was Re: Develop activity (Oops...))

Jameson "Chema" Quinn jquinn
Tue Aug 14 15:34:41 EDT 2007


Just a quick response with first impressions about your comments.

There was one productive misunderstanding in what you said: I was thinking
of (well, actually, I'm almost done with a first mockup using IDLE) doing
ALL the translation in a layer that was invisible to the interpreter. Your
idea of having a block of assignments in a module's global namespace may be
useful - although it opens up the whole can-of-worms of namespace and scope.
I'll have to think about it to see if it's workable.

Further responses:


On 8/13/07, Marc-Antoine Parent <maparent at gmail.com > wrote:

> However, Jameson, if I may, I would take issue with a few assumptions
> of your model: most especially that of a "preferred language" for
> modules.
> I am referring to your point 3:
> > 3-This dictionary ONLY contains translations for the "public
> > interface" of somemodule.py, that is, those identifiers which are
> > used in importer modules. It also defines a single, unchanging
> > "preferred language" for that file, which is the assumed language
> > for all non-translated identifiers in that file.
>
> I am especially interested in collaborative work; and I believe it is
> not unreasonable to hope that children between schools in different
> countries will get to share some work.
> That would mean that a given modules may have many editors, possibly
> introducing identifiers in more than one non-English language.
> From that point of view, "preferred language" is a feature of an
> editing environment, not of a module. New identifiers should be
> individually tagged by language; I see that tagging as appropriate
> work for the editing environment. Basically, upon loading the file,
> all local identifiers would be read in memory; upon saving, new ones
> would be saved with a language tag. (Plausibly as a postfix,
> Identifier_i18n_2letterLanguageCode...)
> (I would otherwise follow Mike's suggestion to use a fixed
> transliteration table for non-latin scripts.)


I originally intended a design something like this. Here are the problems I
found:
1. It requires an assumption that all those who edit a given module have our
magic editor, and that they all have their "preferred language" set
correctly. (Imagine a Belizean child who left it set on their country's
default "English" but actually edited in Spanish, Garifuna, Creole, or
Qeqchi).
2. It also means that the dictionary for a given module could get filled up
with translations of each function's internal variables, etc.
3. I see no way of parsing a file to see what is "public" (for importing)
and what is "private" (module internal). Thus the entire dictionary for a
module would have to be imported with the module.
4. For similar reasons, modules would have to import the dictionaries for
modules 2 levels up in the import inheritance. My paradigm lets you manually
do this only for the rare cases it's needed.

And what is the benefit? Modules which are a "language soup", maintained by
an international coalition of children who can't even talk to one another.

I support the idea of international, cross-language programming
collaboration. However, I think that a basic assumption would be that there
is at least one common language of communication between the participants,
and that any given module has an owner and thus a preferred language for all
its still-untranslated identifiers. If somebody wanted to "add to" that
module, they'd have to either write in that preferred language or use an
explicit import and subclass (or, of course, just do it in their language
for their own private use, because that file would then be explicitly marked
as "messy and suggested not for sharing")...

Do you still object? Because, um, I hate to be gauche, but well... I'm doing
the coding here so far... so unless I hear a consensus that I'm doing it
wrong I think I'll continue with this assumption.



> .....
> Even if someone decides on a better translation later on, more than
> one version may be kept in the translation block.


You just blew my mind. I need to think this suggestion over with a pencil
and paper.

> This has the disadvantage of polluting the code, but the advantage of
> polluting the filesystem less.
>
> I am realizing a broader application of this mechanism: the
> translation block could be tagged with a revision number (if
> __revision__<540:), and the "import" command could mention the last
> known revision; so translation blocks would only be activated at
> need. But that's all another story.


Again, I have to think about that.

> Another quick related note: What if someone adds a translation
> between two non-English languages? In your first email, you
> explicitly forbid it; I am not sure that is necessary. (I am not sure
> you think of it as necessary in your later design as well.) Clearly,
> however, X to Y translations may have to refer to the history (as
> language X is replaced by English) so as to become English to Y.


I think that in this case, we could make an explicit English placeholder,
along the lines of "fr_une_fonction_in_English". Then when the English is
added later, our magic can clean things up.
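
To make the placeholder idea concrete, here is a rough sketch. The helper
names and the exact placeholder format are assumptions for illustration, not
code from the mockup; rows are lists with the English word in column 0.

```python
# Hypothetical helpers: names and placeholder format are assumptions.
PLACEHOLDER_SUFFIX = "_in_English"

def make_placeholder(lang, word):
    """Build a stand-in English name for a word known only in `lang`."""
    return f"{lang}_{word}{PLACEHOLDER_SUFFIX}"

def is_placeholder(name):
    """True if `name` is a stand-in rather than a real English word."""
    return name.endswith(PLACEHOLDER_SUFFIX)

def fill_in_english(rows, placeholder, english):
    """Swap a placeholder in column 0 for the real English word."""
    for row in rows:
        if row[0] == placeholder:
            row[0] = english
            return True
    return False

# A French-to-Spanish translation added before any English exists.
rows = [[make_placeholder("fr", "une_fonction"), "une_fonction", "una_funcion"]]
# Later, someone supplies the English and the placeholder is cleaned up.
fill_in_english(rows, "fr_une_fonction_in_English", "a_function")
```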

> To finish with your design points, you introduce what I see as a
> severe limitation in your point 4:
>
> > 4-There is good UI support for creating a new translation for a
> > word. However, the assumed user model is that words will be
> > translated INTO a user's preferred language; FROM the context of an
> > importer module (you'd generally not add translations for a module
> > from that module itself, since generally you wouldn't even have
> > modules open whose preferred language is not your own); and
> > therefore WITH an explicit user decision as to which module this
> > translation belongs in (they want to use their language for
> > identifier X which is in English, well, they must have had a reason
> > to write it in English rather than their language so they
> > presumably know what imported module it comes from).
> What really made me jump is the notion that "you wouldn't even have
> modules open whose preferred language is not your own". Again, this
> assumes a single preferred language per module, which is something I
> would rather avoid, and I believe is not necessary if identifiers
> have a language identifier.


If you have a good solution for the problems I mention above, I'd be happy
to consider it. As it is, I'm not saying it's impossible, but my feeling as
I tried to code it initially was that if you try to make things too "magic",
eventually you're going to make the wrong assumption and you're going to
create some bugs in the code being edited that are really really hard to
track down. I'd rather keep things "simple" (my current version of the
translator module is over 200 LoC; by the time I add file IO for the
translation dictionaries and clean things up, I expect it will be around 300
plus docstrings) so that it is at least possible for a programmer who hasn't
studied the code to have an intuitive grasp of what's going on "under the
hood" in their translator.


> However, I suspect your mention of "from the context of an importer
> module" comes from the issues you encountered with memorizing the
> import structure. I would like to hear more about the problems you
> ran into there, because I believe it is necessary (for reasons to be
> detailed below.)


The "from the context of" is more a UI than a logical consideration. It is a
way to encourage people to translate where it's most natural to do so: INTO
languages they actually use, and WHEN it's actually useful to do so, and
also WHEN they would naturally be conscious of what file the translation is
native for. It would be possible to do translations in other contexts, but
the special UI sugar (context menus, tooltips, and the like) would encourage
you to do it in this context.

> Now, a few suggestions and pitfalls of my own:
>
> a) I believe there should be one translation file per language. More
> file pollution, less parsing.


My current mockup is going to use csv for its dictionary file format -
English in the first column and one language per additional column, in order
of addition. Internally it is just a 2d array - a list of lists. A "row" is
only as long as it has to be and can have any number of "empty" values as
long as at least one value is defined. The parsing on this file is, needless
to say, trivial. It would not be much harder if you converted it to XML,
with one word in one language per line, and you'd get more-fine-grained
diffs for free.
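
A minimal sketch of that layout, assuming a first row of language codes
(that header row, the function names, and the sample words are my
assumptions for illustration, not the mockup itself):

```python
import csv
import io

def load_dictionary(text):
    """Parse CSV text into a header of language codes and a list of rows."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[0], rows[1:]

def translate(header, rows, word, lang):
    """Return the `lang` translation of an English word, or None if absent."""
    col = header.index(lang)
    for row in rows:
        if row and row[0] == word:
            # Rows may be shorter than the header, and cells may be empty.
            if col < len(row) and row[col]:
                return row[col]
            return None
    return None

# English first, then one column per language in order of addition.
SAMPLE = "en,es,fr\nred,rojo,rouge\nnetwork,red\nprint,,afficher\n"
header, rows = load_dictionary(SAMPLE)
```

Here translate(header, rows, "print", "es") finds the row but returns None,
because that cell is one of the "empty" values.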

> I suspect that something akin to the getinfo file structure would be
> appropriate:
> package/module1.py
> would be translated in
> package/_t9n_/fr/module1.pyt
> package/_t9n_/fr/module1.pyto (object, like a .mo file)
> package/_t9n_/es/module1.pyt
> package/_t9n_/es/module1.pyto
> and so on.


OK, that is a valid point; it might be nice to follow getinfo format to
leverage existing tools. However, it is very useful in my model that once a
word exists in the dict for any language, it is noticeable from the
perspective of all languages, even if it has not been translated at all yet.
This is a benefit of a single multilanguage dict.
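
As a small illustration of that benefit (a sketch with made-up rows and a
hypothetical function name), a single table makes it easy to list the words
that still lack a translation in a given column:

```python
def untranslated(rows, col):
    """English words whose row has no entry in column `col`."""
    return [row[0] for row in rows if len(row) <= col or not row[col]]

# Column 0 is English, column 1 Spanish, column 2 French (sample data).
rows = [["red", "rojo", "rouge"], ["network", "red"], ["print", "", "afficher"]]

missing_es = untranslated(rows, 1)
missing_fr = untranslated(rows, 2)
```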

> b) A particularly fancy editor would color-code words in other
> languages instead of showing the _i18n_xx tag.
> Of course there would be a way to access online translation services
> to get suggestions (as has been suggested by many.)


Yes.

> c) Sci-fi scenario: any new translation suggestion by a child or
> educator should be made available to others using a distributed
> database system... (they are likely to work on common projects, and
> hence on common modules.)
> The children educators known to be knowledgeable about a given
> language pair should have a way to vet translations in that database.
> Oh, and let's send it to planet python so we have a basis to build
> the translation files to the standard library for very obscure
> languages ;-)
> (OK, that _is_ sci-fi. Still worth thinking about!)


I've been thinking about it. The problem, as you point out, is hygiene; you
need to build some whole new ratings/trust model (and UI) for conflicting
translations. Still, I think that if you make a clean design for one person,
adding this functionality later should actually not be that sci-fi/hard.
This would certainly give a "critical mass" to the concept!

> d) Back to earth: I said we really had to know the import
> structure... here is a slew of related problems:
>
> Suppose we are editing a module that is importing something from the
> core library:
>
> from moduleX import f1
> from moduleY import f2
> f1()
> f2()
>
> Now, suppose f1 and f2 both translate as "sigma" in the current
> editing language... Then, though the .py code is unambiguous, the
> translated on-screen code looks ambiguous; and worse, the un-
> translation process on save is not well-defined.
> The solution is to actually un-specify the imports in the source code:
>
> import moduleX
> import moduleY
> moduleX.sigma()
> moduleY.sigma()
>
> This refactoring should be possible in most cases, unless two top-
> level modules have similar translations. (say moduleX and moduleY
> both translate as "modula")
> This situation should be marked as an error; or alternately _display_
> the following:
>
> import moduleX__i18n_en_ as modula_1
> import moduleY__i18n_en_ as modula_2
> modula_1.sigma()
> modula_2.sigma()
>
> This is not an interpreter-level change, but a disguised display. (or
> rather a refactoring which can be memorized, and reverted by the
> untranslation machinery.)
> Note that display-only import disambiguation may also be necessary if
> the above code happens in a core library file (which we would never
> modify.)


I have already implemented display-only disambiguation (and reambiguation:
consider when one English word has different French translations in moduleX
and moduleY; you need to call it moduleX___voir___moduleY___ve. Sorry, my
French conjugations have been overloaded by my Spanish ones so that may be
wrong, but you get the idea). I don't see that on-disk refactoring is
necessary, though the UI should encourage and support you in manually
disambiguating your translations.
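
A rough sketch of the display-only side, assuming translations are keyed by
(module, identifier) and that a clash is shown by appending the module name
after "___" (the function names and that naming scheme are assumptions, not
what the mockup actually does):

```python
from collections import defaultdict

def display_names(translations):
    """Map (module, identifier) pairs to unique on-screen names.

    `translations` maps (module, identifier) -> translated word. When two
    pairs share a translation, the module name is appended for display only.
    """
    by_word = defaultdict(list)
    for key, word in translations.items():
        by_word[word].append(key)

    shown = {}
    for word, keys in by_word.items():
        for module, ident in keys:
            if len(keys) == 1:
                shown[(module, ident)] = word
            else:
                shown[(module, ident)] = f"{word}___{module}"
    return shown

def on_disk_name(shown, displayed):
    """Invert the display mapping to recover what gets saved."""
    inverse = {name: key for key, name in shown.items()}
    return inverse[displayed]

translations = {
    ("moduleX", "f1"): "sigma",
    ("moduleY", "f2"): "sigma",
    ("moduleX", "g"): "gamma",
}
shown = display_names(translations)
```

Because every displayed name is unique, the un-translation on save stays
well-defined and the file on disk never has to change.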

> In any case it is useful to flag as an error any translation that
> introduces ambiguity within the same namespace.


Again, UI warnings or refusals to create bad situations, but sensible
handling/resolution support when they occur anyway.

I have found one case where you actually need to do a manual disambiguation for
every case of an identifier in your file. Say you are importing an English
color_module with the translation red -> rojo. You're also importing an
UNTRANSLATED spanish_networking_module which uses the identifier red. Now
you want to add the translation network -> red to spanish_networking_module.
If you were working in Spanish, the editor could have warned you when you
typed "red" and helped you change it into "es___untranslated___red" (using
in-scope interpreter-based refactoring to save this as something other than
"red" on disk). If you were working in English, you could at least have
had color cues that would let you distinguish between "spanish red"s and
"english red"s as you coded. But if you had ignored the color cues or had
been working in a non-translation-aware editor, you now have to go through
the file and tell the editor whether each example of "red" is English or
Spanish.

This is the kind of mess which I want to handle right, of course, but also
to keep to a minimum. With my use model, I think this will be an extremely
rare case (and, as I said, there will be UI that will let an aware user fix
it ahead of time). I strongly suspect that if we abandon the per-file
preferred-language model, this kind of mess will be common.
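
The check an editor would need before accepting the network -> red
translation can be sketched like this (the data layout and function name are
assumptions for illustration): it reports any word that appears under
different languages in different imported modules.

```python
def find_cross_language_clashes(dictionaries):
    """Return words used under more than one language across modules.

    `dictionaries` maps module name -> list of {lang: word} entries.
    """
    seen = {}  # word -> set of (module, lang) pairs that use it
    for module, entries in dictionaries.items():
        for entry in entries:
            for lang, word in entry.items():
                seen.setdefault(word, set()).add((module, lang))

    clashes = {}
    for word, uses in seen.items():
        if len({lang for _, lang in uses}) > 1:
            clashes[word] = sorted(uses)
    return clashes

dictionaries = {
    "color_module": [{"en": "red", "es": "rojo"}],
    "spanish_networking_module": [{"en": "network", "es": "red"}],
}
clashes = find_cross_language_clashes(dictionaries)
```

With this sample data the only clash is "red", English in color_module and
Spanish in spanish_networking_module, which is exactly the case where each
occurrence in the file has to be resolved by hand.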

> Similar transformations may be made necessary by "from moduleX import
> *" syntax.
>
> None of this is simple, as I said; but alas probably necessary.
>
> e) Would we display numbers as the equivalent numerics in other
> writing systems?


I was unaware that there was any place in the world where first numeracy was
not in Arabic numerals. I know that there are numeral characters in Asian
languages, but I thought that math was taught in Arabic numerals even there,
just as people here in Guatemala learn base-20 Mayan numerals but don't
use them day-to-day. Even a native Mayan speaker who doesn't speak
Spanish will only speak in Mayan numerals up to at most 19; beyond that they
use Spanish with Mayanized phonetics.

Is this important and worth doing? Shouldn't be too hard if it is.
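
If it did turn out to be worth doing, a display-only digit mapping could
look roughly like this sketch. Arabic-Indic digits are used purely as an
example script, and the function names are hypothetical.

```python
# Arabic-Indic digits occupy U+0660 to U+0669, in the same order as 0-9.
ASCII = "0123456789"
ARABIC_INDIC = "".join(chr(0x0660 + d) for d in range(10))

TO_DISPLAY = str.maketrans(ASCII, ARABIC_INDIC)
TO_DISK = str.maketrans(ARABIC_INDIC, ASCII)

def show_number(literal):
    """Render a numeric literal with localized digits, for display only."""
    return literal.translate(TO_DISPLAY)

def save_number(displayed):
    """Convert displayed digits back to the ASCII form stored on disk."""
    return displayed.translate(TO_DISK)

shown = show_number("2007")
saved = save_number(shown)
```

Only the digit characters change, so a decimal point or sign in the literal
passes through untouched in both directions.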

> f) Docstrings... are another issue entirely. I still like my idea of
> a distributed database, so children puzzling out a foreign (to them)
> docstring with online help can put their minds together.


Absolutely. Let's say: docstrings are our strong motivation to get around to
the version-2-with-distributed-database.

> OK, I am giving more problems than solutions, here; and
> unfortunately, my spare time is otherwise quite occupied, so I doubt
> I can contribute to implementation; still, I hope that spelling some
> of these things out is useful to others. I'll try to keep my thinking
> cap on as this discussion evolves.
>
> Cheers,
> Marc-Antoine Parent
> http://maparent.ca/
>
> P.S. I _love_ your idea of arrows in the margin to indicate flow!
>
Thanks. I think it would be good too, but this is one I do not plan to
program myself.

Hasta luego,
Jameson