<div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> But first, a note: we obviously agree on the basics: The .py file should be readable without a special editor, and usable as such by the interpreter; translation should be a process after load and before save in the special editor (possibly with some state memory in between.) What we're arguing about is getting more and more marginal or implementation-related, and I believe that's a good thing. </blockquote><div> Agreed. </div>It seems that you're even favorable to my proposed compromise, so let's focus our work hashing it out. <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> > 3. Compromise: Do a good half job first, then optionally do the > whole job. > > When initially created, files have a "preferred" language and > things behave more-or-less as in 2. When somebody decides that the > "interface" for a file has a complete English translation, they can > convert that file into "English preferred", which puts language > tags on all identifiers not yet translated. This also creates an > "internal" section in that file's translation dictionary which is > not used by clients (until the user asks for it, and it's then > moved to the "public" section of the dictionary). An English- > preferred file can be edited in another language, and all new > identifiers are tagged with the UI language. Thus, English- > preferred files behave much as in option 1. I like your compromise a lot. I was myself coming to the notion of a "default language tag" that would allow to not see language tags strewn around in a normal editor. The only thing I would want to add to that spec is the possibility that someone with a language-aware editor set to German could edit a French-preferred file, even w/o English translations, by adding German language tags to new identifiers. 
So identifiers without tags are assumed to be French as it's the module's setting; new ones can still be tagged. What do you think? Or did you already have this in mind?</blockquote><div> Hadn't thought of it, but it's not too hard. </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> >> Related question: Do you expect translation to carry between modules? > > Yes, I expect it to carry across one level of importation. Not > across any module in the universe, oh God no. > >> Eg would translating file.close as fichier.fermer have an impact on >> anything named 'close' in any module (such as StringIO)? >> I admit I did not think so initially: translations were module- >> specific in my mind. But I am now reconsidering; my initial position >> certainly makes duck typing un-intelligible. But applying >> translations broadly may be an issue in some cases, and introduce >> more ambiguity. > > Again, this is the reason I want to start out NOT translating > module internals - overapplying half-assed internal translations. I am afraid the problem I raised has nothing to do with internals. "close" is very much part of the public interface. Let me reiterate: Suppose two modules m1 and m2 define classes c1 and c2 respectively, both with a method X in their public interface. m1 is translated, and states that X is iks in the current target language. m2 is untranslated. We are trying to display a file that goes like this: import m1, m2 def maFonction_i18n_fr(aParam):     aParam.X() EOF We do not, cannot know whether aParam refers to an m1.c1 or m2.c2 instance; possibly maFonction could receive both instances, so we cannot know whether X is translated or not. I am not sure how to handle it; how do you?</blockquote><div> Good job stating the case. 
Just one detail that I think you misstated: m1 states X is English, not the current target language, otherwise X would not be showing up in the file on disk (as you state the problem, at worst it would be ***m1lang___X*** which is NOT in the example I want). Let's call the third module, the one doing the importing, m3. As stated, our editor knows that X exists in m1, because it has an entry in the public dict. In my vision, the editor does not go willy-nilly scanning actual .py files; it just imports public dicts, so it may or may not know that m2 exports an X, depending on whether that's in m2.t7n. If it knows, things are fine. As soon as you put an identifier in the public dict of m2, that identifier is tagged on disk for safety. So we have m2lang___X, and no collision, though this can show up either as (X, m2lang___X), (en___X, X), or (en___X, m2lang___X) depending on m3lang. (And if m1 has a good translation, that would go in the first element of those pairs; if m1 gives the null translation X=X, you could even end up seeing the disambiguation (m1___X, m2___X). If you type X in that last case, it shows up bright red as an error and the file refuses to save until you fix it.) If it does NOT know, then we have a problem, and our goal should be to let an aware user notice it and fix it as soon as possible. Remember, for now this file works as intended, but if someone comes along and gives m2 a new English translation for X, everything will break. If m3lang == m2lang, the original m3 programmer should have seen either en___X or m1translation. If they type X, this will automatically be changed to m3lang___X on disk, so their code never ran, and so they catch the error and right-click on X and say 'add translation to file m2' and leave the English blank because they don't speak English. And everything works fine then because both files use m3lang___X on disk. If m3lang == en, then we have a tougher problem waking the programmer up. 
They would have to be moderately aware to notice that when they looked into m2 to find that the method was called X, everything was colored non-English and was in some funny language. So they should by instinct want to write m2lang___X if they mean the m2 one. And then when that doesn't work, they can easily fix the problem as above. Say they don't wake up. Then somebody else comes along and adds a translation to m2, saying that X is exported (either giving an English translation Y or leaving it as m2lang___X; let's call those both Y). The next time our English-speaking m3 programmer opens their file, our logic tries to add the new translation, notices the conflict, and complains. It turns all the Xes into WARNING___X or something, then tells the programmer to search for WARNING___X and change each one into either X or Y, depending on whether it's from m1 or m2. (If they are the one to add the new translation, all the better; if m3 is open when the translation is added, they get the warning right away.) That last paragraph is the only explicit coding we have to do for this case. The rest of it all happens naturally, as a consequence of how things work anyway.   </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> > >> a) I believe there should be one translation file per language. > >> More file pollution, less parsing.... > > Even if we don't have a scifi  live, wiki-style central translation > database, I think we can at least count on some sort of managed git > type deal. It would not be very hard to make a tool to merge in or > delete out languages from a given translation file, as long as > there was at least one canonical language for each file. So on my > computer, I only have the languages for my country, including > immigrant communities weighted by some compromise between the total > language size and its in-country size, plus English. 
I agree we should be able to select which languages we are storing. But we are exchanging information, so the information on all languages probably exists somewhere (even if only in distributed form, which is the most difficult case.) If the default file format mixes language information, it creates a burden of internal file management where there could be none. One file per language sidesteps a lot of manipulation.</blockquote><div> OK. Here are the issues: file pollution; parsing CPU burden; memory burden; file manipulation burden; related to that, human-readable and editable (with a text editor, or just with a spreadsheet?); and disk space. I think I could come up with a plan to optimize for any 3 of those attributes, maybe 4, but not for all 6. So what are the priorities? On OLPC, I'd say memory, human-readable, and disk space. Which, darn it, gives you the win here over all my 'clever' (ie homemade and nonstandard) data format designs. But one problem is that this does mean a real possibility that file versions will desynchronize, and that looks like a problem to me... do you have a plan? </div> And apparently you're seeing a totally different issue, because I just can't figure out the problem that the proposal quoted below is intended to remedy, or what you're talking about at all. For me, a giant matrix is fundamental, and we're just arguing about how to slice it up on disk; it appears you have a different idea. Remember, as in the above discussion, that even a row in that matrix with only one untranslated word in it can work to announce that that word is publicly exported by a given file... so how do we keep the rows in sync between subfiles? Also, this would mean that you need at least English AND the preferred language in each subfile. 
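<div> To make the matrix picture concrete, here is a minimal Python sketch, not a proposed file format. It assumes the matrix is held as a dict mapping each identifier to a per-language row; the function names split_by_language and missing_rows are made up for illustration. It slices the matrix into one table per language, then reports which rows each per-language table lacks, which is the out-of-sync worry raised above. </div>

```python
def split_by_language(matrix):
    """Slice {identifier: {lang: word}} into {lang: {identifier: word}}."""
    files = {}
    for ident, row in matrix.items():
        for lang, word in row.items():
            files.setdefault(lang, {})[ident] = word
    return files


def missing_rows(files):
    """For each language table, list identifiers that only other tables know."""
    all_idents = set()
    for table in files.values():
        all_idents.update(table)
    report = {}
    for lang, table in files.items():
        absent = sorted(all_idents - set(table))
        if absent:
            report[lang] = absent
    return report


# Hypothetical data: "iks" has a single-entry row, so it announces a public
# export in the French table only.
matrix = {
    "close": {"en": "close", "fr": "fermer"},
    "iks": {"fr": "iks"},
}
files = split_by_language(matrix)
report = missing_rows(files)
```

<div> With this data, the English table has no row for "iks", so a client that loads only the English subfile never learns that the identifier is exported. Under this assumed layout, that is the gap any per-language scheme has to close. </div>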
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> OK, here's where I'm at: Each translation is indexed by the hash of the identifier (we can even store it as binary.) Collisions may happen; in that case use the identifier in full. (Signal with a hash=0 index.) Should be rare enough to have a negligible impact on size. The difficult case is when an identifier is introduced (in a normal editor) that collides with a previously translated identifier. The solution is to maintain one file that contains all non-colliding identifiers with the hash they had before collision. (Since translations are only created by the language-aware editor, it knows to maintain that file.) New identifiers that collide with those will be detected by the new editor, and will allow to update translation files correctly. Note that read access to that file should be rare; only when a collision is actually detected in the module. (However, saving the module involves maintaining the hash file.) Ugh. Not that pretty either. But easily isolatable as a subsystem; and I am convinced that it beats merging data from disparate sources into a large matrix/paring down the matrix... (I'd be willing to write that if you want.) </blockquote></div>
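<div> One possible reading of the hash-indexed scheme quoted above, as a rough Python sketch under stated assumptions. The choice of CRC32, the class name TranslationFile, and the in-memory registry dict (standing in for the separate file of identifiers and their hashes) are all assumptions for illustration; the quoted text does not fix any of them. Hash 0 is kept free as the "stored in full" marker. </div>

```python
import zlib


def ident_hash(identifier):
    """Assumed hash: CRC32 of the UTF-8 bytes, with 0 kept reserved."""
    h = zlib.crc32(identifier.encode("utf-8"))
    return h or 1  # 0 marks entries stored under their full identifier


class TranslationFile:
    """Translations for one language, keyed by identifier hash."""

    def __init__(self, hash_fn=ident_hash):
        self.hash_fn = hash_fn
        self.by_hash = {}   # hash: translation (the common case)
        self.in_full = {}   # identifier: translation (the hash=0 bucket)
        self.registry = {}  # hash: first identifier seen with that hash

    def add(self, identifier, translation):
        h = self.hash_fn(identifier)
        owner = self.registry.get(h)
        if owner is None:
            # First identifier with this hash: index it by hash.
            self.registry[h] = identifier
            self.by_hash[h] = translation
        elif owner == identifier and h in self.by_hash:
            # Updating an existing, non-colliding entry.
            self.by_hash[h] = translation
        else:
            # Collision: both entries fall back to full identifiers.
            if h in self.by_hash:
                self.in_full[owner] = self.by_hash.pop(h)
            self.in_full[identifier] = translation

    def lookup(self, identifier):
        if identifier in self.in_full:
            return self.in_full[identifier]
        h = self.hash_fn(identifier)
        # The registry check keeps a new, colliding identifier from
        # picking up someone else's translation.
        if self.registry.get(h) == identifier:
            return self.by_hash.get(h)
        return None
```

<div> Passing a deliberately weak hash such as len makes "close" and "write" collide, which is an easy way to exercise the fallback path. The registry lookup is what handles the difficult case described in the quote: an identifier typed in a normal editor that happens to collide with an already-translated one. </div>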