[Systems] [IAEP] Do nice guys finish first?

Bernie Innocenti bernie at codewiz.org
Mon Dec 14 13:15:37 EST 2009


On Sun, 2009-12-13 at 14:06 +0100, Sean DALY wrote:
> David, I'm not sure what you mean. Do you feel some contributors
> aren't working together?

I'm afraid I may have something to do with it. Let me try to explain
what happened, with full background.

David and I happen to have different opinions on how to deal with the
increase of traffic on ASLO. I'm afraid our technical disagreement has
ended up eroding our personal relationship too.

David took the initiative by performing an incredible amount of research
and work to split the service into 3 separate components: one
load-balancing proxy, a web server and a database back-end.

By replicating the web server, we can make the cluster serve more
requests almost linearly, at least until we start to saturate the
database server and, much later, the caching proxy. This is a well-known
architecture, and David implemented it by the book. An impeccable job.

Inevitably, clustering also introduced a lot of fragility and complexity
that is costing us a lot in terms of maintenance and downtime. If any
one of these machines malfunctions, ASLO goes down. Merely multiplying
the number of VMs by 3 makes an outage roughly 3 times more likely, but
in practice it's much worse because of the complex interactions between
the machines. For example, restarting the whole cluster may fail
depending on the relative boot order of the nodes.
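
To put a rough number on the "3 times more likely" estimate, here is a
minimal sketch. It assumes failures are independent and that each VM is
up 99% of the time; both are assumptions made for illustration, not
measurements from our machines, and the function names are made up.

```python
# Sketch: availability of a chain of machines that must ALL be up.
# Assumptions (not measured): failures are independent and each VM
# is up 99% of the time.

def series_availability(per_node: float, nodes: int) -> float:
    """Probability that every node in the chain is up at the same time."""
    return per_node ** nodes


def outage_ratio(per_node: float, nodes: int) -> float:
    """How many times more likely an outage is than with a single node."""
    return (1.0 - series_availability(per_node, nodes)) / (1.0 - per_node)


if __name__ == "__main__":
    a = 0.99  # hypothetical per-VM availability
    print(f"single VM:    {series_availability(a, 1):.4f}")
    print(f"3-VM chain:   {series_availability(a, 3):.4f}")
    print(f"outage ratio: {outage_ratio(a, 3):.2f}x")
```

Under these assumptions the ratio comes out just under 3. The model
ignores correlated failures such as the boot-order problem, so it is
best read as a lower bound on the extra risk.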

There are also hidden performance costs and security risks due to
network communication, some of which have not yet been solved. For
example, backups used to start all at the same time, causing a storm of
I/O and CPU consumption. Last week we also experienced an unfortunate
hardware failure, which caused prolonged downtime because moving around
so many VMs and reconfiguring them involves a lot of work. Debugging
problems involves logging in on multiple machines at once and comparing
logs. It takes time.

For weeks, David has been diligently running benchmarks, tuning caches
and tweaking configurations, while I've probably been excessively
critical and even rude at times in expressing my skepticism. I
apologize for this. I actually admire David's work precisely because I
appreciate its complexity.

However, no matter how politely I put it, I can't hide my concern. If
we care about uptime, we need to make the DB and the proxy redundant
before going into production. This would increase the minimum number of
nodes to 6. We'd need to acquire at least one new machine to cope with
hardware failures.
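
The effect of doubling each tier can be sketched the same way. Again,
the 99% per-node availability and the independence of failures are
assumptions for illustration only, and the helper names are made up.

```python
# Sketch: three tiers (proxy, web, DB) in series, each tier made of
# interchangeable replicas. A tier is down only if all its replicas
# are down; the cluster is up only if every tier is up.
# Assumptions (not measured): independent failures, 99% per-node uptime.

def tier_availability(per_node: float, replicas: int) -> float:
    """Probability that at least one replica in the tier is up."""
    return 1.0 - (1.0 - per_node) ** replicas


def cluster_availability(per_node: float, replicas_per_tier: list[int]) -> float:
    """Probability that every tier has at least one replica up."""
    result = 1.0
    for replicas in replicas_per_tier:
        result *= tier_availability(per_node, replicas)
    return result


if __name__ == "__main__":
    a = 0.99  # hypothetical per-node availability
    for layout in ([1, 1, 1], [2, 2, 2]):
        total = sum(layout)
        print(f"{total} nodes {layout}: {cluster_availability(a, layout):.6f}")
```

With two replicas per tier the layout needs 2 + 2 + 2 = 6 nodes, which
is where the figure above comes from. The sketch says nothing about the
extra failover machinery that redundancy itself requires.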

David says I've been quicker to criticize his work than to propose
alternatives. Actually, I proposed a contingency plan:

Let's postpone the moment when we need to deal with so much complexity
by switching to faster hardware. ASLO has saturated sunjammer, an old
machine which is slower than my landlord's Thinkpad T61. Really! Any
cheap machine we could get today would buy us 1-2 years of time.

For example:

     2x Xeon E5520 (four 2.26GHz cores each, so 8 cores total)
     16GB 1066MHz RAM
     Rack rails
     Redundant power supply
     1x 250GB SATA 3.5" drive, room for 3 more
     Total price: $2856

We already have some $2000 in our IT budget, the very first money we're
going to spend on infrastructure. Since we're also looking to replace
the aging Solarsail, Ivan offered to contribute to the expense. David
and I are looking around for free hosting.

At this time, I barely have the time to keep all these VMs up and
running. Consolidating the current mess of disparate machines would
substantially reduce maintenance work, hopefully freeing us up for
higher-level work.

This is a "temporary" solution. If traffic keeps increasing at the
current rate, the day will come when we'll be forced to cluster ASLO and
other services. Hopefully by then we'll have a kick-ass IT department
with many full-time sysadmins who would keep it running.

Meanwhile, we can continue experimenting with the herd of virtual
machines, but I'm somewhat reluctant to switch them into production
again without the necessary safety net.

NOTE: I said *reluctant*, not *unwilling*. I ultimately believe in
empowering the doers, and since David has been doing all the work on
ASLO, he can choose to ignore my alternative proposal. I won't drag my
feet. I'll help David succeed any way I can, as I've been doing already.

I guess David and I could keep disagreeing without necessarily
compromising our ability to work together productively on fulfilling the
needs of Sugar's growth.

-- 
   // Bernie Innocenti - http://codewiz.org/
 \X/  Sugar Labs       - http://sugarlabs.org/


