[IAEP] [Systems] wiki down?

Bernie Innocenti bernie at codewiz.org
Sun May 31 14:20:22 EDT 2009


[cc += ivan]

On 05/31/09 15:37, Tomeu Vizoso wrote:
> On Sun, May 31, 2009 at 15:09, Seth Woodworth <seth at isforinsects.com> wrote:
>> No, I can reach the wiki just fine.  I clicked around to a few pages I
>> couldn't possibly have cached on my machine (user profiles, etc).
> 
> solarsail was down earlier today, Bernie rebooted it and is fine now.

I dug through the logs looking for the root cause of this outage, but I
couldn't find conclusive evidence.

It looks like an out-of-memory condition, where the machine was paging
out everything to make room for a runaway process.  But then one would
wonder why the OOM killer did not kick in to do its job.  cjb
miraculously managed to log in through ssh and issue an uptime command.
The reported load average was 66 (we have 32 processors, so that's only
about 2 per processor and not that incredible :-).
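
For the next time this happens, here is a minimal sketch of how one could
check whether the OOM killer ever fired.  It only filters log lines for
two phrases.  Both the phrases and the helper name find_oom_lines are my
assumptions; the exact kernel wording varies between versions, so treat
a clean result as a hint, not as proof.

```python
import re

# Assumption: kernel OOM reports contain one of these phrases.
# The exact wording differs between kernel versions.
OOM_PATTERN = re.compile(r"out of memory|oom-killer", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity.

    `lines` is any iterable of strings, e.g. an open log file.
    """
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

One would feed it an open kernel log (the path depends on the syslog
setup) and print whatever comes back.  An empty list would be consistent
with the OOM killer never having been invoked.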

Where can we put the blame?  I'd like to point fingers at trac, but I'm
not sure.  The high load seems to imply *many* runaway processes, but
trac runs as a single-threaded application, unless I'm mistaken.  The
previous time, I remember seeing plenty of Apache instances in ps.
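
If someone manages to log in during the next incident, a quick tally of
processes by command name would tell us who is multiplying.  The sketch
below is only an illustration: count_commands and top_commands are
hypothetical helper names, and it assumes a ps that accepts "-eo comm="
(one command name per line, no header).

```python
import subprocess
from collections import Counter


def count_commands(ps_output):
    """Count processes per command name.

    `ps_output` is text with one command name per line.
    """
    return Counter(
        line.strip() for line in ps_output.splitlines() if line.strip()
    )


def top_commands(n=5):
    """Run ps and return the n most common command names with counts."""
    result = subprocess.run(
        ["ps", "-eo", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_commands(result.stdout).most_common(n)
```

Calling top_commands() on a healthy machine first would give a baseline
to compare against.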

Anyway, these outages happen about once every 2-3 months, and a forced
reboot seems to fix them.  If we can't figure it out this time, we don't
necessarily have to start running around with our hair on fire.

WARNING: the current default kernel (vmlinuz-2.6.24-23-sparc64-smp)
hangs immediately at boot.  I had to pass linuxOLD to silo.  We're now
running 2.6.24-22-sparc64-smp.  This needs further investigation.

-- 
   // Bernie Innocenti - http://codewiz.org/
 \X/  Sugar Labs       - http://sugarlabs.org/

