[Systems] Sunjammer crash postmortem
Bernie Innocenti
bernie at sugarlabs.org
Thu Jan 28 18:43:48 EST 2016
Please follow up with your notes & corrections, but take off iaep@ and
sysadmin at gnu.org from Cc to avoid spamming them.
== Incident Timeline ==
[Thu Jan 28 05:03] OOM killer kicks in, killing a bunch of processes
[Thu Jan 28 05:29] OLE Nepal notifies sysadmin at sugarlabs.org and
bernie.codewiz at gmail.com of an outage.
[Thu Jan 28 07:42] OOM killer kicks in again
[Thu Jan 28 08:45] Scg notices the outage and pings me via Hangouts
[Thu Jan 28 09:30] I wake up and see scg's ping
[Thu Jan 28 09:47] I respond to OLE, cc'ing all other sysadmins
[Thu Jan 28 12:17] Quidam reboots sunjammer
== Root causes ==
Unknown OOM condition, likely caused by apache serving some query-of-death:
Jan 28 03:07:25 sunjammer kernel: [88262817.489410] apache2 invoked
oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
Jan 28 03:07:26 sunjammer kernel: [88262817.489428] apache2 cpuset=/
mems_allowed=0
[...]
Jan 28 03:09:52 sunjammer kernel: [88262818.691465] Out of memory: Kill
process 32000 (apache2) score 8 or sacrifice child
Jan 28 03:09:52 sunjammer kernel: [88262818.691473] Killed process 32000
(apache2) total-vm:571328kB, anon-rss:52460kB, file-rss:65036kB
[...keeps going on like this for hours...]
Jan 28 07:42:12 sunjammer kernel: [88279272.739371] apache2 invoked
oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 28 07:42:12 sunjammer kernel: [88279272.739390] apache2 cpuset=/
mems_allowed=0
Jan 28 07:42:12 sunjammer kernel: [88279272.739397] Pid: 4835, comm:
apache2 Tainted: G D 3.0.0-32-virtual #51~lucid1-Ubuntu
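
To see which processes the OOM killer picked and how much memory each kill freed, the "Killed process" entries can be tallied with a short script. This is a minimal sketch, not something that was run during the incident: the regular expression is based only on the log format quoted above, and the function name summarize_oom_kills is made up for illustration.

```python
import re
from collections import Counter

# Matches the "Killed process" entry in the format quoted in this mail.
# Other kernel versions may print different fields (assumption).
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+)\s+\((?P<comm>[^)]+)\)\s+"
    r"total-vm:(?P<total_vm>\d+)kB,\s+anon-rss:(?P<anon_rss>\d+)kB,\s+"
    r"file-rss:(?P<file_rss>\d+)kB"
)


def summarize_oom_kills(log_text):
    """Return (kills per command, summed anon-rss in kB per command)."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    kills = Counter()
    anon_kb = Counter()
    for m in KILLED_RE.finditer(flat):
        kills[m["comm"]] += 1
        anon_kb[m["comm"]] += int(m["anon_rss"])
    return kills, anon_kb


# Sample copied from the log excerpt above, including the mail line wrap.
SAMPLE = """\
Jan 28 03:09:52 sunjammer kernel: [88262818.691473] Killed process 32000
(apache2) total-vm:571328kB, anon-rss:52460kB, file-rss:65036kB
"""

if __name__ == "__main__":
    kills, anon_kb = summarize_oom_kills(SAMPLE)
    for comm, count in kills.most_common():
        print(f"{comm}: {count} kill(s), {anon_kb[comm]} kB anon-rss")
```

Run against the full kernel log, a count dominated by one command would support (or contradict) the apache query-of-death guess.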
== What went wrong ==
- The primary sysadmin contact sysadmin at sugarlabs.org was non-functional
- We couldn't contact the FSF sysadmins promptly
- Took us several hours to get the machine back online
- sunjammer was still up, but too unresponsive to ssh in
== What worked ==
- Scg noticed the outage quickly and responded
- OLE reached me via gmail -> develer.com forwarder (pure luck, I
usually don't check my personal email before leaving for work)
- sunjammer stayed up continuously for over 1000 days
- sunjammer still boots correctly... at least now we know :-)
- Communication between us kept working via side-channels
- The Linux OOM killer did its job ;-)
== Action Items ==
- Continue moving web services to Docker containers *WITH HARD MEMORY
BOUNDS*
- Ask FSF to (re-)enable XEN console for sunjammer
- Ask for FSF on-call contact
- (maybe) Move monitoring to a smaller container
- Publish phone/email emergency contacts that page core sysadmins
independent of all SL infrastructure.
- (maybe) Disable swap to prevent excessive I/O from slowing down
sunjammer to the point of timing out ssh connections
- Work with FSF sysadmins to figure out why I/O is so slow on sunjammer. A
simple "sync" can take several seconds even though there isn't much disk
activity.
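
For the first action item, one way to get hard memory bounds is to pass both --memory and --memory-swap to "docker run". The sketch below only builds the command line. The helper name docker_run_args, the image and container names, and the 512m limit are placeholders for illustration, not our actual configuration.

```python
import shlex


def docker_run_args(image, name, memory="512m", allow_swap=False):
    """Build a 'docker run' argv with a hard memory cap.

    The image, name and default limit are hypothetical placeholders.
    """
    args = ["docker", "run", "--detach", "--name", name, "--memory", memory]
    if not allow_swap:
        # --memory-swap is the combined memory+swap limit. Setting it equal
        # to --memory leaves the container no swap, so a runaway process is
        # OOM-killed inside its own container instead of thrashing the host.
        args += ["--memory-swap", memory]
    args.append(image)
    return args


if __name__ == "__main__":
    # Print the command for review rather than executing it.
    print(shlex.join(docker_run_args("example/wiki:latest", "wiki")))
```

Capping swap per container also addresses part of the swap concern above without disabling swap for the whole machine.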
--
Bernie Innocenti
Sugar Labs Infrastructure Team
http://wiki.sugarlabs.org/go/Infrastructure_Team