[IAEP] Sunjammer crash postmortem

Bernie Innocenti bernie at sugarlabs.org
Thu Jan 28 18:43:48 EST 2016


Please follow up with your notes & corrections, but take off iaep@ and
sysadmin at gnu.org from Cc to avoid spamming them.


== Incident Timeline ==

[Thu Jan 28 05:03] OOM killer kicks in, killing a bunch of processes
[Thu Jan 28 05:29] OLE Nepal notifies sysadmin at sugarlabs.org and
bernie.codewiz at gmail.com of an outage.
[Thu Jan 28 07:42] OOM killer kicks in again
[Thu Jan 28 08:45] Scg notices the outage and pings me via Hangouts
[Thu Jan 28 09:30] I wake up and see scg's ping
[Thu Jan 28 09:47] I respond to OLE, cc'ing all other sysadmins
[Thu Jan 28 12:17] Quidam reboots sunjammer


== Root causes ==

Unknown OOM condition, likely caused by apache serving some query-of-death:

Jan 28 03:07:25 sunjammer kernel: [88262817.489410] apache2 invoked
oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
Jan 28 03:07:26 sunjammer kernel: [88262817.489428] apache2 cpuset=/
mems_allowed=0

[...]

Jan 28 03:09:52 sunjammer kernel: [88262818.691465] Out of memory: Kill
process 32000 (apache2) score 8 or sacrifice child
Jan 28 03:09:52 sunjammer kernel: [88262818.691473] Killed process 32000
(apache2) total-vm:571328kB, anon-rss:52460kB, file-rss:65036kB

[...keeps going on like this for hours...]

Jan 28 07:42:12 sunjammer kernel: [88279272.739371] apache2 invoked
oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 28 07:42:12 sunjammer kernel: [88279272.739390] apache2 cpuset=/
mems_allowed=0
Jan 28 07:42:12 sunjammer kernel: [88279272.739397] Pid: 4835, comm:
apache2 Tainted: G      D     3.0.0-32-virtual #51~lucid1-Ubuntu
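
To check whether apache2 really accounts for most of the kills, the "Killed process" lines can be tallied per command name. What follows is only a minimal sketch, assuming the log format shown in the excerpt above; the function name summarize_oom_kills is hypothetical, and other kernel versions may print these lines differently.

```python
import re
from collections import Counter

# Matches lines such as:
#   Killed process 32000 (apache2) total-vm:571328kB, anon-rss:52460kB, file-rss:65036kB
# Assumption: this is the format printed by the 3.0 kernel quoted above.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+)\s+\((?P<comm>[^)]+)\)\s+"
    r"total-vm:(?P<total_vm>\d+)kB,\s+"
    r"anon-rss:(?P<anon_rss>\d+)kB,\s+"
    r"file-rss:(?P<file_rss>\d+)kB"
)


def summarize_oom_kills(log_text):
    """Return (kill count, resident kB freed) per command name."""
    # Mail clients wrap long log lines; collapsing all whitespace
    # lets the pattern match across those wraps.
    flat = " ".join(log_text.split())
    kills = Counter()
    rss_kb = Counter()
    for m in KILLED_RE.finditer(flat):
        kills[m["comm"]] += 1
        rss_kb[m["comm"]] += int(m["anon_rss"]) + int(m["file_rss"])
    return kills, rss_kb
```

Feeding it the kernel log (or the wrapped excerpt above) and printing kills.most_common() gives a quick view of which processes the OOM killer targeted.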


== What went wrong ==

- The primary sysadmin contact sysadmin at sugarlabs.org was non-functional
- We couldn't contact the FSF sysadmins promptly
- Took us several hours to get the machine back online
- sunjammer was still up, but too unresponsive to ssh in


== What worked ==

- Scg noticed the outage quickly and responded
- OLE reached me via gmail -> develer.com forwarder (pure luck, I
usually don't check my personal email before leaving for work)
- sunjammer stayed up continuously for over 1000 days
- sunjammer still boots correctly... at least now we know :-)
- Communication between us kept working via side-channels
- The Linux OOM killer did its job ;-)


== Action Items ==

- Continue moving web services to Docker containers *WITH HARD MEMORY
BOUNDS*
- Ask FSF to (re-)enable the Xen console for sunjammer
- Ask for FSF on-call contact
- (maybe) Move monitoring to a smaller container
- Publish phone/email emergency contacts that page core sysadmins
independent of all SL infrastructure.
- (maybe) Disable swap to prevent excessive I/O from slowing down
sunjammer to the point of timing out ssh connections
- Work with FSF sysadmins to figure out why I/O is so slow on sunjammer. A
simple "sync" can take several seconds even though there isn't much disk
activity.
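
As an illustration of the "hard memory bounds" item, here is a minimal sketch that builds a docker run command line with a memory cap. The helper name docker_run_args, the image name and the 512m default are all hypothetical; only the --memory and --memory-swap flags are standard Docker options. Setting --memory-swap to the same value as --memory keeps the container from using swap, so a runaway service gets killed inside its own container instead of dragging down the whole host.

```python
import subprocess


def docker_run_args(image, name, memory="512m"):
    """Build a `docker run` argv that enforces a hard memory limit.

    Hypothetical helper: the default limit is a placeholder, not a
    value taken from the sunjammer setup.
    """
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--memory", memory,       # hard cap on container RAM
        "--memory-swap", memory,  # same value as --memory: no swap for the container
        image,
    ]


if __name__ == "__main__":
    # "example/web" is a placeholder image name.
    subprocess.run(docker_run_args("example/web", "web"), check=True)
```

The right limit per service would have to come from measuring actual usage first.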

-- 
Bernie Innocenti
Sugar Labs Infrastructure Team
http://wiki.sugarlabs.org/go/Infrastructure_Team

