[Systems] Sunjammer crash postmortem
Samuel Cantero
scanterog at gmail.com
Thu Jan 28 20:32:16 EST 2016
I will add some comments:
On Thu, Jan 28, 2016 at 8:43 PM, Bernie Innocenti <bernie at sugarlabs.org>
wrote:
> Please follow up with your notes & corrections, but take off iaep@ and
> sysadmin at gnu.org from Cc to avoid spamming them.
>
>
> == Incident Timeline ==
>
> [Thu Jan 28 05:03] OOM killer kicks in, killing a bunch of processes
> [Thu Jan 28 05:29] OLE Nepal notifies sysadmin at sugarlabs.org and
> bernie.codewiz at gmail.com of an outage.
> [Thu Jan 28 07:42] OOM killer kicks in again
> [Thu Jan 28 08:45] Scg notices the outage and pings me via Hangouts
> [Thu Jan 28 09:30] I wake up and see scg's ping
> [Thu Jan 28 09:47] I respond to OLE, cc'ing all other sysadmins
> [Thu Jan 28 12:17] Quidam reboots sunjammer
>
>
> == Root causes ==
>
> Unknown OOM condition, likely caused by apache serving some query-of-death:
>
Apache wasn't the only process invoking the oom-killer; I also found opendkim,
spamc, mb, uwsgi, and smtpd.
The first incident was at Jan 28 03:07:25. We usually have a lot of memory
available on sunjammer. Munin stopped plotting at 02:40, and memory was low,
as expected. I can only imagine some kind of unmanaged over-commitment
(over-provisioning) in the Xen Dom0.
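For reference, this is roughly how I tallied which processes invoked the
oom-killer (a sketch; the log path is an assumption for sunjammer's Ubuntu
setup, the same messages may also be in /var/log/syslog):

    # Count oom-killer invocations per process name in the kernel log
    grep 'invoked oom-killer' /var/log/kern.log \
      | sed -E 's/.*\] ([^ ]+) invoked oom-killer.*/\1/' \
      | sort | uniq -c | sort -rn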
> Jan 28 03:07:25 sunjammer kernel: [88262817.489410] apache2 invoked
> oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
> Jan 28 03:07:26 sunjammer kernel: [88262817.489428] apache2 cpuset=/
> mems_allowed=0
>
> [...]
>
> Jan 28 03:09:52 sunjammer kernel: [88262818.691465] Out of memory: Kill
> process 32000 (apache2) score 8 or sacrifice child
> Jan 28 03:09:52 sunjammer kernel: [88262818.691473] Killed process 32000
> (apache2) total-vm:571328kB, anon-rss:52460kB, file-rss:65036kB
>
> [...keeps going on like this for hours...]
>
> Jan 28 07:42:12 sunjammer kernel: [88279272.739371] apache2 invoked
> oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
> Jan 28 07:42:12 sunjammer kernel: [88279272.739390] apache2 cpuset=/
> mems_allowed=0
> Jan 28 07:42:12 sunjammer kernel: [88279272.739397] Pid: 4835, comm:
> apache2 Tainted: G D 3.0.0-32-virtual #51~lucid1-Ubuntu
>
>
> == What went wrong ==
>
> - The primary sysadmin contact sysadmin at sugarlabs.org was non-functional
> - We couldn't contact the FSF sysadmins promptly
> - Took us several hours to get the machine back online
> - sunjammer was still up, but too unresponsive to ssh in
>
>
> == What worked ==
>
> - Scg noticed the outage quickly and responded
> - OLE reached me via gmail -> develer.com forwarder (pure luck, I
> usually don't check my personal email before leaving for work)
> - sunjammer stayed up continuously for over 1000 days
> - sunjammer still boots correctly... at least now we know :-)
> - Communication between us kept working via side-channels
> - The Linux OOM killer did its job ;-)
>
>
> == Action Items ==
>
> - Continue moving web services to Docker containers *WITH HARD MEMORY
> BOUNDS*
> - Ask FSF to (re-)enable XEN console for sunjammer
> - Ask for FSF on-call contact
> - (maybe) Move monitoring to a smaller container
> - Publish phone/email emergency contacts that page core sysadmins
> independent of all SL infrastructure.
> - (maybe) Disable swap to prevent excessive I/O from slowing down
> sunjammer to the point of timing out ssh connections
> - Work with FSF sysadmins to figure out why I/O is so slow on sunjammer. A
> simple "sync" can take several seconds even though there isn't much disk
> activity.
>
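Regarding "Docker containers with hard memory bounds", a minimal sketch of
what I have in mind (the image name, container name and the 256m limit are
just placeholders, not one of our actual services):

    # Hard-cap the container at 256 MB of RAM; setting --memory-swap to the
    # same value means the container gets no additional swap.
    docker run -d --name webapp \
      --memory=256m --memory-swap=256m \
      --restart=unless-stopped \
      some-image:latest

With limits like these, a runaway service gets OOM-killed inside its own
cgroup instead of taking down the whole box, which would also pair well with
the idea of reducing swap usage on the host.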
Regarding disk I/O:
iostat shows:
- An average of 32 tps (IOPS) on the first partition (/root). iostat -x shows
an average latency (await) of 126 ms; 25% of the operations are reads and 75%
are writes. Munin shows an average latency of 145 ms, since we're running the
diskstats plugin.
- An average of 26 tps on the third partition (/srv). iostat -x shows an
average latency of 16.5 ms; 81% of the operations are reads and 19% are
writes. Munin shows an average latency of 14.5 ms.
sar -dp -f /var/log/sysstat/sa[day] shows (for some days):
- Jan 27:
  - An avg of 26 tps (IOPS) on the first partition (/root). An avg latency
    of 126 ms.
  - An avg of 11 tps on the third partition (/srv). An avg latency of 29 ms.
- Jan 26:
  - An avg of 27 tps (IOPS) on the first partition (/root). An avg latency
    of 126 ms.
  - An avg of 11 tps on the third partition (/srv). An avg latency of 29 ms.
I can check these averages for the other days as well.
As we can see, latency is high on the first partition (where the databases
reside). Taking into account that our VM is competing for disk I/O on an old
disk subsystem, it is likely that those ~37 IOPS already account for a big
share of the total IOPS the subsystem can deliver.
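In case anyone wants to reproduce these numbers, these are essentially the
commands behind them (the 60-second interval is arbitrary, and assuming the
partitions show up as xvda1/xvda3 as is typical on a Xen guest):

    # Extended per-device stats (tps, await, read/write split), one sample per minute
    iostat -x 60
    # Historical per-partition stats for Jan 27 from the sysstat archives
    sar -dp -f /var/log/sysstat/sa27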
>
> --
> Bernie Innocenti
> Sugar Labs Infrastructure Team
> http://wiki.sugarlabs.org/go/Infrastructure_Team
> _______________________________________________
> Systems mailing list
> Systems at lists.sugarlabs.org
> http://lists.sugarlabs.org/listinfo/systems
>