<div dir="ltr"><div>I will add some comments: </div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jan 28, 2016 at 8:43 PM, Bernie Innocenti <span dir="ltr">&lt;<a href="mailto:bernie@sugarlabs.org" target="_blank">bernie@sugarlabs.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Please follow up with your notes &amp; corrections, but take off iaep@ and<br> <a href="mailto:sysadmin@gnu.org" target="_blank">sysadmin@gnu.org</a> from Cc to avoid spamming them.<br> <br> <br> == Incident Timeline ==<br> <br> [Thu Jan 28 05:03] OOM killer kicks in, killing a bunch of processes<br> [Thu Jan 28 05:29] OLE Nepal notifies <a href="mailto:sysadmin@sugarlabs.org" target="_blank">sysadmin@sugarlabs.org</a> and<br> <a href="mailto:bernie.codewiz@gmail.com" target="_blank">bernie.codewiz@gmail.com</a> of an outage.<br> [Thu Jan 28 07:42] OOM killer kicks in again<br> [Thu Jan 28 08:45] Scg notices the outage and pings me via Hangouts<br> [Thu Jan 28 09:30] I wake up and see scg's ping<br> [Thu Jan 28 09:47] I respond to OLE, cc'ing all other sysadmins<br> [Thu Jan 28 12:17] Quidam reboots sunjammer<br> <br> <br> == Root causes ==<br> <br> Unknown OOM condition, likely caused by apache serving some query-of-death:<br></blockquote><div><br></div>Apache wasn't the only process that invoked the oom-killer. I also found opendkim, spamc, mb, uwsgi, and smtpd doing so.<div><br></div><div>The first OOM event was logged at Jan 28 03:07:25. Usually we have a lot of memory available on sunjammer. Munin stopped plotting at 02:40, and memory usage up to that point was low, as expected. 
I can only imagine some kind of unmanaged over-commitment (over-provisioning) in the Xen Dom0.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> <br> Jan 28 03:07:25 sunjammer kernel: [88262817.489410] apache2 invoked<br> oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0<br> Jan 28 03:07:26 sunjammer kernel: [88262817.489428] apache2 cpuset=/<br> mems_allowed=0<br> <br> [...]<br> <br> Jan 28 03:09:52 sunjammer kernel: [88262818.691465] Out of memory: Kill<br> process 32000 (apache2) score 8 or sacrifice child<br> Jan 28 03:09:52 sunjammer kernel: [88262818.691473] Killed process 32000<br> (apache2) total-vm:571328kB, anon-rss:52460kB, file-rss:65036kB<br> <br> [...keeps going on like this for hours...]<br> <br> Jan 28 07:42:12 sunjammer kernel: [88279272.739371] apache2 invoked<br> oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0<br> Jan 28 07:42:12 sunjammer kernel: [88279272.739390] apache2 cpuset=/<br> mems_allowed=0<br> Jan 28 07:42:12 sunjammer kernel: [88279272.739397] Pid: 4835, comm:<br> apache2 Tainted: G D 3.0.0-32-virtual #51~lucid1-Ubuntu<br> <br> <br> == What went wrong ==<br> <br> - The primary sysadmin contact <a href="mailto:sysadmin@sugarlabs.org" target="_blank">sysadmin@sugarlabs.org</a> was non-functional<br> - We couldn't contact the FSF sysadmins promptly<br> - Took us several hours to get the machine back online<br> - sunjammer was still up, but too unresponsive to ssh in<br> <br> <br> == What worked ==<br> <br> - Scg noticed the outage quickly and responded<br> - OLE reached me via gmail -&gt; <a href="http://develer.com" rel="noreferrer" target="_blank">develer.com</a> forwarder (pure luck, I<br> usually don't check my personal email before leaving for work)<br> - sunjammer stayed up continuously for over 1000 days<br> - sunjammer still boots correctly... 
at least now we know :-)<br> - Communication between us kept working via side-channels<br> - The Linux OOM killer did its job ;-)<br> <br> <br> == Action Items ==<br> <br> - Continue moving web services to Docker containers *WITH HARD MEMORY<br> BOUNDS*<br> - Ask FSF to (re-)enable XEN console for sunjammer<br> - Ask for FSF on-call contact<br> - (maybe) Move monitoring to a smaller container<br> - Publish phone/email emergency contacts that page core sysadmins<br> independent of all SL infrastructure.<br> - (maybe) Disable swap to prevent excessive I/O from slowing down<br> sunjammer to the point of timing out ssh connections<br> - Work with FSF sysadmins to figure out why I/O is so slow on sunjammer. A<br> simple "sync" can take several seconds even though there isn't much disk<br> activity.<br></blockquote><div><br></div><div><div>Regarding disk I/O:</div><div><br></div><div><font face="monospace, monospace">iostat</font> shows:</div><div><ul><li>An average of 32 tps (IOPS) on the first partition (/root). <font face="monospace, monospace">iostat -x</font><font face="arial, helvetica, sans-serif"> shows an average latency (await) of 126 ms. 25% of the operations are reads and 75% are writes. Munin shows an average latency of 145 ms over the period since we started running the diskstats plugin.</font></li><li><font face="arial, helvetica, sans-serif">An average of 26 tps on the third partition (/srv). </font><span style="font-family:monospace,monospace">iostat -x</span><font face="arial, helvetica, sans-serif"> shows an average latency of 16.5 ms. </font><span style="font-family:arial,helvetica,sans-serif">81% of the operations are reads and 19% are writes. Munin shows an average latency of 14.5 ms.</span></li></ul><div><font face="monospace, monospace">sar -dp -f /var/log/sysstat/sa[day]</font> shows (for some days):</div></div><div><ul><li>Jan 27:</li><ul><li>An avg of 26 tps (IOPS) on the first partition (/root). 
An avg latency of 126 ms.<br></li><li><span style="font-family:arial,helvetica,sans-serif">An avg of 11 tps on the third partition (/srv). An avg latency of 29 ms.</span><br></li></ul><li><font face="arial, helvetica, sans-serif">Jan 26:</font></li><ul><li>An avg of 27 tps (IOPS) on the first partition (/root). An avg latency of 126 ms.<br></li><li><span style="font-family:arial,helvetica,sans-serif">An avg of 11 tps on the third partition (/srv). An avg latency of 29 ms.</span></li></ul></ul><div><font face="arial, helvetica, sans-serif">I can check these averages for other days as well.</font></div></div></div><div><font face="arial, helvetica, sans-serif"><br></font></div><div><font face="arial, helvetica, sans-serif">As we can see, latency on the first partition (where the databases reside) is high. Taking into account that our VM competes for disk I/O on an old disk subsystem, it is likely that our roughly 37 IOPS is already a large share of the maximum IOPS available to us.</font></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> <span><font color="#888888"><br> --<br> Bernie Innocenti<br> Sugar Labs Infrastructure Team<br> <a href="http://wiki.sugarlabs.org/go/Infrastructure_Team" rel="noreferrer" target="_blank">http://wiki.sugarlabs.org/go/Infrastructure_Team</a><br> _______________________________________________<br> Systems mailing list<br> <a href="mailto:Systems@lists.sugarlabs.org" target="_blank">Systems@lists.sugarlabs.org</a><br> <a href="http://lists.sugarlabs.org/listinfo/systems" rel="noreferrer" target="_blank">http://lists.sugarlabs.org/listinfo/systems</a><br> </font></span></blockquote></div><br></div></div>
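<div>To check which processes invoked the oom-killer and how often, something like the following could be run over the kernel log. This is only a rough sketch: it assumes the messages keep the syslog format shown in the excerpt quoted above, and the names <font face="monospace, monospace">count_oom_invokers</font> and <font face="monospace, monospace">sample</font> are made up for illustration. The sample lines are copied from that excerpt, re-joined onto single lines.</div>
<pre>
```python
import re
from collections import Counter

# Matches syslog kernel lines of the form:
#   "... kernel: [88262817.489410] apache2 invoked oom-killer: gfp_mask=..."
# The single capture group is the name of the process that hit the OOM path.
INVOKED = re.compile(r"kernel: \[\s*[\d.]+\]\s+(\S+) invoked oom-killer")


def count_oom_invokers(lines):
    """Count how many times each process name invoked the OOM killer."""
    counts = Counter()
    for line in lines:
        match = INVOKED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Lines taken from the log excerpt in this thread (wrapped lines re-joined).
sample = [
    "Jan 28 03:07:25 sunjammer kernel: [88262817.489410] apache2 invoked "
    "oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0",
    "Jan 28 03:09:52 sunjammer kernel: [88262818.691473] Killed process 32000 "
    "(apache2) total-vm:571328kB, anon-rss:52460kB, file-rss:65036kB",
    "Jan 28 07:42:12 sunjammer kernel: [88279272.739371] apache2 invoked "
    "oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0",
]

for name, hits in count_oom_invokers(sample).most_common():
    print(name, hits)
```
</pre>
<div>On the real log, the same function could be fed an open file object instead of the sample list; the path of the kernel log on sunjammer is not stated in this thread, so it is left out here.</div>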