[Systems] Sunjammer crash postmortem

Bernie Innocenti bernie at sugarlabs.org
Thu Jan 28 22:33:35 EST 2016


On 01/28/2016 08:32 PM, Samuel Cantero wrote:
> Apache wasn't the only process invoking the oom-killer. I also found
> opendkim, spamc, mb, uwsgi, and smtpd.
> 
> The first incident was on Jan 28 at 03:07:25. Usually we have a lot of
> memory available on sunjammer. Munin stopped plotting at 02:40, and
> memory usage was low at that point, as expected. I can only imagine
> some kind of unmanaged over-commitment (over-provisioning) in the Xen
> Dom0.

I don't think the Dom0 can steal RAM from the domU. The domU's
allocation is fixed, unless you use those weird virtio balloon devices.

I'm not really sure what allocated all the memory, but it had to be
something internal. The kernel dumped a list of processes and their
memory usage at each oom iteration, but none of them is particularly big.

The real memory usage of apache is very hard to estimate, because it
forks plenty of children, each with big VSS and RSS figures. However,
most of the pages should be shared, so they don't add up.
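
One way to get a less misleading number is to sum PSS (proportional set
size) instead of RSS: the Pss field in /proc/<pid>/smaps charges each
shared page to a process in proportion to how many processes map it. A
minimal Python sketch, assuming a kernel new enough to export Pss and
permission to read the files; the function names are made up for
illustration:

```python
import re

# Match only the per-mapping "Rss:" and "Pss:" lines (not SwapPss etc.).
RSS_LINE = re.compile(r"^Rss:\s+(\d+)\s+kB", re.MULTILINE)
PSS_LINE = re.compile(r"^Pss:\s+(\d+)\s+kB", re.MULTILINE)


def smaps_totals(smaps_text):
    """Return (rss_kb, pss_kb) summed over every mapping in one smaps dump."""
    rss = sum(int(kb) for kb in RSS_LINE.findall(smaps_text))
    pss = sum(int(kb) for kb in PSS_LINE.findall(smaps_text))
    return rss, pss


def read_proc_smaps(pid):
    """Read /proc/<pid>/smaps; usually needs root for other users' processes."""
    with open("/proc/%d/smaps" % pid) as f:
        return f.read()


def group_totals(pids, read_smaps=read_proc_smaps):
    """Sum RSS and PSS (in kB) over a group of processes."""
    total_rss = total_pss = 0
    for pid in pids:
        rss, pss = smaps_totals(read_smaps(pid))
        total_rss += rss
        total_pss += pss
    return total_rss, total_pss
```

Calling group_totals() with the PIDs of all the apache children gives an
RSS sum that counts shared pages once per child, and a PSS sum that
counts them roughly once overall. The gap between the two is the part
that "doesn't add up".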


> Regarding disk I/O:
> 
> iostat shows:
> 
>   * An average of 32 tps (IOPS) in the first partition (/root). iostat
>     -x shows an average latency (await) of 126 ms. 25% are read
>     operations and 75% are write operations. Munin shows an average
>     latency of 145 ms since we started running the diskstats plugin.
>   * An average of 26 tps in the third partition (/srv). iostat -x shows
>     an average latency of 16.5 ms. 81% are read operations and 19% are
>     write operations. Munin shows an average latency of 14.5 ms.
> 
> sar -dp -f /var/log/sysstat/sa[day] shows (for some days):
> 
>   * Jan 27:
>       o An avg of 26 tps (IOPS) in the first partition (/root). An avg
>         latency of 126 ms.
>       o An avg of 11 tps in the third partition (/srv). An avg latency
>         of 29 ms.
> 
>   * Jan 26:
>       o An avg of 27 tps (IOPS) in the first partition (/root). An avg
>         latency of 126 ms.
>       o An avg of 11 tps in the third partition (/srv). An avg latency
>         of 29 ms.
> 
> I can check these averages for the other days.
> 
> As we can see, we have high latency on the first partition (where the
> databases reside). Taking into account that our VM is competing for
> disk I/O on an old disk subsystem, it is likely that 37 IOPS is a big
> part of the maximum IOPS available to us.

Great analysis. Ruben and I upgraded the kernel to 3.0.0, which is still
ancient, but at least better than what we had before. We also disabled
write barriers, since they might not play well with the dom0, which is
also running a very old kernel.

Let's see if this brings down the damn latency.
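
To compare before and after, the tps and await figures can be
approximated from two samples of /proc/diskstats, roughly the way
iostat -x derives them: completed I/Os divided by the interval, and
milliseconds spent on I/O divided by completed I/Os. A rough Python
sketch; the function names are mine, and it ignores counter wraparound:

```python
def parse_diskstats(text):
    """Map device name -> (reads, ms_reading, writes, ms_writing).

    Expects the /proc/diskstats layout: major, minor, name, then the
    counters; reads completed is the 1st counter, ms spent reading the
    4th, writes completed the 5th, ms spent writing the 8th.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or short lines
        stats[fields[2]] = (
            int(fields[3]),   # reads completed
            int(fields[6]),   # ms spent reading
            int(fields[7]),   # writes completed
            int(fields[10]),  # ms spent writing
        )
    return stats


def tps_and_await(before, after, interval_s):
    """Return {device: (tps, await_ms)} from two parsed samples."""
    result = {}
    for dev, (r1, rms1, w1, wms1) in after.items():
        if dev not in before:
            continue
        r0, rms0, w0, wms0 = before[dev]
        ios = (r1 - r0) + (w1 - w0)
        busy_ms = (rms1 - rms0) + (wms1 - wms0)
        await_ms = busy_ms / float(ios) if ios else 0.0
        result[dev] = (ios / float(interval_s), await_ms)
    return result
```

Taking one sample, sleeping for a fixed interval, taking a second one
and passing both through tps_and_await() gives numbers that should be
comparable to the iostat and sar averages quoted above.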

-- 
Bernie Innocenti
Sugar Labs Infrastructure Team
http://wiki.sugarlabs.org/go/Infrastructure_Team

