[Systems] Migration to SN testing node urgent

Aleksey Lim alsroot at sugarlabs.org
Mon Oct 20 16:12:56 EDT 2014


On Wed, Oct 15, 2014 at 11:10:57PM -0700, Bernie Innocenti wrote:
> On 15/10/14 22:06, Sebastian Silva wrote:
> > Very well then,
> > This is a good solution.
> > Thanks Alsroot.
> > 
> > Dear systems@ and Bernie,
> > We need to use another IP address for the Sugar Network. Is this
> > possible, could you please indicate which one? I also would like to
> > request access to signing the DNS records for node.sugarlabs.org or
> > assistance in this step, from the following procedure outlined by Aleksey.
> > 
> > Thanks in advance for your help.
> 
> We do have spare IPs, but first I'd like to understand why Apache is
> tipping over using a single IP and would work better with 2 IPs.
> 
> I assume you don't have a problem of too many idle connections lingering
> around, because a single IP can take tens of thousands. So it's probably
> Apache rejecting connections when you hit some configurable limit
> (MaxClients, ServerLimit, etc) which are meant to protect the server
> from DoS and overload conditions.
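
For reference, the limits in question live in the MPM configuration; a
minimal prefork sketch, with placeholder values rather than whatever is
actually deployed on the box:

    # mpm_prefork settings -- illustrative numbers only, not the live config
    <IfModule mpm_prefork_module>
        StartServers           5
        MinSpareServers        5
        MaxSpareServers       10
        ServerLimit          256    # hard ceiling for MaxClients
        MaxClients           256    # max simultaneous worker processes
        MaxRequestsPerChild 2000    # recycle children to bound memory growth
    </IfModule>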
> 
> If the limits are set too low, we can just increase them, but bypassing
> them altogether would be unwise. If, for example, at peak time we
> receive 1000 simultaneous connections, but the server has enough memory
> only to handle 800 connections, the system will start thrashing and
> OOMing, causing *all* users to be permanently unable to connect until
> the processes are restarted. Under some conditions, the kernel might
> even kill some vital process and require a manual reboot.
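
A back-of-the-envelope way to pick such a limit is to divide the memory
you can spare by the per-worker footprint; all numbers below are made
up for illustration:

    # Rough sizing of MaxClients from memory -- every figure is a placeholder.
    ram_for_apache_mb = 2048   # RAM we are willing to give to Apache workers
    avg_worker_rss_mb = 25     # typical per-child RSS, e.g. from 'ps -ylC apache2'
    safety_margin     = 0.8    # leave headroom for the OS, databases, etc.

    max_clients = int(ram_for_apache_mb * safety_margin / avg_worker_rss_mb)
    print(max_clients)         # ~65 with these example numbers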
> 
> A more scientific approach for tuning things would be:
> 
>  - Set up good graphs for memory usage, CPU usage, number of active
> connections, number of 500 errors served, etc. This can be done with Munin.
> 
>  - Send test traffic until the system overloads. Ideally we'd do this in
> a test environment without disrupting real traffic, but that's a bit
> complicated.
> 
>  - See which resource is topping out: Is it memory? Is it disk I/O?
> 
>  - What's the maximum QPS (queries per second) you can get (see the
> sketch after this list)? Is it plenty more than what you get at peak
> time? If so, you're done.
> 
>  - If the QPS is not sufficient, provision the VM with more resources as
> needed. If you can't, consider sharding the service on multiple machines.
> 
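For the "send test traffic" and QPS steps, a minimal sketch along these
lines is enough to get a first number (URL and concurrency are
placeholders; ab or siege would do the same job):

    # Minimal load-test sketch: N concurrent clients hitting one URL,
    # reporting achieved QPS and error count. Illustrative only.
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://node.sugarlabs.org/"   # adjust before running
    REQUESTS = 500
    CONCURRENCY = 50

    def fetch(_):
        try:
            with urllib.request.urlopen(URL, timeout=30) as resp:
                return resp.getcode()
        except Exception:
            return None                  # refused, timed out, 5xx, ...

    start = time.time()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(fetch, range(REQUESTS)))
    elapsed = time.time() - start

    ok = sum(1 for code in results if code == 200)
    print("QPS: %.1f  ok: %d  failed: %d"
          % (REQUESTS / elapsed, ok, REQUESTS - ok))
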
> Remember not to leave the limits disabled after the load test; otherwise
> the first spike of traffic will cripple your server.
> 
> Again, adding IPs is possible, but before doing so try figuring out
> what's causing the outage. I'm available on IRC to help debug this.
> 
> Also, resist the temptation of putting application servers written in
> Python and Ruby directly on the front line. They also speak HTTP, but
> typically they're insufficiently protected against various kinds of
> attacks, they have bad support for SSL, and they're very slow at serving
> static files. Plus, you'd lose Apache's logging and monitoring
> features, which can help with debugging.

Sorry for the long delay. The problem is that the clients currently in
the field and the production server suffer from a bad design decision:
clients open long-living connections. Last time I tuned
MaxClients/ServerLimit/ProxyTimeout and it seemed to work. Since Apache
has become unresponsive again, the idea was to run the SN node (or a
separate Apache) on a separate IP so that it would not affect the
connection pool of the other Apache sites. But since Gitorious (the
most visited service in the past) is no longer relevant after all git
projects moved to other hosting, I guess it is fine to experiment with
the current Apache.
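
For reference, keeping Apache in front of the SN node would look
roughly like the fragment below; path, port and timeout are only
examples, not the live configuration:

    # vhost fragment: Apache stays in front for SSL, logging and static
    # files, and proxies application requests to the SN node.
    ProxyPass        /static !                       # serve static content directly
    ProxyPass        /        http://127.0.0.1:8000/ retry=0
    ProxyPassReverse /        http://127.0.0.1:8000/
    ProxyTimeout     600                             # room for long-living requests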

-- 
Aleksey

