[Systems] Migration to SN testing node urgent

Bernie Innocenti bernie at codewiz.org
Thu Oct 16 02:10:57 EDT 2014


On 15/10/14 22:06, Sebastian Silva wrote:
> Very well then,
> This is a good solution.
> Thanks Alsroot.
> 
> Dear systems@ and Bernie,
> We need to use another IP address for the Sugar Network. Is this
> possible, could you please indicate which one? I also would like to
> request access to signing the DNS records for node.sugarlabs.org or
> assistance in this step, from the following procedure outlined by Aleksey.
> 
> Thanks in advance for your help.

We do have spare IPs, but first I'd like to understand why Apache is
tipping over using a single IP and would work better with 2 IPs.

I assume you don't have a problem of too many idle connections lingering
around, because a single IP can take tens of thousands. So it's probably
Apache rejecting connections when you hit one of its configurable limits
(MaxClients, ServerLimit, etc.), which are meant to protect the server
from DoS and overload conditions.

If the limits are set too low, we can just increase them, but bypassing
them altogether would be unwise. If, for example, at peak time we
receive 1000 simultaneous connections, but the server has enough memory
only to handle 800 connections, the system will start thrashing and
OOMing, leaving *all* users unable to connect until
the processes are restarted. Under some conditions, the kernel might
even kill some vital process and require a manual reboot.
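
To put rough numbers on that, here is a minimal Python sketch of the
sizing arithmetic. The memory figures are made up for illustration, not
measured on jita, and max_safe_workers is a hypothetical helper, not an
Apache tool:

```python
def max_safe_workers(total_mb, reserved_mb, per_worker_mb):
    """Largest number of worker processes that fits in RAM without swapping.

    total_mb:      physical memory of the machine
    reserved_mb:   memory set aside for the kernel, database, caches, etc.
    per_worker_mb: typical resident size of one worker process
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable_mb = total_mb - reserved_mb
    # Integer division: a partially fitting worker does not count.
    return max(usable_mb // per_worker_mb, 0)


# Assumed figures: 4 GB VM, 1 GB reserved, 30 MB per Apache worker.
print(max_safe_workers(4096, 1024, 30))
```

A limit such as MaxClients would then be set at or below the result,
using a per-worker size measured on the real server rather than guessed.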

A more scientific approach for tuning things would be:

 - Set up good graphs for memory usage, CPU usage, number of active
connections, number of 500 errors served, etc. This can be done with Munin.

 - Send test traffic until the system overloads. Ideally we'd do this in
a test environment without disrupting real traffic, but that's a bit
complicated.

 - See which resource is topping out: Is it memory? Is it disk I/O?

 - What's the maximum QPS (queries per second) you can get? Is it
plenty more than what you get at peak time? If so, you're done.

 - If the QPS is not sufficient, provision the VM with more resources as
needed. If you can't, consider sharding the service on multiple machines.
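
The load-test and QPS steps above can be sketched in a few lines of
Python. This is only an illustration under assumptions: measure_qps and
fake_request are hypothetical names, and fake_request stands in for a
real HTTP call so the sketch runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_qps(send_request, total_requests, concurrency):
    """Call send_request() total_requests times from `concurrency` threads.

    send_request should return an HTTP status code, or raise on failure.
    Returns (queries_per_second, error_count).
    """
    def one_call(_):
        try:
            return send_request()
        except Exception:
            return None  # treat failed connections as errors

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        statuses = list(pool.map(one_call, range(total_requests)))
    elapsed = time.perf_counter() - start

    # Count failures and 5xx responses as errors.
    errors = sum(1 for s in statuses if s is None or s >= 500)
    qps = total_requests / elapsed if elapsed > 0 else float("inf")
    return qps, errors


def fake_request():
    # Stand-in for a real request: pretend the server takes 1 ms.
    time.sleep(0.001)
    return 200


if __name__ == "__main__":
    qps, errors = measure_qps(fake_request, total_requests=200, concurrency=20)
    print(f"{qps:.0f} requests/s, {errors} errors")
```

For a real test, fake_request would be replaced by something like
urllib.request.urlopen(url).status pointed at a test instance; urlopen
raises on 4xx/5xx responses, which this sketch also counts as errors.
Raising the concurrency step by step while watching the graphs shows
where the QPS stops growing and which resource runs out first.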

Remember not to leave the limits disabled after the load test. Doing so
would just cripple your server on the first spike of traffic.

Again, adding IPs is possible, but before doing so try figuring out
what's causing the outage. I'm available on IRC to help debug this.

Also, resist the temptation of putting application servers written in
Python and Ruby directly on the front line. They also speak HTTP, but
typically they're insufficiently protected against various kinds of
attacks, they have bad support for SSL, and they're very slow at serving
static files. Plus, you'd lose Apache's logging and monitoring
features, which can help with debugging.


> --
> Sebastian Silva
> "icarito" #sugar #somosazucar (freenode IRC)
> Somos Azúcar - Fuente Libre - Sugar Labs
> 
> "We democratic teachers intervene in the world through the cultivation of curiosity" - P.Freire
> 
> 
> On Wed, Oct 15, 2014 at 11:26 PM, Aleksey Lim
> <alsroot at sugarlabs.org> wrote:
>> On Wed, Oct 15, 2014 at 11:34:01AM -0500, Sebastian Silva wrote:
>>
>>     Alsroot,
>>
>>     Greetings,
>>
>>     We're observing downtime about twice a day now in production
>>     instance of Sugar Network central node. Every time I have to log
>>     into jita and issue:
>>
>>         sudo /etc/init.d/sugar-network stop
>>         ps -o pid,comm,user,thcount -u www-data | wc -l
>>         # ^^ is useful to give an idea of traffic
>>         # goes down to ~12 when all SN threads die after a few seconds
>>         sudo /etc/init.d/sugar-network start
>>
>>     It's probably all the traffic, but also it seems to have gotten
>>     worse after the downtime jita had some time ago. It's stressful
>>     for editors/admins and annoying for users.
>>
>>     As I understand it, new node implementation does not have this
>>     problem. I can do the migration myself, if you provide me with
>>     some details:
>>
>>     * procedure for migrating database
>>     * current release tarballs/sources for putting in production
>>
>>     I think it's even better if I do it, then I will have a better
>>     sense of how the clockwork ticks. I'll attempt to document as I
>>     go. Maybe we can setup some uptime monitoring this time around
>>     (cc: systems@ for this purpose).
>>
>>     It would be helpful to coincide when you are online on this task.
>>     For me a good time would be starting Friday 17th at 21:00
>>     (UTC -5 / Bogota) - but I'm open to accommodate to your
>>     timezone/schedule/convenience. This way we can test over the
>>     weekend and have a working service by Monday 20th. Let me know so
>>     we can announce the planned maintenance downtime.
>>
>>     We've gotten this far and have engaged some active users. I think
>>     there is a bright future for Sugar Network. We just need to keep
>>     rowing.
>>
>>     Thanks for your commitment.
>>
>> The issue is not w/ SN node in particular but with Apache connections
>> pool, next time restart Apache. Last time we decided to not migrate
>> node.sl.o to intermediate code base release. So, if the only issue is
>> unavailable node, it could be started on separate IP out of Apache. If
>> I got it right, it is possible to grant jita new external IP, so, we
>> need to ask Bernie. Then,
>>
>> 1. node.sugarlabs.org DNS should be re-pointed to the new IP
>>
>> 2. /srv/sugar-network/.config/sugar-network/config should be tuned:
>>
>>     [node]
>>     host = <NEWIP>
>>     port = 80
>>
>> 3. /etc/apache2/sites-enabled/node.sugarlabs.org should be tuned:
>>
>>     <VirtualHost *:80>
>>         ServerName node.sugarlabs.org
>>         ProxyPass / http://<NEWIP>:80/
>>         ProxyPassReverse / http://<NEWIP>:80/
>>     </VirtualHost>
>>
>> 4. sugar-network-node restarted and Apache reloaded.
>>
>> --
>> Aleksey


-- 
 _ // Bernie Innocenti
 \X/  http://codewiz.org

