[IAEP] [Systems] ALERT JUSTICE DOWN [24 hours without response] (!)

Stefan Unterhauser stefan at unterhauser.name
Thu Jun 18 19:18:47 EDT 2015


On Thu, Jun 18, 2015 at 3:14 PM, Samuel Greenfeld <samuel at greenfeld.org>
wrote:

> Unless the hardware is newer than I think it is, it likely is quite old.
>

less than 3 years


> OLPC's hardware in the Media Lab kept flaking out to the point most (all?)
> of it was eventually virtualized.
>
> How much would it cost to look into getting new hardware and/or using
> someone's virtualization platform?
>
> Sugar seems to change their setup a bit more than OLPC, so it may be worth
> investigating a scenario where resources could be spun up on demand.
>
>
> On Thu, Jun 18, 2015 at 3:09 PM, Bernie Innocenti <bernie at sugarlabs.org>
> wrote:
>
>> On 06/18/2015 03:01 PM, Gonzalo Odiard wrote:
>> > Any chance to check if disks are dying or if there is another reason for
>> > these instabilities?
>>
>> Nothing odd from smartctl, and anyway the server would keep responding
>> to pings even if both disks in the raid array were dead.
>>

+1 +2 +3 +...

on the console the system completely freezes



>> So I'm thinking it's either a kernel bug, or unstable hardware.
>>
>>
>> > Gonzalo
>> >
>> > On Thu, Jun 18, 2015 at 3:56 PM, Bernie Innocenti <bernie at sugarlabs.org>
>> > wrote:
>> >
>> >     +systems@
>>
>
thanks bernie


>> >
>> >     I rebooted justice from the management console and it's now responding
>> >     to pings.
>> >
>> >     I couldn't view the screen capture and I had no time to go to the Media
>> >     Lab to physically inspect the machine, so I don't understand the
>> >     root cause.
>> >
>> >     As reported by Dogi, Justice seems to crash every 1-2 months.
>>
>
more like every ~3 months

>> >     I suggest we try the following steps:
>> >
>> >     1. upgrade justice to Ubuntu 14.04 (like we did with freedom 1yr ago)
>> >
>>
+1, especially since justice, compared to freedom, has a long history of being
upgraded (it has its roots in the housetree server build, which means it has
been running since 2009 ... freedom got a fresh install ~2012/13)

this is why, after that, we should consider a completely new install of
justice, since my guess is that it is a software issue (justice always
lasts 2+ months)

why I think it is not a hardware issue: this crashing has been going on
for the last 2 years (I did 95% of all reboots) and started with
our last system upgrade (something got upgraded to unstable)



>> >     2. if crashes continue, go to the server room and swap the drives with
>> >     freedom (which is our hot-swap server and doesn't currently run anything
>> >     critical)
>> >
>> >     3. Ask again the ML to give one of us physical access to the server
>> >     room. I work nearby, but I have trouble leaving during office hours on a
>> >     personal errand and if anything happens over a week-end we're in
>> >     trouble.
>> >
>> >     Sebastian: you should at least get access to the management console.
>> >     Ping me on IRC and I'll send you the credentials on a secure channel.
>> >
>> >
>> >     On 06/18/2015 10:40 AM, Sebastian Silva wrote:
>> >     > Hello Sugar Oversight Board, Sugar Labs Members,
>> >     >
>> >     > Our main production server virtual machine host is down and I can't
>> >     > reach it.
>> >     > We have several systems that depend on this infrastructure, including
>> >     > pootle server which was actively being used by translators of Aymara and
>> >     > Awajun native languages.
>> >     >
>> >     > I respectfully request that you call on the phone whoever has physical
>> >     > access to this machine and we try to bring it back online. I think this
>> >     > should be either Bernie Innocenti or Stefan Unterhauser.
>>
>
my phonenumber is 617 767 2668 ... just call


>> >     >
>> >     > Also, I would like to request that more volunteers from the infrastructure
>> >     > team have virtual terminal access to these machines (not just ssh),
>> >     > or that we put them in a proper colocation service where we can get some
>> >     > support.
>> >     >
>> >     > Thanks in advance for your help.
>> >     > Sebastian
>> >     >
>> >     > On 17/06/15 20:55, Sebastian Silva wrote:
>> >     >> Affected services:
>> >     >> translate.sugarlabs.org <http://translate.sugarlabs.org>
>> >     >> git.sugarlabs.org <http://git.sugarlabs.org>
>> >     >> packages.sugarlabs.org <http://packages.sugarlabs.org>
>> >     >>
>> >     >>
>> >     >>
>> >     >> On 17/06/15 20:48, Sebastian Silva wrote:
>> >     >>> We can't reach it.
>> >     >>>
>> >     >>> Anybody with physical access to the machine please respond.
>> >     >>>
>> >     >>>
>> >     >>> Regards,
>> >     >>> Sebastian
>> >
>> >     --
>> >     Bernie Innocenti
>> >     Sugar Labs Infrastructure Team
>> >     http://wiki.sugarlabs.org/go/Infrastructure_Team
>> >     _______________________________________________
>> >     IAEP -- It's An Education Project (not a laptop project!)
>> >     IAEP at lists.sugarlabs.org <mailto:IAEP at lists.sugarlabs.org>
>> >     http://lists.sugarlabs.org/listinfo/iaep
>> >
>> >
>> >
>> >
>> > --
>> > Gonzalo Odiard
>> >
>> > SugarLabs - Software for children learning
>>
>
>
> _______________________________________________
> Systems mailing list
> Systems at lists.sugarlabs.org
> http://lists.sugarlabs.org/listinfo/systems
>
>