[Systems] Disk array issues on cloud9.fsf.org affecting sunjammer

Bernie Innocenti bernie at codewiz.org
Fri May 22 11:21:00 EDT 2020


Thank you, Ruben, for making this happen smoothly, and also for the 
detailed report.

On 20/05/2020 10.52, Ruben wrote:
> As a follow-up, I was able to migrate the VM into a new host (the
> gnuhope server cluster, there are 3 hosts with live migration, currently
> running at kvmhost3.fsf.org). Some notes on the migration:
> 
> * I was able to copy over all of the data, with the exception of
> /root/develgrep_err.txt which was partially unreadable due to IO errors.
> 
> * The new host runs KVM instead of XEN. No grub configuration is
> necessary, KVM will boot whatever kernel and initrd are linked as
> /vmlinuz and /initrd.img
> 
> * The new storage volumes are /dev/sda for / and /dev/sdb for /srv
> 
> * I expanded the filesystems a bit; they are at 75% usage.
> 
> * The storage is self-encrypted using LUKS. This should work
> transparently for you; the only requirement is to keep cryptsetup and
> dmsetup installed. You can ask me to expand on this if you are curious.

I'm curious: how does the system read /boot/keyscript.sh before / is 
mounted?

My guess: it's copied into the initrd, which is somehow made available 
to KVM outside the VM.
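
If it's the standard Debian initramfs-tools setup, the keyscript gets 
embedded into the initramfs by the cryptsetup hook, so it's easy to 
check from inside the guest (the crypttab volume name below is just a 
placeholder):

    # look for the embedded keyscript inside the initramfs image
    lsinitramfs /initrd.img | grep -i keyscript

    # the /etc/crypttab entry that causes the embedding looks like:
    # sda_crypt  UUID=...  none  luks,keyscript=/boot/keyscript.sh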

And this raises another question: are the keys inaccessible to someone 
who steals the entire drive?


> * I enabled quotas but I did no further work to recreate the original
> limits. You can find the quota information in /root/repquota.
> If you need more information collected from the original filesystem, it
> will be accessible to FSF sysadmins for some time, but we do plan to
> decommission the hardware soon, so please check it before then.

A quick 'repquota -a' shows that the old quotas are still active. They 
were carried over in the regular file /srv/aquota.user.

Perhaps the system launched quotacheck on its first boot.
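
In case the quota files ever need to be rebuilt by hand, the usual 
sequence would be something like this:

    # rebuild the user/group quota files without remounting read-only
    quotacheck -avugm

    # re-enable quotas and confirm the old limits survived
    quotaon -av
    repquota -a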


> * The same applies to ACLs, I made no attempts to migrate them.

That would have been simple: just add -A to rsync.

I don't remember what we were using ACLs for, but there might have been 
something requiring them in /srv/upload. Bah, I guess we can just wait 
for someone to complain that uploads are broken :-)
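
If the old filesystem is still reachable on the FSF side, the ACLs 
could be re-synced after the fact; a sketch, with "oldhost" standing in 
for wherever the original disk is mounted:

    # -A preserves ACLs, -X preserves extended attributes
    rsync -aAXv oldhost:/srv/upload/ /srv/upload/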


> * The storage is backed by 3x replicated Ceph, with 20 Gbps links. It
> should be noticeably faster than the previous hardware. If the machine
> is used for IO-intensive operations like package building or CI, we
> should consider setting an IO limit on the VM, so it does not cause
> trouble for other services.

These days sunjammer runs some web services, email, mailing lists, and 
a secondary DNS. Nothing particularly disk-intensive.


> * I removed two filesystem "optimization" flags that were preventing the
> machine from booting: data=writeback,barrier=0
> If those flags are actually needed, please contact me to look into it;
> the machine may not reboot on its own if that is changed. If those flags
> were applied to improve build performance, I recommend using libeatmydata
> for build processes instead of global filesystem flags.

Yeah, that was a desperate attempt to fix the pathetic I/O performance 
of the old server. Without those, running sync could block for 20 
seconds or more.
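
For reference, the libeatmydata approach confines the risk to a single 
process tree instead of the whole filesystem; a typical invocation (the 
build command is just an example):

    # eatmydata LD_PRELOADs a shim that turns fsync() and friends
    # into no-ops for this command and its children only
    eatmydata dpkg-buildpackage -us -uc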


> * We do not have any backup set on this system at the FSF.

I double-checked, and backups are still running regularly from 
freedom.sugarlabs.org.


> * There was no firewall running in the previous incarnation of the
> server; it is advisable that you review this and install a firewall if
> needed. It is strongly recommended to configure sshd to allow key
> authentication only, and reject password entry attempts.

sshd already disallows password authentication (PasswordAuthentication no).
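
For completeness, the relevant lines from /etc/ssh/sshd_config, plus a 
way to confirm what sshd actually applies (keyboard-interactive is 
worth checking too, since it can also accept passwords):

    # /etc/ssh/sshd_config
    PasswordAuthentication no
    ChallengeResponseAuthentication no

    # dump the effective configuration and check both settings
    sshd -T | grep -i -e passwordauth -e challengeresponse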


> * We do not do performance or alert monitoring for this server.

There's some rudimentary monitoring and paging in place:

http://munin.sugarlabs.org/sugarlabs.org/sunjammer.sugarlabs.org/index.html

Since the munin server lives on sunjammer, I didn't get paged for the 
downtime caused by the migration :-)


> Please review the previous points, and any other issues that may have
> been caused by the migration; if you find anything, please send one
> email per distinct problem to sysadmin at gnu.org to create individual
> tickets for FSF sysadmins.
> 
> Also note that we moved the VM to the current host cluster because it
> was the quickest available option and the disks were faulty. We have not
> discussed yet whether that host is appropriate for the long run, and we
> may decide to migrate it to a different setup at a later date. If that
> is the case we will coordinate that with SugarLabs.
> 
> Cheers,

-- 
_ // Bernie Innocenti
  \X/  https://codewiz.org/

