[Systems] Disk array issues on cloud9.fsf.org affecting sunjammer
Bernie Innocenti
bernie at codewiz.org
Fri May 22 11:21:00 EDT 2020
Thank you, Ruben, for making this happen smoothly, and for the
detailed report.
On 20/05/2020 10.52, Ruben wrote:
> As a follow-up, I was able to migrate the VM into a new host (the
> gnuhope server cluster, there are 3 hosts with live migration, currently
> running at kvmhost3.fsf.org). Some notes on the migration:
>
> * I was able to copy over all of the data, with the exception of
> /root/develgrep_err.txt which was partially unreadable due to IO errors.
>
> * The new host runs KVM instead of Xen. No grub configuration is
> necessary; KVM will boot whatever kernel and initrd are linked as
> /vmlinuz and /initrd.img.
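For reference, on a Debian-style guest those are just symlinks into
/boot that the kernel packages refresh on upgrade; a quick sanity check:

    $ ls -l /vmlinuz /initrd.img   # should point at the newest kernel/initrd in /boot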
>
> * The new storage volumes are /dev/sda for / and /dev/sdb for /srv
>
> * I expanded the filesystems a bit; they are at 75% usage. Please
>
> * The storage is self-encrypted using LUKS. This should be transparent
> to you; the only requirement is to keep cryptsetup and dmsetup
> installed. You can ask me to expand on this if you are curious.
I'm curious: how does the system read /boot/keyscript.sh before / is
mounted?
My guess: it's copied into the initrd, which is somehow available to KVM
outside the VM.
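For what it's worth, that matches how Debian-style systems handle it: a
keyscript referenced in /etc/crypttab gets copied into the initramfs
whenever it is rebuilt, so it is readable before / is mounted. A minimal
sketch, assuming the stock cryptsetup initramfs hooks (the crypttab
entry is illustrative):

    # /etc/crypttab:
    # sda_crypt  /dev/sda  none  luks,keyscript=/boot/keyscript.sh
    $ update-initramfs -u
    $ lsinitramfs /initrd.img | grep keyscript   # confirm it was embedded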
That raises another question: are the keys inaccessible to someone who
steals the entire drive?
> * I enabled quotas but I did no further work to recreate the original
> limits. You can find the quota information in /root/repquota
> If you need more information collected from the original filesystem, it
> will be accessible to FSF sysadmins for some time, but we do plan to
> decommission the hardware soon, so please verify everything you need
> before then.
A quick 'repquota -a' shows that the old quotas are still active. They
were carried over in the regular file /srv/aquota.user.
Perhaps the system launched quotacheck on its first boot.
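For future reference, the commands involved (illustrative; quotacheck is
only needed if the quota files ever get out of sync):

    $ repquota -a              # report limits and usage on all quota mounts
    $ quotacheck -vugm /srv    # rebuild aquota.user/aquota.group on /srv
    $ quotaon -v /srv          # (re)enable enforcement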
> * The same applies to ACLs, I made no attempts to migrate them.
That would have been simple: just add -A to rsync.
I don't remember what we were using ACLs for, but there might have been
something requiring them in /srv/upload. Bah, I guess we can just wait
for someone to complain that uploads are broken :-)
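If we ever need to redo the copy, a minimal sketch, with a hypothetical
destination host; -A preserves ACLs and -X the extended attributes:

    $ rsync -aAX /srv/ root@newhost:/srv/
    # or capture and replay just the ACLs:
    $ getfacl -R /srv/upload > upload-acls.txt
    $ setfacl --restore=upload-acls.txt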
> * The storage is backed by 3x replicated Ceph, with 20 Gbps links. It
> should be noticeably faster than the previous hardware. If the machine
> is used for IO-intensive operations like package building or CI, we
> should consider setting an IO limit on the VM, so as not to cause
> trouble for other services.
These days sunjammer runs some web services, email, mailing lists and a
secondary DNS. Nothing particularly disk-intensive.
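If that ever changes, libvirt can throttle a single guest's disk; an
illustrative invocation, assuming the cluster uses libvirt (the domain
and device names are hypothetical):

    $ virsh blkdeviotune sunjammer vda --total-iops-sec 500 --live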
> * I removed two filesystem "optimization" flags that were preventing the
> machine from booting: data=writeback,barrier=0
> If those flags are actually needed, please contact me to look into it;
> the machine may not reboot on its own if that is changed. If those flags
> were applied to improve build performance, I recommend using
> libeatmydata for build processes instead of global filesystem flags.
Yeah, that was a desperate attempt to fix the pathetic I/O performance
of the old server. Without those, running sync could block for 20
seconds or more.
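For the record, libeatmydata achieves much the same effect per process,
by turning fsync() and friends into no-ops only for the command it
wraps; a minimal sketch:

    $ eatmydata make -j4   # this build skips sync calls; the rest of the system stays safe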
> * We do not have any backups set up for this system at the FSF.
I double-checked, and backups are still running regularly from
freedom.sugarlabs.org.
> * There was no firewall running in the previous incarnation of the
> server; it is advisable that you review this and install a firewall if
> needed. It is strongly recommended to configure sshd to allow key
> authentication only, and reject password entry attempts.
sshd already disallows password authentication (PasswordAuthentication no).
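Anyone who wants to double-check the effective settings can ask sshd
directly:

    $ sshd -T | grep -i passwordauthentication
    passwordauthentication no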
> * We do not do performance or alert monitoring for this server.
There's some rudimentary monitoring and paging in place:
http://munin.sugarlabs.org/sugarlabs.org/sunjammer.sugarlabs.org/index.html
Since the munin server lives on sunjammer, I didn't get paged for the
downtime caused by the migration :-)
> Please review the previous points, along with any other issues that may
> have been caused by the migration; if you find anything, please send one
> email per distinct problem to sysadmin at gnu.org to create individual
> tickets for FSF sysadmins.
>
> Also note that we moved the VM to the current host cluster because it
> was the quickest available option and the disks were faulty. We have not
> discussed yet whether that host is appropriate for the long run, and we
> may decide to migrate it to a different setup at a later date. If that
> is the case we will coordinate that with SugarLabs.
>
> Cheers,
--
_ // Bernie Innocenti
\X/ https://codewiz.org/