[Systems] [mirror discuss] Re: mirrorbrain for sugar labs

David Farning dfarning at sugarlabs.org
Sun Sep 27 20:07:41 EDT 2009


On Fri, Sep 25, 2009 at 6:14 AM, Peter Pöml <poeml at cmdline.net> wrote:
> Hi!
>
> On 25.09.2009, at 02:40, David Farning wrote:
>
>> Peter asked me to continue a private thread on this mailing list.  Also
>> CCing Matthew Zeier from mozilla infrastructure.  I was looking for
>> Mozilla's solution and he pointed me in the direction of mirrorbrain.
>>
>> On Thu, Sep 24, 2009 at 5:25 PM, Peter Pöml <poeml at cmdline.net> wrote:
>>>
>>> Hi David!
>>>
>>> thank you for writing. Interesting to learn about Sugar. It sounds
>>> exciting!
>>>
>>> Would you mind me resending my reply with the MirrorBrain mailing list
>>> Cc'ed, and continue discussion there? I think it would be great material
>>> for
>>> the list and could provide insight to others. It's also great to see some
>>> activity there :-) (If not, no problem at all.)
>>
>> done
>
> Thank you. (I actually had the other mailing list in mind, mirrorbrain@, but
> it doesn't matter much, the same people are subscribed there - the discuss
> list was meant more for discussion of mirror issues regardless of
> MirrorBrain; I should have said that. But I guess it doesn't matter that
> much!)
>
>>> On 24.09.2009, at 22:33, David Farning wrote:
>>>>
>>>> I am looking at using mirrorbrain as the CDN for wki.sugarlabs.org .
>>>> We are still pretty small: we generally have 200G per day but peak at
>>>> 32000G per day during releases.
>>>
>>> That's not nothing ;) I'd say it is an amount where a carefully set up
>>> infrastructure with mirrors makes sense. Also, it sounds like there would
>>> be
>>> a lot of users that one wants to keep happy, and who would benefit from
>>> every improvement. And from looking
>>
>> We are currently running our infrastructure from the FSF's colocation
>> facility.  So I include keeping our generous host happy pretty high on
>> the list.
>
>
> Yes, understandably.
>
>
>>>> On a normal day the majority of our traffic comes from
>>>> activities.sugarlabs.org . a.sl.o is based off of mozilla's amo so
>>>> anything we do here helps them.
>>>
>>> I see, http://activities.sugarlabs.org/ is very similar to
>>> https://addons.mozilla.org/, and it offers download links to lots of .xo
>>> files, and redirects to
>>> http://download.sugarlabs.org/sources/activities/ from where the files
>>> are
>>> downloaded.
>>> For now, I only note the redirection to d.sl.o, and no further
>>> redirection
>>> from there.
>>>
>>> I also see other downloads, like
>>> http://wiki.sugarlabs.org/go/Sugar_on_a_Stick which links to some
>>> mirrors.
>>>
>>>> We have a small collection of mirrors that help us during releases.
>>>> But, the user must manually choose between mirrors. Agggg.
>>>
>>> Okay, so from what I would guess at this stage is that d.sl.o could
>>> redirect
>>> to the mirrors, instead of delivering the .xo files all by itself;
>>> correct?
>>> That would be exactly where MirrorBrain could step in.
>>
>>
>> Yes, the two main pieces are the sugar on a stick images and the .xo
>> files.
>
>
> Okay.
>
>
>>>> My questions are:
>>>> 1. Is it worth it to use mirrorbrain at this stage?  Particularly
>>>> around releases.
>>>
>>> Yes, definitely, the only thing to keep in mind is that deploying it
>>> costs
>>> time, but I would think that it is worth the effort. If you have very few
>>> mirrors, it can be the life-saver for the releases -- and if you
>>> gradually
>>> get more mirrors, it will improve the service quality for the end users
>>> because they can usually be routed to a better mirror.
>>
>> Yes,  this is particularly important because many of our large
>> deployments are in remote regions.  Something like 80% of our .xo
>> traffic is from Uruguay.
>
>
> I see.
>
>
>>> The effort in deployment is mainly in building and installing the
>>> software
>>> and its different components. This is certainly doable and I'm happy to
>>> help
>>> with it. If you run, say, purely on a CentOS5 based shop with aged Apache
>>> and complicated deployment procedures, it can be difficult, but d.sl.o
>>> rather seems to run Apache/2.2.11 on Ubuntu, which means that Apache is
>>> new
>>> enough, and everything else will be available as well I guess. I would
>>> actually like to build MirrorBrain packages on Ubuntu, and that might be
>>> a
>>> reason to do that maybe?
>>
>> Everything except the build farm is Ubuntu.  Ubuntu packages would be
>> nice.  But I am willing to build from scratch.
>
>
> Which Ubuntu version specifically? In the openSUSE build service, I can
> build for 8.04, 8.10, and 9.04. It would also be interesting for me to
> become a real Debian package maintainer, but using the openSUSE build
> service might be the quicker route for now. I managed to build mod_asn for
> Debian and Ubuntu already (see
> http://download.opensuse.org/repositories/Apache:/MirrorBrain/xUbuntu_9.04/),
> and I'm confident that I could do the same for mod_mirrorbrain and stuff
> that you would need. Those packages would be updated then from a single
> source together with the various RPM packages that are built, which would be
> of great convenience later.
>
> Most needed dependencies should already be available for a modern
> Debian/Ubuntu system. One thing that may be needed to be double-checked is
> mod_geoip. It seems that this module is very outdated -
> http://packages.qa.debian.org/liba/libapache2-mod-geoip.html has a 1.1.x
> version, and there is a newer package waiting in
> http://mentors.debian.net/debian/pool/main/l/libapache2-mod-geoip/ but even
> that is already 1.5 years old.
>
>
>>>> 2. How will mirrorbrain interact with a.sl.o (AMO)?  Will new
>>>> activities just be served from the primary node until mirrorbrain runs
>>>> a scan to verify that the new activity has been rsynced to a mirror
>>>> node?
>>>
>>> MirrorBrain needs the file tree locally and can work off it as a normal
>>> Apache. If it doesn't know a mirror for a file, Apache will just deliver
>>> it
>>> as normal; if a mirror is known, Apache will redirect to it. Therefore,
>>> publishing new files is just a matter of putting them into the file tree.
>>> Later, mirrors will catch up, and as soon as they are scanned, Apache
>>> will
>>> know about the presence on the mirrors and redirect to them.
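The behaviour described above can be sketched in a few lines of Python. This is not MirrorBrain's code (the real logic lives in the mod_mirrorbrain Apache module and its database). The `known_mirrors` dictionary, the `choose_response` function, the example hostname, and the use of a 302 status are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def choose_response(path: str, known_mirrors: Dict[str, List[str]]) -> Tuple[int, str]:
    """Return (HTTP status, target) for a requested path.

    known_mirrors maps a file path to the base URLs of mirrors that a
    scan has confirmed to hold that file (hypothetical data structure).
    """
    mirrors = known_mirrors.get(path, [])
    if not mirrors:
        # No mirror is known for this file yet: deliver it locally.
        return 200, path
    # A mirror is known: redirect to it. A real redirector would pick a
    # mirror by location or priority; this sketch takes the first one.
    return 302, mirrors[0].rstrip("/") + "/" + path.lstrip("/")


known = {"activities/foo-1.xo": ["http://mirror1.example.org/sugar"]}
print(choose_response("activities/foo-1.xo", known))  # already scanned on a mirror
print(choose_response("activities/new-2.xo", known))  # just published, not yet scanned
```

A freshly published file falls through to the local tree until a scan records it on a mirror, which is why publishing needs no extra step.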
>>
>> Ok great, so then we can modify the rsync so that only popular files
>> are mirrored.  a.sl.o keeps every version of an activity in the main
>> tree for historical purposes.  But there is no reason to keep copies
>> on the mirrors.
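One possible way to pick what gets mirrored, sketched under assumptions: the `<name>-<version>.xo` naming scheme and the `latest_versions` helper are hypothetical, and "newest version of each activity" stands in here for "popular", which the thread does not define.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical naming scheme: <activity name>-<integer version>.xo
XO_NAME = re.compile(r"^(?P<name>.+)-(?P<version>\d+)\.xo$")


def latest_versions(filenames: Iterable[str]) -> List[str]:
    """Keep only the highest-numbered .xo file of each activity."""
    newest: Dict[str, Tuple[int, str]] = {}
    for filename in filenames:
        match = XO_NAME.match(filename)
        if match is None:
            continue  # ignore anything that does not look like an activity bundle
        name = match.group("name")
        version = int(match.group("version"))
        if name not in newest or version > newest[name][0]:
            newest[name] = (version, filename)
    return sorted(filename for _, filename in newest.values())


files = ["paint-25.xo", "paint-26.xo", "chat-65.xo", "chat-9.xo", "README"]
print("\n".join(latest_versions(files)))
```

The resulting list could be written to a file and handed to rsync's `--files-from` option, so the main tree keeps every version while the mirrors only carry the selected ones.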
>
>
> Yes, this makes sense.
>
>
>>> If large amounts of content are published at once, it can be useful (or
>>> even
>>> needed) to first publish them only for the mirrors, by putting them into
>>> a
>>> stage area that they can access, and later update Apache's file tree,
>>> when
>>> they are distributed enough. Another regime (useful if the file tree is
>>> large and gets frequent, small updates) could be to push-sync files as
>>> soon as they come in, and directly scan after each push.
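The second regime (push, then scan right away) might look roughly like the sketch below. The paths, the rsync module name, and the mirror identifier are made up, and the exact `mb scan` invocation should be checked against the MirrorBrain documentation before use. The sketch only prints the commands unless `dry_run` is switched off.

```python
import subprocess
from typing import List, Sequence


def push_and_scan_commands(tree: str, rsync_target: str, mirror_id: str) -> List[List[str]]:
    """Build the two commands for one mirror: push the tree, then scan the mirror."""
    return [
        # Push the local file tree to the mirror (target is a hypothetical rsync module).
        ["rsync", "-a", tree.rstrip("/") + "/", rsync_target],
        # Scan the mirror so the redirector learns about the new files.
        ["mb", "scan", mirror_id],
    ]


def run_all(commands: Sequence[Sequence[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them in order and stop at the first failure."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(list(command), check=True)


cmds = push_and_scan_commands(
    "/srv/download", "mirror1.example.org::sugar-push/", "mirror1.example.org"
)
run_all(cmds)  # dry run: only shows what would be executed
```

Scanning only after a successful push keeps the redirector from learning about files that did not arrive.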
>>
>> Ok, we can figure that out.  It would be cool if a.sl.o could trigger
>> the push whenever a new activity is added.
>
>
> I started working on some kind of framework for this purpose, because the
> same need arose at openSUSE in the past, and there it was implemented with
> some simple (and hard to maintain) shell scripts. I am thinking of a Django
> web app to configure the pushes for mirrors, and a little job queue that
> runs the push syncs, and which is triggered by e.g. XML-RPC or REST
> interface, or by inotifies directly from the filesystem.
>
> The web frontend part I have almost implemented, and I've put some
> screenshots here to make the idea a little visible:
>
> http://www.poeml.de/~poeml/MirrorSync/mirrors.png
> http://www.poeml.de/~poeml/MirrorSync/modules.png
> http://www.poeml.de/~poeml/MirrorSync/excludes.png
>
> This is not of much practical use yet, but it might be an interesting path
> to go in the future. It's definitely something that other people/projects
> also have a need for, so a reusable and simple framework could be useful I
> thought.
>
> (The code is in a private SVN repository so far, just because I was
> experimenting with live data and needed to have passwords in the database)
>
>
>>> Maybe there is even an existing release infrastructure that one could
>>> integrate with.
>>
>> We are not that fancy yet.
>>
>>>> 3. How does mirrorbrain work with mysql? Do the admin framework and
>>>> tool set work with mysql yet?
>>>
>>> At the beginning of this year, I abandoned MySQL support in all the
>>> tools,
>>> but the core (the mod_mirrorbrain Apache module) will work. The tools to
>>> maintain the mirror database won't work, and while this could probably be
>>> fixed, I can say that when the list of mirrors is not long, and one is
>>> proficient in the mysql commandline, it is certainly possible to maintain
>>> the mirror data manually with the mysql client. I did so for a long time
>>> in
>>> fact, before I finally started to write some tools.
>>>
>>> I would recommend to use PostgreSQL because that will result in a setup
>>> that
>>> is clean and as documented, and also the database will be self-contained
>>> and
>>> low-maintenance enough that it wouldn't matter much to anyone which database
>>> is
>>> used underneath.
>>>
>>> However, mod_mirrorbrain will happily use MySQL as file database. I am
>>> *quite* sure that the scanner script also still works with MySQL, but I
>>> can't promise, as I haven't tested it since I did the switch to
>>> PostgreSQL.
>>>
>>> I decided to switch to PostgreSQL because Apache's DBD framework cannot
>>> use
>>> two different databases in one vhost yet, and I needed a special datatype
>>> in
>>> PostgreSQL to implement mod_asn (which you won't need with only few
>>> mirrors;
>>> don't bother to install it). I was aware that it might put off some
>>> people
>>> that are more familiar with MySQL, but I can speak very positively about
>>> PostgreSQL, it is a great piece of software and it was a pleasurable
>>> experience to me to get acquainted with it. I am happy to help with that;
>>> it's not difficult, just a little different.
>>
>> Using PostgreSQL is not a blocker.  So we can worry about that later.
>>
>>> It would of course be an option to re-implement MySQL support and
>>> PostgreSQL
>>> at the same time, but my time has been too scarce so far to even consider
>>> this, as there are other things that would seem more important, as e.g.
>>> the
>>> lack of a web interface, that I would like to tackle.
>>>
>>>
>>> Does this help further?
>>
>> So, I guess my next steps are:
>> 1. Set up an openSUSE VM and install mirrorbrain to see how it is
>> supposed to work.
>
>
> I once created a VirtualBox image based on openSUSE 11.1, which may be the
> quickest way to have a look:
> http://mirrorbrain.org/news/mirrorbrain-eval-virtualbox-appliance/
> It contains a complete install and one or two (Firefox) mirrors set up, and
> it should allow you to immediately play with Apache as well as with the "mb"
> admin tool (see http://mirrorbrain.org/docs/mirrors/).
>
> You could adjust the path to the file tree in the Apache configuration (see
> /etc/apache2/vhosts.d/*.conf), rsync a copy of the file tree into the image,
> add your mirrors to the database, scan them and you should have a working
> redirector then.
>
>> 2. Set up an Ubuntu VM matching the sugar labs infrastructure and
>> install mirrorbrain.
>>
>> I'll try to do that this weekend.  I am sure I will have questions.
>
>
> As happy as I would be to directly assist you with it, I'll be away for the
> weekend unfortunately (and leave now). But I'm back on Monday!
>
Thanks Peter

As promised, I created a recipe to set up mirrorbrain for Sugar
Labs at http://wiki.sugarlabs.org/go/Infrastructure_Team/Content_Delivery_Network
.

I have asked Bernie, the Sugar Labs sysadmin, to set up an Ubuntu
9.04 vm so I can do more testing this week.  I have three vms set up
as mirrors and one vm set up as the mirrorbrain on my desktop.  I am
downloading to three laptops on a wired network.  The current
bottleneck is the laptop hard drives.

Everything looks good so far.  Thanks for all your help.

david

