Re: How mirrormanager works

Matt Domsch <matt@xxxxxxxxxx> · Sun, 4 May 2014 08:00:11 -0500

On Sat, May 03, 2014 at 11:40:49PM +0200, Pierre-Yves Chibon wrote:
> Hi Matt,
> 
> I would like to run by you my understanding of how mirrormanager works. I hope 
> I am not too far from the truth, but if I am, please let me know :)
> 
> Mirrormanager is split into three parts: the UI, the API and a cron task.

There are several cronjobs, but basically correct.

> The UI is the current TurboGears1 application [1]. People can log in there and
> register an institution (Site) with one or more sub-domains (Hosts), each
> mirroring one or more Products (Fedora, EPEL...).
> For each sub-domain (Host), there are a number of settings available, such as
> where the mirror pulls its updates from, who can pull from it, and restrictions
> on the user's country or network.
> Users manage their mirrors there; admins can access any Site/Host and update
> its settings as well.

Correct.  This is the ugliest part of MM1.x; I will be most glad for
its complete rewrite.

> The API is a CGI script, called in our case by yum, which redirects the user to
> the closest mirror that is active, up to date, and allowed by the settings of
> the mirror.

Correct.  This (mirrorlist-server directory) is almost entirely
divorced from the TG 1.x components.  It does not touch the database
directly in any way, but is handed a pickle containing a precalculated
cache of the database, which can then be distributed to each of the
mirrorlist-server servers.  As long as you don't break the format of
the pickle, you don't have to change this code.
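
As a minimal sketch of that hand-off (the cache layout and the function
names below are hypothetical and are not MM's actual pickle format): one
side builds a plain dictionary from database rows and pickles it, and the
serving side only ever loads the file.

```python
import pickle


def build_cache(rows):
    """Group (directory, host_url) rows into {directory: [host_url, ...]}.

    Hypothetical layout, for illustration only.
    """
    cache = {}
    for directory, host_url in rows:
        cache.setdefault(directory, []).append(host_url)
    return cache


def dump_cache(cache, path):
    """Write the precalculated cache; run where the database is reachable."""
    with open(path, "wb") as f:
        pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_cache(path):
    """Read the cache on a serving host; no database access is needed."""
    with open(path, "rb") as f:
        return pickle.load(f)


def mirrors_for(cache, directory):
    """Return the hosts recorded for a directory, or an empty list."""
    return cache.get(directory, [])
```

The serving side depends only on the shape of the unpickled object, which
is why the format is the one thing that must stay stable.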

> The cron task runs daily and crawls the mirrors to check whether they are
> up to date, and records in the database which folders are up
> to date and which are not for every single mirror.  This cron task
> (or is it another?) also generates the publiclist pages [2] using
> the data it just retrieved about each mirror.

There are several different cronjobs.

1) crawlers, which check every directory on every mirror every few
(~2-4) hours, and updates the database with which directories on each
mirror are up-to-date.  We don't mark whole servers as stale, but
individual directories on individual servers.

2) generate the publiclist pages.

3) generate the mirrorlist-server pickle.

4) retrieve some external network routing data from public routers
(both Internet2 and Internet1) used to identify which servers and
clients are on which autonomous systems, and to select the proper
server.

5) get the monthly GeoIP database from MaxMind.  This isn't part of MM
proper, but MM needs it to run.
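
To illustrate how the routing data in item 4 can be used, here is a rough
sketch of mapping a client address to an autonomous system by longest
prefix match.  The prefix table, the AS numbers and the function names are
made up for the example; MM's real data comes from the router dumps and is
stored differently.

```python
import ipaddress

# Hypothetical prefix-to-ASN table (documentation address ranges and
# private-use AS numbers), standing in for real routing data.
PREFIX_TO_ASN = {
    "192.0.2.0/24": 64500,
    "198.51.100.0/24": 64501,
    "198.51.100.128/25": 64502,
}


def asn_for(ip, table=PREFIX_TO_ASN):
    """Return the ASN of the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None  # (prefix length, asn) of the best match so far
    for prefix, asn in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, asn)
    return best[1] if best else None


def same_asn_mirrors(client_ip, mirrors_by_asn):
    """Return the mirrors registered on the client's ASN, if any."""
    return mirrors_by_asn.get(asn_for(client_ip), [])
```

A linear scan is fine for a sketch; a full routing table would call for a
radix tree or similar structure.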

> I have been trying to figure out how exactly the mirrors on publiclist are
> retrieved. I found the query used by MM1 [3] and I was wondering if there would
> not be a way to simplify that. Would it be an option to have the cron task
> set a flag in the database saying whether a host is up to date or not?
> Or am I missing some of the information?

It is complicated because there are two main cases:

1) we track mirror freshness by individual directory, not by whole
mirror.  We will put a mirror on the publiclist if at least one of its
directories has up-to-date content.  The likelihood of a mirror having
at least one directory stale throughout any given day is quite high as
content changes on the masters frequently; I don't want to keep
adding and removing mirrors from the publiclist at that frequency.

To change to a host-global up-to-date flag would certainly simplify
this logic, at a loss of granularity.

2) a few mirrors are special: they have the 'always_up2date' flag set
on their HostCategory.  These are chiefly Fedora master mirrors
that we don't want to crawl, but do want to appear in the
publiclist.  A second case is "trusted" mirrors (e.g. the ones I run
at Dell) which because of firewalls cannot be hit by the crawlers, but
can serve content internally just fine.  Those are also marked
"always_up2date" and it's incumbent on me to make that assertion true.

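The two cases above boil down to a simple rule, sketched here with
hypothetical names (only 'always_up2date' and HostCategory are real MM
terms; the per-directory dictionary is invented for the example, and the
real query in MM1 runs against the database):

```python
from dataclasses import dataclass, field


@dataclass
class HostCategory:
    """Illustrative stand-in for a host's entry in one category."""
    host: str
    always_up2date: bool = False
    # Hypothetical crawler result: directory name -> is it up to date?
    dir_up2date: dict = field(default_factory=dict)


def on_publiclist(hc):
    """List a host if it is trusted or has at least one fresh directory."""
    return hc.always_up2date or any(hc.dir_up2date.values())


def publiclist(host_categories):
    """Return the sorted, de-duplicated hosts that qualify for listing."""
    return sorted({hc.host for hc in host_categories if on_publiclist(hc)})
```

A host-global flag would replace the any() over directories with a single
boolean, which is the simplification and the loss of granularity discussed
above.
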
> If we could simplify this part we could drop the cron task generating the
> publiclist pages and just display them on the fly as part of the UI.

I'd prefer not, but I'll leave that to the collective new maintenance
team to determine the desired output. :-)

> As part of the re-write, there is of course the UI since it is TurboGears1, but
> the CGI script and the cron tasks should not need many changes, should they?

Some of the cron tasks call into the TG1 model/controller code
directly, so yes, they would.  The CGI would not.

There are a few other "helper apps" like "move-development-to-release"
which also call into the TG1 model/controller code and would need to
be updated.

Thanks,
Matt