Master and mirror crawling

Adrian Reber <adrian@xxxxxxxx> · Fri, 11 Sep 2015 16:56:41 +0200

One of my next goals to improve mirror crawling is to split the crawls
of the mirrors by category. Right now we select a mirror and crawl all
categories (Fedora Linux, Fedora EPEL, Fedora Secondary, Fedora
Archives, Fedora Other) in one go. The drawback is that it is nearly
impossible to crawl a mirror which mirrors everything within the time
limit of 3 hours. There are a few mirrors which actually mirror
everything and they are usually dropped from the mirror list because the
crawler always hits the 3 hour limit and marks the mirror as not being
up to date. The current solution is to create multiple hosts (which can
point to the same mirror) with only one or two categories. This works
but it is not the optimal solution.
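
A rough sketch of why per-category crawls help such mirrors, assuming
the time limit would then apply to each category on its own rather than
to the whole mirror. The function name, the crawl_one callback and the
durations are made up for illustration and are not taken from the
MirrorManager code:

```python
THREE_HOURS = 3 * 60 * 60  # the current crawl time limit, in seconds


def crawl_by_category(categories, crawl_one, time_limit=THREE_HOURS):
    """Crawl every category separately and record a status per category.

    crawl_one is a hypothetical callback which crawls a single category
    and returns how many seconds that took.
    """
    status = {}
    for category in categories:
        elapsed = crawl_one(category)
        # Only the category which ran too long is marked as out of date,
        # not the whole mirror.
        status[category] = elapsed <= time_limit
    return status


# Invented durations for a mirror which carries two large categories.
durations = {"Fedora Linux": 2 * 3600, "Fedora Archives": 4 * 3600}
status = crawl_by_category(list(durations), durations.get)
```

With these numbers a single crawl over both categories would need about
six hours and the whole mirror would be dropped, while the per-category
result still keeps "Fedora Linux" as up to date.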

Most of the time the actual scanning of the remote mirror is not the
real problem; updating the status of all those directories and files in
the local database also takes a very long time.

The master crawling by update-master-directory-list (umdl) is already
split up by category and fedmsg driven (for most categories). So
whenever a repository is updated umdl starts a scan and updates the
database for only the category which has changed. This works pretty well
but has the disadvantage that the database is now updated much faster,
leaving the mirrors no chance to sync before we have new information in
the database.

The reason for this long introduction is that my original plan was to
immediately start a category crawl after umdl has signalled that a certain
category has been updated in the database. This could lead to a very
short list of mirrors which are up to date and therefore I would like to
know if we should somehow introduce a delay between the time umdl has
run and the time we start to crawl the mirrors. This would give the
mirrors some time to sync the content before we crawl them.

Right now the time between the update of the master mirror and the crawl
can be anywhere between 0 and 12 hours. With a defined delay before
crawling the mirrors this would be clearer than it is right now.

I am also hoping to be able to crawl the mirrors more often than twice a
day once we move to category based crawls.

So my main question is: should we insert a delay between umdl and the
crawl of the mirrors? This would require a fedmsg emitted at the end of
a umdl run and something on the crawler side which waits some time
before starting the crawls.
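
A minimal sketch of what the waiting part on the crawler side could look
like, assuming the crawler is told the category and the time whenever a
umdl run finishes. The class and method names are hypothetical, the
fedmsg handling itself is left out, and the two hour delay is only a
placeholder value:

```python
from datetime import datetime, timedelta


class DelayedCrawlQueue:
    """Hold updated categories back until the sync delay has passed."""

    def __init__(self, delay):
        self.delay = delay
        self._due = {}  # category -> earliest time a crawl may start

    def umdl_finished(self, category, finished_at):
        # Would be called for every "umdl run finished" message. A later
        # run for the same category pushes the crawl further back.
        self._due[category] = finished_at + self.delay

    def pop_ready(self, now):
        # Return (and forget) all categories whose delay has expired.
        ready = sorted(c for c, due in self._due.items() if due <= now)
        for category in ready:
            del self._due[category]
        return ready


queue = DelayedCrawlQueue(delay=timedelta(hours=2))  # placeholder delay
t0 = datetime(2015, 9, 11, 12, 0)
queue.umdl_finished("Fedora EPEL", t0)

too_early = queue.pop_ready(t0 + timedelta(hours=1))
on_time = queue.pop_ready(t0 + timedelta(hours=2))
```

Resetting the timer on every new umdl run is a design choice in this
sketch; a category which changes very often would then need an upper
bound on the wait so that it still gets crawled.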

		Adrian
_______________________________________________
infrastructure mailing list
infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
http://lists.fedoraproject.org/postorius/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx