One of my next goals to improve mirror crawling is to split the crawls of the mirrors by category. Right now we select a mirror and crawl all categories (Fedora Linux, Fedora EPEL, Fedora Secondary, Fedora Archives, Fedora Other) in one go. The drawback is that it is nearly impossible to crawl a mirror which mirrors everything within the time limit of 3 hours. There are a few mirrors which actually mirror everything, and they are usually dropped from the mirror list because the crawler always hits the 3 hour limit and marks the mirror as not being up to date. The current solution is to create multiple hosts (which can point to the same mirror) with only one or two categories each. This works, but it is not the optimal solution.

Most of the time the actual scanning of the remote mirror is not the real problem; updating the status of all those directories and files in the local database also takes a very long time.

The master crawling by update-master-directory-list (umdl) is already split up by category and fedmsg driven (for most categories). So whenever a repository is updated, umdl starts a scan and updates the database for only the category which has changed. This works pretty well, but it has the disadvantage that the database is now updated much faster, without giving the mirrors a chance to sync before we have the new information in the database.

The reason for this long introduction is that my original plan was to start a category crawl immediately after umdl has signalled that a certain category has been updated in the database. This could lead to a very short list of mirrors which are up to date, and therefore I would like to know if we should somehow introduce a delay between the time umdl has run and the time we start to crawl the mirrors. This would give the mirrors some time to sync the content before we crawl them. Right now the time between the update of the master mirror and the crawl can be anywhere between 0 and 12 hours.
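To make the idea of a delay more concrete, here is a minimal sketch of the waiting part on the crawler side. It is only an illustration under assumptions, not existing MirrorManager code: the class name, the method names and the 2 hour default are all hypothetical, and the fedmsg handling that would call umdl_finished() is deliberately left out. It also assumes one possible policy, namely that a newer umdl run for the same category pushes the crawl back again.

```python
# Hypothetical placeholder; the right value is exactly the open question.
DEFAULT_DELAY_SECONDS = 2 * 60 * 60


class DelayedCrawlQueue:
    """Remember when each category may be crawled after a umdl run."""

    def __init__(self, delay_seconds=DEFAULT_DELAY_SECONDS):
        self.delay_seconds = delay_seconds
        # category name -> earliest crawl time (seconds since the epoch)
        self._due_at = {}

    def umdl_finished(self, category, finished_at):
        """Record that umdl finished updating `category` at `finished_at`."""
        # Assumed policy: a newer umdl run replaces the old due time, so the
        # mirrors always get the full delay after the latest master update.
        self._due_at[category] = finished_at + self.delay_seconds

    def pop_due(self, now):
        """Return and forget the categories whose delay has expired."""
        due = sorted(
            (c for c, t in self._due_at.items() if t <= now),
            key=self._due_at.get,
        )
        for category in due:
            del self._due_at[category]
        return due
```

The crawler could call pop_due(time.time()) periodically and start a category crawl for every name it gets back.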
With a defined delay before crawling the mirrors, the time between a master update and the crawl would be clearer than it is right now. I am also hoping to be able to crawl the mirrors more often than twice a day once we move to category-based crawls.

So my main question is: should we insert a delay between umdl and the crawl of the mirrors? This would require a fedmsg emitted at the end of a umdl run and something on the crawler side which waits for some time before starting the crawls.

Adrian