I would like to change the setup of our mirror crawler and wanted to mention my planned changes here before working on them.

Currently we have two VMs which crawl our mirrors. Each machine is responsible for one half of the active mirrors. The crawl starts every 12 hours on the first crawler and 6 hours later on the second crawler, so every 6 hours one crawler is accessing the database.

Currently most of the crawling time is not spent crawling but updating the database with which host has which directory up to date. With a timeout of 4 hours per host we regularly hit that timeout on some hosts, and most of the time the database access is the problem.

What I would like to change is to crawl each category (Fedora Linux, Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive) separately and at different times and intervals. We would not hit the timeout as often as now, because only the information for a single category has to be updated per crawl. We could scan 'Fedora Archive' only once per day or every second day. We could scan 'Fedora EPEL' much more often, as it is usually really fast, and get better data about the available mirrors.

My goal is to distribute the scanning in a way that decreases the load on the database and reduces the cases of mirror auto-deactivation due to slow database access.

Let me know if you think these planned changes are the wrong direction or if you have other ideas on how to improve the mirror crawling.

Adrian
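To illustrate the staggering idea, here is a minimal Python sketch of a per-category schedule. Only the category names come from the proposal above; the intervals, the offsets, and the names SCHEDULE and crawl_starts are hypothetical and chosen purely so that no two category crawls start in the same hour. This is not the crawler's actual configuration.

```python
# Hypothetical per-category schedule: (interval_hours, offset_hours).
# The numbers are illustrative assumptions, not a proposed configuration.
SCHEDULE = {
    "Fedora Linux": (12, 0),
    "Fedora EPEL": (4, 1),              # crawled more often, usually fast
    "Fedora Other": (12, 2),
    "Fedora Archive": (24, 3),          # crawled only once per day
    "Fedora Secondary Arches": (12, 6),
}


def crawl_starts(horizon_hours=24):
    """Return sorted (hour, category) pairs for crawl starts within the horizon."""
    starts = []
    for category, (interval, offset) in SCHEDULE.items():
        hour = offset
        while hour < horizon_hours:
            starts.append((hour, category))
            hour += interval
    return sorted(starts)


if __name__ == "__main__":
    for hour, category in crawl_starts():
        print(f"{hour:02d}:00  {category}")
```

With these example values, each crawl start only has to update the database for a single category, and the starts are spread over the day instead of arriving as two large crawls.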