FBR: Change to category based crawling

As discussed previously, I would like to change the crawler to crawl each
category separately. The goal is to reduce the load on the database by
spreading the crawls more evenly over the whole day, and to reduce the
chance of mirrors being disabled because of high database load.

This should also remove the need for mirror administrators to create
multiple hosts in MirrorManager to work around the 4-hour timeout per
host.

Attached is my patch. Please +1. This affects mm-crawler01 and
mm-crawler02.
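
To make the resulting spread over the day easier to check, here is a
minimal sketch that computes when a crawl actually starts on a host. The
function name start_times is made up for this illustration and is not part
of MirrorManager. It assumes the cron fields from the patch, the extra
"sleep Nh" that only runs on mm-crawler02, and the 5-minute wait between
the two pkill calls; the time the pkill calls themselves take is ignored.

```shell
#!/bin/sh
# Hypothetical helper: print the effective crawl start times (HH:MM)
# for one cron entry.
#   $1  cron minute field (plain number)
#   $2  cron hours, comma-separated and without leading zeros, e.g. 0,12
#   $3  extra delay in hours ("sleep Nh" on mm-crawler02, 0 on mm-crawler01)
start_times() {
    minute=$1
    hours=$2
    delay_h=$3
    for h in $(echo "$hours" | tr ',' ' '); do
        # cron start + per-host sleep + 5 minutes between SIGALRM and SIGKILL
        total=$(( (h + delay_h) * 60 + minute + 5 ))
        printf '%02d:%02d\n' $(( (total / 60) % 24 )) $(( total % 60 ))
    done
}

# 'Fedora Linux' is scheduled with "0 */12", i.e. hours 0 and 12
start_times 0 0,12 0    # mm-crawler01
start_times 0 0,12 6    # mm-crawler02
```

For 'Fedora Linux' this gives 00:05 and 12:05 on mm-crawler01 and 06:05
and 18:05 on mm-crawler02.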

		Adrian
From b10d5ffa7e48e934da3350186eaf8dd4fb0cebf3 Mon Sep 17 00:00:00 2001
From: Adrian Reber <adrian@xxxxxxxx>
Date: Tue, 17 Apr 2018 19:42:16 +0200
Subject: [PATCH] mirror crawler: crawl each category separately

This is a first attempt at splitting the mirror crawl by category. One
of the goals is to distribute the load on the database better. The
effects of this change have to be monitored to see whether it actually
works.

Another result could be that mirrors are auto-deactivated less quickly.
Previously, there was a single crawl timeout of 4 hours for all
categories together. Now each category has its own 4-hour timeout.
---
 .../mirrormanager/crawler/files/crawler.cron  | 27 ++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/roles/mirrormanager/crawler/files/crawler.cron b/roles/mirrormanager/crawler/files/crawler.cron
index 33f3967f3..24d141574 100644
--- a/roles/mirrormanager/crawler/files/crawler.cron
+++ b/roles/mirrormanager/crawler/files/crawler.cron
@@ -1,4 +1,4 @@
-# run the crawler twice a day
+# run the crawler for each MirrorManager category
 # logs sent to /var/log/mirrormanager/crawler.log and crawl/* by default
 #
 # [ "`hostname -s`" == "mm-crawler02" ] && sleep 6h is used to start the crawl
@@ -10,5 +10,26 @@
 # gracefully shutdown if it gets the signal SIGALRM(14).  After the signal we
 # wait for 5 minutes to give the crawler a chance to shutdown. After that the
 # crawler is killed.  To make sure we only end the cron started crawler we look
-# for the following process "/usr/bin/python /usr/bin/mm2_crawler --threads 25".
-0 */12 * * * mirrormanager [ "`hostname -s`" == "mm-crawler02" ] && sleep 6h; pkill -14 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --threads 20"; sleep 5m; pkill -9 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --threads 20"; /usr/bin/mm2_crawler --threads 20 --timeout-minutes 240 `/usr/local/bin/run_crawler.sh 2` > /dev/null 2>&1
+# for the following process "/usr/bin/python2 -s /usr/bin/mm2_crawler --category=<category>".
+
+# The number of threads is based on how many mirrors a category is likely to
+# have: more threads for categories with more mirrors.
+
+# The goal is to distribute the crawling of all categories over the whole day.
+
+# The timeout is 4 hours per category, not 4 hours for all categories together.
+
+# Category: 'Fedora Linux'; twice a day, 20 threads
+0 */12 * * * mirrormanager [ "`hostname -s`" == "mm-crawler02" ] && sleep 6h; pkill -14 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --category=Fedora Linux"; sleep 5m; pkill -9 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --category=Fedora Linux"; /usr/bin/mm2_crawler --category="Fedora Linux" --threads 20 --timeout-minutes 240 `/usr/local/bin/run_crawler.sh 2` > /dev/null 2>&1
+
+# Category: 'Fedora Secondary Arches'; twice a day, 10 threads
+0 3,9 * * * mirrormanager [ "`hostname -s`" == "mm-crawler02" ] && sleep 1h; pkill -14 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --category=Fedora Secondary Arches"; sleep 5m; pkill -9 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --category=Fedora Secondary Arches"; /usr/bin/mm2_crawler --category="Fedora Secondary Arches" --threads 10 --timeout-minutes 240 `/usr/local/bin/run_crawler.sh 2` > /dev/null 2>&1
+
+# Category: 'Fedora EPEL'; four times a day, 20 threads
+45 */6 * * * mirrormanager [ "`hostname -s`" == "mm-crawler02" ] && sleep 1h; pkill -14 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --category=Fedora EPEL"; sleep 5m; pkill -9 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --category=Fedora EPEL"; /usr/bin/mm2_crawler --category="Fedora EPEL" --threads 20 --timeout-minutes 240 `/usr/local/bin/run_crawler.sh 2` > /dev/null 2>&1
+
+# Category: 'Fedora Archive'; once a day, 10 threads
+0 2 * * * mirrormanager [ "`hostname -s`" == "mm-crawler02" ] && sleep 6h; pkill -14 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --category=Fedora Archive"; sleep 5m; pkill -9 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --category=Fedora Archive"; /usr/bin/mm2_crawler --category="Fedora Archive" --threads 10 --timeout-minutes 240 `/usr/local/bin/run_crawler.sh 2` > /dev/null 2>&1
+
+# Category: 'Fedora Other'; once a day, 10 threads
+0 14 * * * mirrormanager [ "`hostname -s`" == "mm-crawler02" ] && sleep 6h; pkill -14 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --category=Fedora Other"; sleep 5m; pkill -9 -f "^/usr/bin/python2 -s /usr/bin/mm2_crawler --category=Fedora Other"; /usr/bin/mm2_crawler --category="Fedora Other" --threads 10 --timeout-minutes 240 `/usr/local/bin/run_crawler.sh 2` > /dev/null 2>&1
-- 
2.17.0

_______________________________________________
infrastructure mailing list -- infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to infrastructure-leave@xxxxxxxxxxxxxxxxxxxxxxx
