Freeze Break Request - reduce number of auto-disabled mirrors

Adrian Reber adrian at lisas.de
Wed May 13 12:17:24 UTC 2015


The new MM2 crawler disables mirrors which it has failed to crawl
successfully for 4 consecutive crawls. This seems to be a good idea, as
it reduces the total number of crawls by removing mirrors which are
just too slow. Unfortunately the current default timeout of 2 hours is
not enough, especially for mirrors which mirror more than one category,
as the timeout is per host and not per category. The problem is also not
network bound; it seems to be related to two crawlers updating the
directories of all mirrors in the same database at the same time. To
work around this timeout problem I am now starting the crawl on the
second crawler 3 hours later and I have also increased the timeout from
2 hours to 3 hours. Additionally a small fix is included to also crawl
the last mirror in the database, which was ignored until now.
After this is applied I would also re-enable the auto-disabled hosts
in the database.

Can I get two +1s for these changes?

Additionally I think we can remove the second crawler and just use the
times when the first crawler is idle to crawl other hosts. So instead of
starting a crawl every 12 hours on two crawlers we could crawl half of the
mirrors every 6 hours. But that is for after the freeze.

		Adrian

commit 0dc0b70d9790b95cb6c1f41b4d36fb9aa2c9fbfc
Author: Adrian Reber <adrian at lisas.de>
Date:   Wed May 13 11:23:21 2015 +0000

    Start the crawl later on the second crawler.
    
    Even with rsync as the crawl method some hosts take a very long time
    to be crawled. The network connection with rsync is only open for a
    short time, but with both crawlers reading from and writing to the
    database it takes a very long time until the status of all
    directories is updated. Therefore this patch introduces a 3 hour
    delay of the crawl on the second crawler. This could also be solved
    with two different cron.d files, one for each crawler.

diff --git a/roles/mirrormanager/crawler/files/crawler.cron b/roles/mirrormanager/crawler/files/crawler.cron
index 3d695ca..c74b915 100644
--- a/roles/mirrormanager/crawler/files/crawler.cron
+++ b/roles/mirrormanager/crawler/files/crawler.cron
@@ -1,4 +1,8 @@
 # run the crawler twice a day
 # logs sent to /var/log/mirrormanager/crawler.log and crawl/* by default
 # 32GB of RAM is not enough for 75 threads, 38 seems to work so far
-0 */12 * * * mirrormanager /usr/bin/mm2_crawler --threads 38 `/usr/local/bin/run_crawler.sh 2` > /dev/null 2>&1
+#
+# [ "`hostname -s`" == "mm-crawler02" ] && sleep 3h is used to start the crawl
+# later on the second crawler to reduce the number of parallel accesses to
+# the database
+0 */12 * * * mirrormanager [ "`hostname -s`" == "mm-crawler02" ] && sleep 3h; /usr/bin/mm2_crawler --threads 38 `/usr/local/bin/run_crawler.sh 2` > /dev/null 2>&1
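
The cron entry above relies on `A && B; C` grouping: `sleep 3h` runs only when the host name matches, and the crawler command after the `;` runs in either case. A minimal sketch of the same decision as a shell function follows. The helper name `crawl_delay` is hypothetical and not part of the patch, and it uses the POSIX `=` comparison instead of `==`, which not every /bin/sh accepts inside `[ ]`.

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch: print the start delay for a host.
# The host name mm-crawler02 and the 3h value are taken from the cron entry.
crawl_delay() {
    if [ "$1" = "mm-crawler02" ]; then
        echo "3h"
    else
        echo "0"
    fi
}

# Usage sketch (assumes a sleep that understands the "h" suffix, as GNU sleep does):
#   sleep "$(crawl_delay "$(hostname -s)")"
```

Keeping the delay in a helper like this would leave a single cron line for both crawlers, which is the same trade-off the patch makes against two separate cron.d files.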

commit 06309516b88ffade6f00c78833bc62aa002d7f56
Author: Adrian Reber <adrian at lisas.de>
Date:   Wed May 13 11:53:16 2015 +0000

    Increase crawler timeout from 2h to 3h.
    
    Since MM2 went into production about 140 mirrors have been
    auto-disabled due to the crawler timing out after 2 hours (the
    default). See whether it works better with 3 hours. This, in
    combination with the previous commit to decrease the load on the
    database, should help to auto-disable fewer good mirrors. Mirrors
    which mirror almost everything in particular can hardly be crawled
    within the 2 hour limit. Unfortunately the limit is per host and not
    per category.

diff --git a/roles/mirrormanager/crawler/files/crawler.cron b/roles/mirrormanager/crawler/files/crawler.cron
index c74b915..66801d7 100644
--- a/roles/mirrormanager/crawler/files/crawler.cron
+++ b/roles/mirrormanager/crawler/files/crawler.cron
@@ -5,4 +5,4 @@
 # [ "`hostname -s`" == "mm-crawler02" ] && sleep 3h is used to start the crawl
 # later on the second crawler to reduce the number of parallel accesses to
 # the database
-0 */12 * * * mirrormanager [ "`hostname -s`" == "mm-crawler02" ] && sleep 3h; /usr/bin/mm2_crawler --threads 38 `/usr/local/bin/run_crawler.sh 2` > /dev/null 2>&1
+0 */12 * * * mirrormanager [ "`hostname -s`" == "mm-crawler02" ] && sleep 3h; /usr/bin/mm2_crawler --timeout-minutes 180 --threads 38 `/usr/local/bin/run_crawler.sh 2` > /dev/null 2>&1

commit 78c19a35d1706eb43e4e0c5fa202e0a6549674de
Author: Adrian Reber <adrian at lisas.de>
Date:   Wed May 13 11:59:37 2015 +0000

    Also crawl the last mirror in the database.
    
    The last mirror in the database was not crawled and this adds '1' to
    the --stopid if necessary.

diff --git a/roles/mirrormanager/crawler/files/run_crawler.sh b/roles/mirrormanager/crawler/files/run_crawler.sh
index b9d642e..3269dea 100644
--- a/roles/mirrormanager/crawler/files/run_crawler.sh
+++ b/roles/mirrormanager/crawler/files/run_crawler.sh
@@ -26,4 +26,7 @@ for i in `seq 1 ${NUMBER_OF_CRAWLERS}`; do
         fi
         let STARTID=${STARTID}+${PART}
         let STOPID=${STOPID}+${PART}
+       if [ "${STOPID}" -eq "${MAX_HOST}" ]; then
+               let STOPID=${STOPID}+1
+       fi
 done
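
The off-by-one addressed by the last patch can be sketched as follows. This is a minimal sketch, not the real run_crawler.sh: the function name `crawler_range`, the ceiling-division slice size, and the reading of --stopid as an exclusive upper bound are all assumptions, and --startid is assumed by analogy with the --stopid option named in the commit message.

```shell
#!/bin/sh
# Hypothetical sketch: split host IDs 0..max_host across several crawlers.
# Assumption: the crawler treats --stopid as an exclusive upper bound, so the
# slice that reaches max_host must stop at max_host + 1 to include that host.
crawler_range() {
    index=$1       # 1-based number of this crawler
    crawlers=$2    # total number of crawlers
    max_host=$3    # highest host ID in the database

    # Ceiling division, so that crawlers * part always covers max_host.
    part=$(( (max_host + crawlers - 1) / crawlers ))
    start=$(( (index - 1) * part ))
    stop=$(( index * part ))

    # Any slice that reaches max_host is extended to max_host + 1.
    if [ "$stop" -ge "$max_host" ]; then
        stop=$(( max_host + 1 ))
    fi

    echo "--startid ${start} --stopid ${stop}"
}
```

Under these assumptions, the second of two crawlers with a highest host ID of 100 would be called with a stop ID of 101, so host 100 is no longer skipped.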

