On Sat, Apr 14, 2018 at 12:37:24AM +0000, Stephen John Smoogen wrote:
On Fri, Apr 13, 2018 at 11:14 AM Adrian Reber <adrian(a)lisas.de> wrote:
> I would like to change the setup of our mirror crawler and just wanted
> to mention my planned changes here before working on them.
>
> Currently we have two VMs which are crawling our mirrors. Each of the
> machines is responsible for one half of the active mirrors. The crawl
> starts every 12 hours on the first crawler and 6 hours later on the
> second crawler. So every 6 hours one crawler is accessing the database.
>
> Currently most of the crawling time is not spent crawling but updating
> the database about which host has which directory up to date. With a
> timeout of 4 hours per host we are hitting that timeout on some hosts
> regularly and most of the time the database access is the problem.
>
> What I would like to change is to crawl each category (Fedora Linux,
> Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> separately and at different times and intervals.
>
> We would not hit the timeout as often as now as only the information for
> a single category has to be updated. We could scan 'Fedora Archive' only
> once per day or every second day. We can scan 'Fedora EPEL' much more
> often as it is usually really fast and get better data about the
> available mirrors.
>
> My goal would be to distribute the scanning in such a way as to decrease
> the load on the database and to decrease the cases of mirror
> auto-deactivation due to slow database accesses.
>
> Let me know if you think that these planned changes are the wrong
> direction or if you have other ideas on how to improve the mirror crawling.
These look like a good way to deal with the fact that we have a lot of data
and files and mirrors and users get confused about how up to date they are.
Would more VMs also help spread this out?
From my point of view the main problem is the load MirrorManager creates
on the database. Currently I do not think that more VMs would help the
crawling. Someone once mentioned a dedicated database VM for
MirrorManager. That is something which could make a difference, but
first I would like to see if crawling per category can improve the
situation.
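
To illustrate the per-category idea, here is a minimal sketch of how a
scheduler could decide which categories are due for a crawl. The interval
values, the CRAWL_INTERVALS table and the categories_due() helper are
assumptions made up for this example and are not MirrorManager's actual
code or configuration. Only "Archive rarely, EPEL more often" comes from
the proposal.

```python
from datetime import datetime, timedelta

# Hypothetical crawl intervals per category (illustrative values only).
CRAWL_INTERVALS = {
    "Fedora Linux": timedelta(hours=12),
    "Fedora Other": timedelta(hours=12),
    "Fedora EPEL": timedelta(hours=4),       # fast to scan, so crawl often
    "Fedora Secondary Arches": timedelta(hours=12),
    "Fedora Archive": timedelta(hours=48),   # rarely changes
}


def categories_due(last_crawled, now):
    """Return the names of categories whose interval has elapsed, sorted.

    last_crawled maps a category name to the datetime of its last crawl.
    A category that has never been crawled is always due.
    """
    due = []
    for category, interval in CRAWL_INTERVALS.items():
        last = last_crawled.get(category)
        if last is None or now - last >= interval:
            due.append(category)
    return sorted(due)
```

Running something like this from a frequent cron job would start only the
crawls that are due, so each run updates the database for a single
category instead of for everything a host carries.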
Adrian