MM2 status

Adrian Reber adrian at lisas.de
Mon Mar 23 19:31:58 UTC 2015


On Mon, Mar 23, 2015 at 10:11:56AM -0600, Stephen John Smoogen wrote:
> On 23 March 2015 at 09:59, Adrian Reber <adrian at lisas.de> wrote:
> > > Additionally, the 4GB of RAM on mm-crawler01 is not enough to
> > > crawl all the mirrors in a reasonable time. Even when started
> > > with only 20 crawler threads instead of 75, 4GB is not enough.
> >
> > This has been increased to 32GB (thanks) and I had a few test runs
> > of the crawler over the weekend with libcurl from F21:
> >
> > All runs for 435 mirrors take at least 6 hours:
> >
> > 50 threads:
> >
> > http://lisas.de/~adrian/crawler-resources/2015-03-21-19-51-44-crawler-resources.pdf
> >
> > 50 threads with explicit garbage collection:
> >
> > http://lisas.de/~adrian/crawler-resources/2015-03-22-06-18-30-crawler-resources.pdf
> >
> > 75 threads:
> >
> > http://lisas.de/~adrian/crawler-resources/2015-03-22-13-02-37-crawler-resources.pdf
> >
> > 75 threads with variables explicitly set to None at the end:
> >
> > http://lisas.de/~adrian/crawler-resources/2015-03-23-07-46-19-crawler-resources.pdf
> >
> > Manually triggering the garbage collector makes almost no difference
> > (if any at all). The crawler uses a huge amount of memory and takes a
> > really long time.
> >
> > As much as I like the new threaded design, I am not 100% convinced it
> > is the best solution when looking at the memory requirements. Memory
> > must be leaking somewhere.
> >
> > The next change I will make is to sort the mirrors in descending order
> > by crawl duration, to make sure the longest-running crawls are started
> > as early as possible (this was already implemented in MM1). I will then
> > try to start with 100 threads to see how long it takes and how much
> > memory is required.

100 threads is too much with 32GB. That run OOM'd and was killed.
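
For reference, the "explicit garbage collection" and "setting variables
to None" experiments from the runs above look roughly like this. This is
only a minimal sketch: check_mirror(), crawl_one() and the pycurl options
used here are illustrative stand-ins, not the actual MM2 crawler code.

import gc
import pycurl

def check_mirror(url):
    # One HEAD-style probe against a mirror URL (illustrative only).
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.NOBODY, True)          # fetch headers only
    c.setopt(pycurl.CONNECTTIMEOUT, 10)
    c.perform()
    status = c.getinfo(pycurl.RESPONSE_CODE)
    c.close()
    return status

def crawl_one(url):
    status = check_mirror(url)
    # ... record the result for this mirror ...
    status = None   # explicitly setting variables to None at the end
    gc.collect()    # manually triggered garbage collection pass

As noted above, neither of these two steps made a noticeable difference
to the memory footprint in the test runs.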

> I would think that increasing the thread count would get bogged down by
> either network access or CPUs. Since we aren't seeing more than 130%
> CPU usage, I am guessing it is bogging down on network access (e.g. it
> can only poll so many networks per second per interface, and they can
> only return so quickly on that one interface). Do you think that having
> 2 or more crawler systems might do better?

I was hoping to add 2 more crawlers in the end. With a simple setup it
is possible to distribute the crawling across more machines: we know how
many mirror hosts we have, and each crawler can be given a host start
and stop id. The distribution will not be perfect, since it does not
take into account that mirrors might be inactive/disabled/private, but
as a simple setup to distribute the load it should be good enough.
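
A rough sketch of that idea is below. The --start-id/--stop-id option
names and the host id list are assumptions made for this example, not
existing MM2 crawler options.

import argparse

def parse_args():
    p = argparse.ArgumentParser(description="crawl one slice of the mirror hosts")
    p.add_argument("--start-id", type=int, required=True)
    p.add_argument("--stop-id", type=int, required=True)
    return p.parse_args()

def select_hosts(all_host_ids, start_id, stop_id):
    # Keep only the host ids this crawler instance is responsible for.
    return [h for h in all_host_ids if start_id <= h <= stop_id]

if __name__ == "__main__":
    args = parse_args()
    all_host_ids = list(range(1, 436))   # stand-in for the 435 hosts in the database
    for host_id in select_hosts(all_host_ids, args.start_id, args.stop_id):
        # Crawl this host. Inactive/disabled/private hosts still land in
        # a slice, so the split is not perfectly balanced.
        pass

Each crawler machine would then be started with a different,
non-overlapping id range, e.g. 1-217 on one machine and 218-435 on the
other.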

		Adrian