On Fri, Mar 20, 2015 at 04:38:24PM +0100, Adrian Reber wrote:
> The biggest MM2 problem which currently exists is that the crawler
> segfaults when running with more than 10 or 12 threads. The current
> configuration runs daily with 75 threads and crashes regularly:
>
> [740512.481002] mm2_crawler[18149]: segfault at 30 ip 00007ffdd8201557 sp 00007ffd787d5250 error 4 in libcurl.so.4.3.0[7ffdd81d8000+63000]
> [783445.620762] mm2_crawler[20500]: segfault at 30 ip 00007f87477ff557 sp 00007f86e7fd4250 error 4 in libcurl.so.4.3.0[7f87477d6000+63000]
> [826619.130431] mm2_crawler[24376]: segfault at 30 ip 00007f7cee7ac557 sp 00007f7c8cfde250 error 4 in libcurl.so.4.3.0[7f7cee783000+63000]
> [869846.873962] mm2_crawler[27771]: segfault at 30 ip 00007ffd3bc07557 sp 00007ffd11ff8250 error 4 in libcurl.so.4.3.0[7ffd3bbde000+63000]
>
> Preloading libcurl from F21 on the command line
> seems to make the segfault go away. So somewhere between
> curl 7.29 (RHEL 7.1) and curl 7.37 (F21) something was fixed
> that would be needed on RHEL 7.1 before switching to MM2.
This is discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1204825
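For reference, the workaround amounts to preloading the newer libcurl so the dynamic linker resolves its symbols before the RHEL 7.1 system copy. This is only a sketch; the library path and the crawler invocation are assumptions, not the exact command used:

```shell
# Assumed path: wherever the F21 libcurl build was copied on the crawler host.
# LD_PRELOAD makes ld.so load this library first, so its symbols shadow the
# system libcurl.so.4 from RHEL 7.1 for this process only.
LD_PRELOAD=/usr/local/lib/f21/libcurl.so.4.3.0 mm2_crawler --threads 75
```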
> Additionally, the 4 GB of RAM on mm-crawler01 are not enough to
> crawl all the mirrors in a reasonable time. Even when started
> with only 20 crawler threads instead of 75, the 4 GB are not
> enough.
This has been increased to 32GB (thanks) and I had a few test runs of the crawler
over the weekend with libcurl from F21:
Each run over 435 mirrors took at least 6 hours:
50 threads:
http://lisas.de/~adrian/crawler-resources/2015-03-21-19-51-44-crawler-resources.pdf
50 threads with explicit garbage collection:
http://lisas.de/~adrian/crawler-resources/2015-03-22-06-18-30-crawler-resources.pdf
75 threads:
http://lisas.de/~adrian/crawler-resources/2015-03-22-13-02-37-crawler-resources.pdf
75 threads with explicitly setting variables to None at the end:
http://lisas.de/~adrian/crawler-resources/2015-03-23-07-46-19-crawler-resources.pdf
Manually triggering the garbage collector makes almost no difference (if
any at all). The crawler takes a huge amount of memory and a really long
time.
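The two mitigations tested in the runs above (forcing a collection pass and explicitly dropping references at the end of each unit of work) amount to something like the following. This is a minimal self-contained sketch, not the actual crawler code; `process_mirror` and its payload are made up for illustration:

```python
import gc

def process_mirror(payload):
    # Stand-in for per-mirror crawl work: build a large temporary structure.
    data = [bytearray(1024) for _ in range(100)]
    result = len(data)
    # Mitigation 1: set the variable to None so the objects lose their
    # last reference as early as possible.
    data = None
    # Mitigation 2: force a full collection pass instead of waiting for the
    # automatic threshold-based collector; returns the number of
    # unreachable objects found.
    unreachable = gc.collect()
    return result, unreachable

result, unreachable = process_mirror(b"mirror-payload")
print(result)  # 100
```

Since CPython frees acyclic objects by reference counting as soon as the last reference goes away, neither step helps much unless the leak involves reference cycles, which is consistent with the flat memory curves in the PDFs above.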
As much as I like the new threaded design, I am not 100% convinced it is
the best solution when looking at the memory requirements. Somewhere
memory must be leaking.
The next change I will make is to sort the mirrors in descending order by
crawl duration to make sure the longest-running crawls are started as
early as possible (this was already implemented in MM1). I will then try
starting with 100 threads to see how long a run takes and how much memory
is required.
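The planned scheduling order is a longest-first heuristic: with a fixed thread pool, starting the slowest crawls first keeps them from becoming the tail at the end of the run. A rough sketch (the dict field names are assumptions, not MM2's actual schema):

```python
# Hypothetical per-mirror records with the duration of the previous crawl
# in seconds; in MM2 this would come from the database.
mirrors = [
    {"host": "mirror-a", "last_crawl_duration": 1200},
    {"host": "mirror-b", "last_crawl_duration": 19800},
    {"host": "mirror-c", "last_crawl_duration": 4500},
]

# Longest previous crawl first, so the slowest mirrors get a thread
# as early as possible.
queue = sorted(mirrors, key=lambda m: m["last_crawl_duration"], reverse=True)
print([m["host"] for m in queue])  # ['mirror-b', 'mirror-c', 'mirror-a']
```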