On Mon, Mar 23, 2015 at 10:11:56AM -0600, Stephen John Smoogen wrote:
> On 23 March 2015 at 09:59, Adrian Reber <adrian@xxxxxxxx> wrote:
> > > Additionally the 4GB of RAM on mm-crawler01 are not enough to
> > > crawl all the mirrors in a reasonable time. Even if only
> > > started with 20 crawler threads instead of 75 the 4GB are not
> > > enough.
> >
> > This has been increased to 32GB (thanks) and I did a few test runs
> > of the crawler over the weekend with libcurl from F21.
> >
> > All runs over 435 mirrors took at least 6 hours:
> >
> > 50 threads:
> > http://lisas.de/~adrian/crawler-resources/2015-03-21-19-51-44-crawler-resources.pdf
> >
> > 50 threads with explicit garbage collection:
> > http://lisas.de/~adrian/crawler-resources/2015-03-22-06-18-30-crawler-resources.pdf
> >
> > 75 threads:
> > http://lisas.de/~adrian/crawler-resources/2015-03-22-13-02-37-crawler-resources.pdf
> >
> > 75 threads with explicitly setting variables to None at the end:
> > http://lisas.de/~adrian/crawler-resources/2015-03-23-07-46-19-crawler-resources.pdf
> >
> > Manually triggering the garbage collector makes almost no difference
> > (if any at all). The crawler takes a huge amount of memory and a
> > really long time.
> >
> > As much as I like the new threaded design, I am not 100% convinced it
> > is the best solution when looking at the memory requirements.
> > Somewhere memory must be leaking.
> >
> > The next change will be to sort the mirrors in descending order of
> > crawl duration, so that the longest-running crawls are started as
> > early as possible (this was already implemented in MM1). I will then
> > try to start with 100 threads to see how long it takes and how much
> > memory is required.

100 threads is too many with 32GB: that run OOM'd and was killed.

> I would think that increasing the number of threads would get bogged
> down by either network access or CPU. Since we aren't seeing more than
> 130% CPU usage, I am guessing it is bogging down on network access
> (e.g. it can only poll so many networks per second per interface, and
> they can only return so quickly on that one interface). Do you think
> that having 2 or more crawler systems might do better?

I was hoping to add 2 more crawlers in the end. With a simple setup it
is possible to distribute the crawling across more machines. We know how
many mirror hosts we have, and each crawler can be given a host start
and stop id. This distribution will not be perfect, as it does not take
into account that mirrors might be inactive/disabled/private, but as a
simple way to spread the load it should be good enough (rough sketch
below).

		Adrian
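
P.S. To make the planned splitting a bit more concrete, here is a rough,
hypothetical sketch of combining the "host start and stop id" range with
the descending sort by previous crawl duration. The function name and the
dict fields (id, last_crawl_duration) are made up for illustration; this
is not the actual crawler code.

#!/usr/bin/env python
# Hypothetical sketch only, not MirrorManager code. Each crawler
# instance gets a host id range and works on the hosts with the
# longest previous crawls first.

def select_hosts(hosts, start_id=None, stop_id=None):
    """Return the hosts this crawler instance is responsible for,
    ordered so that the longest previous crawls are started first."""
    picked = [h for h in hosts
              if (start_id is None or h['id'] >= start_id)
              and (stop_id is None or h['id'] <= stop_id)]
    return sorted(picked,
                  key=lambda h: h.get('last_crawl_duration', 0),
                  reverse=True)

if __name__ == '__main__':
    hosts = [{'id': 1, 'last_crawl_duration': 5400},
             {'id': 2, 'last_crawl_duration': 21000},
             {'id': 3, 'last_crawl_duration': 900}]
    # crawler instance responsible only for host ids 1-2
    ids = [h['id'] for h in select_hosts(hosts, start_id=1, stop_id=2)]
    print(ids)  # -> [2, 1]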