On Fri, Apr 16, 2010 at 3:12 PM, Nick Piggin <npiggin@xxxxxxx> wrote: > On Thu, Apr 15, 2010 at 09:33:08AM +0100, Steven Whitehouse wrote: >> Hi, >> >> On Thu, 2010-04-15 at 01:35 +0900, Minchan Kim wrote: >> > On Thu, 2010-04-15 at 00:13 +0900, Minchan Kim wrote: >> > > On Wed, Apr 14, 2010 at 9:49 PM, Steven Whitehouse <swhiteho@xxxxxxxxxx> wrote: >> > > >> When this module is run on my x86_64, 8 core, 12 Gb machine, then on an >> > > >> otherwise idle system I get the following results: >> > > >> >> > > >> vmalloc took 148798983 us >> > > >> vmalloc took 151664529 us >> > > >> vmalloc took 152416398 us >> > > >> vmalloc took 151837733 us >> > > >> >> > > >> After applying the two line patch (see the same bz) which disabled the >> > > >> delayed removal of the structures, which appears to be intended to >> > > >> improve performance in the smp case by reducing TLB flushes across cpus, >> > > >> I get the following results: >> > > >> >> > > >> vmalloc took 15363634 us >> > > >> vmalloc took 15358026 us >> > > >> vmalloc took 15240955 us >> > > >> vmalloc took 15402302 us >> > >> > >> > > >> >> > > >> So thats a speed up of around 10x, which isn't too bad. The question is >> > > >> whether it is possible to come to a compromise where it is possible to >> > > >> retain the benefits of the delayed TLB flushing code, but reduce the >> > > >> overhead for other users. My two line patch basically disables the delay >> > > >> by forcing a removal on each and every vfree. >> > > >> >> > > >> What is the correct way to fix this I wonder? >> > > >> >> > > >> Steve. >> > > >> >> > >> > In my case(2 core, mem 2G system), 50300661 vs 11569357. >> > It improves 4 times. >> > >> Looking at the code, it seems that the limit, against which my patch >> removes a test, scales according to the number of cpu cores. So with >> more cores, I'd expect the difference to be greater. I have a feeling >> that the original reporter had a greater number than the 8 of my test >> machine. >> >> > It would result from larger number of lazy_max_pages. >> > It would prevent many vmap_area freed. >> > So alloc_vmap_area takes long time to find new vmap_area. (ie, lookup >> > rbtree) >> > >> > How about calling purge_vmap_area_lazy at the middle of loop in >> > alloc_vmap_area if rbtree lookup were long? >> > >> That may be a good solution - I'm happy to test any patches but my worry >> is that any change here might result in a regression in whatever >> workload the lazy purge code was originally designed to improve. Is >> there any way to test that I wonder? > > Ah this is interesting. What we could do is have a "free area cache" > like the user virtual memory allocator has, which basically avoids > restarting the search from scratch. > > Or we could perhaps go one better and do a more sophisticated free space > allocator. AFAIR, vmalloc's performance regression is first. I am not sure whoever suffers from it and didn't report. Anyway, with fist report, complicated allocator implement is rather overkill, I think. So I votes free_area_cache. Early ending of lookup from last cache point makes overflow fast and it results in flush. I think it's good in that it doesn't depends on system resource environment. And it could improve search time than one from scratch unless it's very unfortunate. > > Bigger systems will indeed get hurt by increasing flushes so I'd prefer > to avoid that. But that's not a good justification for a slowdown for > small systems. What good is having cake if you can't also eat it? :) Indeed. :) -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>