On Thu, Jan 05, 2012 at 03:19:19PM -0800, Andrew Morton wrote:
> On Thu, 5 Jan 2012 22:31:06 +0000
> Mel Gorman <mel@xxxxxxxxx> wrote:
> 
> > On Thu, Jan 05, 2012 at 02:06:45PM -0800, Andrew Morton wrote:
> > > On Thu, 5 Jan 2012 16:17:39 +0000
> > > Mel Gorman <mel@xxxxxxxxx> wrote:
> > > 
> > > > mm: page allocator: Guard against CPUs going offline while draining per-cpu page lists
> > > > 
> > > > While running a CPU hotplug stress test under memory pressure, I
> > > > saw cases where, under enough stress, the machine would halt,
> > > > although it required a machine with 8 cores and plenty of memory.
> > > > I think the problems may be related.
> > > 
> > > When we first implemented them, the percpu pages in the page allocator
> > > were of really really marginal benefit. I didn't merge the patches at
> > > all for several cycles, and it was eventually a 49/51 decision.
> > > 
> > > So I suggest that our approach to solving this particular problem
> > > should be to nuke the whole thing, then see if that caused any
> > > observable problems. If it did, can we solve those problems by means
> > > other than bringing the dang things back?
> > > 
> > Sounds drastic.
> 
> Wrong thinking ;)
> 

:)

> Simplifying the code should always be the initial proposal. Adding
> more complexity on top is the worst-case when-all-else-failed option.
> Yet we so often reach for that option first :(
> 

Enngghh, I really want to agree with you, but reducing lock contention
has been such an important goal for so long that I am loath to just rip
it out and hope for the best.

> > It would be less controversial to replace this patch with a version
> > that calls get_online_cpus() in drain_all_pages() but removes the
> > drain_all_pages() call from the page allocator, on the grounds that it
> > is not safe against CPU hotplug, and to hell with the slightly
> > elevated allocation failure rates and stalls. That would avoid the
> > try_get_online_cpus() crappiness and be less complex.
> 
> If we can come up with a reasonably simple patch which improves or even
> fixes the problem then I suppose there is some value in that, as it
> provides users of earlier kernels with something to backport if they
> hit problems.
> 

I'm preparing a patch that is a simpler fix and does not send an IPI at
all. There is also a sysfs fix that is necessary for the tests to
complete successfully. The details will be in the series.
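To make that concrete, the sort of guard I had in mind above looks
roughly like the snippet below. This is an illustration only, not the
patch I'm preparing (which avoids the IPI entirely); it assumes the
current drain_local_pages()/on_each_cpu() implementation and that the
allocator-side drain_all_pages() call has been removed, so sleeping in
get_online_cpus() is acceptable:

#include <linux/cpu.h>		/* get_online_cpus(), put_online_cpus() */
#include <linux/smp.h>		/* on_each_cpu() */
#include <linux/gfp.h>		/* drain_local_pages() */

/*
 * Illustrative sketch only. Pin the set of online CPUs so none can go
 * offline between building the IPI mask and the per-cpu drain running.
 * get_online_cpus() can sleep, so this depends on drain_all_pages()
 * never being called from an atomic allocation path.
 */
void drain_all_pages(void)
{
	get_online_cpus();
	on_each_cpu(drain_local_pages, NULL, 1);
	put_online_cpus();
}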
> But the social downside of that is that everyone would shuffle off
> towards other bright and shiny things and we'd be stuck with more
> complexity piled on top of dubiously beneficial code.
> 
> > If you really want to consider deleting the per-cpu allocator, maybe
> > it could be a LSF/MM topic?
> 
> eek, spare me.
> 

It was worth a shot.

> Anyway, we couldn't discuss such a topic without data. Such data would
> be obtained by deleting the code and measuring the results. Which is
> what I just said ;)
> 

Crap, ok. I've added an item to my TODO list to implement a patch that
removes it. It is at a lower priority than removing lumpy reclaim
though - eventually this TODO list will start shrinking. I'll need to
put some thought into how it can be tested, but even then I am probably
not the best person to test it. I don't have regular access to a 2+
socket machine to test NUMA effects, for example.

> > Personally I would be wary of deleting it, but mostly because I lack
> > regular access to the type of hardware needed to evaluate whether it
> > is safe to remove or not. Minimally, removing the per-cpu allocator
> > could make the zone lock very hot, even though slub probably makes it
> > very hot already.
> 
> Much of the testing of the initial code was done on mbligh's weirdass
> NUMAq box: 32-way 386 NUMA which suffered really badly if there were
> contention issues. And even on that box, the code was marginal. So
> I'm hopeful that things will be similar on current machines. Of
> course, it's possible that calling patterns have changed in ways which
> make the code more beneficial than it used to be.
> 

Core counts are also higher, and some workloads might be more allocator
intensive than they used to be - netperf and network-related allocations
for socket receive might be a problem, for example.

> But this all ties into my proposal yesterday to remove
> mm/swap.c:lru_*_pvecs. Most or all of the heavy one-page-at-a-time
> code can pretty easily be converted to operate on batches of pages.
> 
> Following on from that, it should be pretty simple to extend the
> batching down into the page freeing. Look at put_pages_list() and
> weep. And stuff like free_hot_cold_page_list() which could easily free
> the pages directly while batching the locking.
> 
> Page freeing should be relatively straightforward. Batching page
> allocation is hard in some cases (anonymous pagefaults).
> 

Page faulting would certainly be hard to batch, but it would only be a
big problem if faults were intensive enough, and on enough CPUs, to
cause real zone lock contention.

> Please do note that the above suggestions are only needed if removing
> the pcp lists causes a problem! It may not.
> 

True.

-- 
Mel Gorman
SUSE Labs
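P.S. To be sure I follow the "free the pages directly while batching
the locking" idea, here is a very rough sketch of what I think you
mean. Same-zone, order-0 pages only; migratetype accounting, the free
page counters and the usual sanity checks are all waved away, and it
leans on __free_one_page() so it would have to live in mm/page_alloc.c:

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/spinlock.h>

/*
 * Rough sketch, not a patch: free a batch of order-0 pages back to the
 * buddy allocator while taking the zone lock once instead of once per
 * page. Assumes every page on @list belongs to @zone and skips the
 * per-cpu lists entirely.
 */
static void free_page_list_batched(struct zone *zone, struct list_head *list)
{
	struct page *page, *next;
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	list_for_each_entry_safe(page, next, list, lru) {
		list_del(&page->lru);
		/* __free_one_page() merges the page into the buddy lists */
		__free_one_page(page, zone, 0, get_pageblock_migratetype(page));
	}
	spin_unlock_irqrestore(&zone->lock, flags);
}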