On Thu, 5 Jan 2012 22:31:06 +0000 Mel Gorman <mel@xxxxxxxxx> wrote:

> On Thu, Jan 05, 2012 at 02:06:45PM -0800, Andrew Morton wrote:
> > On Thu, 5 Jan 2012 16:17:39 +0000
> > Mel Gorman <mel@xxxxxxxxx> wrote:
> >
> > > mm: page allocator: Guard against CPUs going offline while draining per-cpu page lists
> > >
> > > While running a CPU hotplug stress test under memory pressure, I
> > > saw cases where under enough stress the machine would halt,
> > > although it required a machine with 8 cores and plenty of memory.
> > > I think the problems may be related.
> >
> > When we first implemented them, the percpu pages in the page
> > allocator were of really, really marginal benefit.  I didn't merge
> > the patches at all for several cycles, and it was eventually a
> > 49/51 decision.
> >
> > So I suggest that our approach to solving this particular problem
> > should be to nuke the whole thing, then see if that caused any
> > observable problems.  If it did, can we solve those problems by
> > means other than bringing the dang things back?
>
> Sounds drastic.

Wrong thinking ;)

Simplifying the code should always be the initial proposal.  Adding
more complexity on top is the worst-case, when-all-else-failed option.
Yet we so often reach for that option first :(

> It would be less controversial to replace this patch with a version
> that calls get_online_cpus() in drain_all_pages() but remove the call
> to drain_all_pages() from the page allocator, on the grounds that it
> is not safe against CPU hotplug, and to hell with the slightly
> elevated allocation failure rates and stalls.  That would avoid the
> try_get_online_cpus() crappiness and be less complex.

If we can come up with a reasonably simple patch which improves or even
fixes the problem then I suppose there is some value in that, as it
provides users of earlier kernels with something to backport if they
hit problems.

But the social downside of that is that everyone would shuffle off
towards other bright and shiny things and we'd be stuck with more
complexity piled on top of dubiously beneficial code.
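For reference, the less controversial variant would look roughly like
the sketch below.  This assumes the 3.2-era drain_all_pages() and
drain_local_pages() in mm/page_alloc.c and is an untested illustration,
not a proposed patch; the other half of the suggestion, deleting the
drain_all_pages() call from the allocator's slow path, is a plain
removal and isn't shown.

/*
 * Untested sketch: take the hotplug lock around the IPI broadcast so a
 * CPU cannot go offline while its per-cpu lists are being drained.
 * Whether every remaining caller can tolerate the sleeping
 * get_online_cpus() is exactly the open question in this thread.
 */
void drain_all_pages(void)
{
	get_online_cpus();
	on_each_cpu(drain_local_pages, NULL, 1);
	put_online_cpus();
}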
> If you really want to consider deleting the per-cpu allocator, maybe
> it could be a LSF/MM topic?

eek, spare me.

Anyway, we couldn't discuss such a topic without data.  Such data would
be obtained by deleting the code and measuring the results.  Which is
what I just said ;)

> Personally I would be wary of deleting it, but mostly because I lack
> regular access to the type of hardware to evaluate whether it was
> safe to remove or not.  Minimally, removing the per-cpu allocator
> could make the zone lock very hot, even though slub probably makes it
> very hot already.

Much of the testing of the initial code was done on mbligh's weirdass
NUMAq box: a 32-way 386 NUMA machine which suffered really badly if
there were contention issues.  And even on that box, the code was
marginal.

So I'm hopeful that things will be similar on current machines.  Of
course, it's possible that calling patterns have changed in ways which
make the code more beneficial than it used to be.

But this all ties into my proposal yesterday to remove
mm/swap.c:lru_*_pvecs.  Most or all of the heavy one-page-at-a-time
code can pretty easily be converted to operate on batches of pages.

Following on from that, it should be pretty simple to extend the
batching down into the page freeing.  Look at put_pages_list() and
weep.  And stuff like free_hot_cold_page_list() could easily free the
pages directly while batching the locking; a rough sketch of that
follows at the end.

Page freeing should be relatively straightforward.  Batching page
allocation is hard in some cases (anonymous pagefaults).

Please do note that the above suggestions are only needed if removing
the pcp lists causes a problem!  It may not.
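To make the free_hot_cold_page_list() point concrete, here is a rough,
untested sketch of freeing a whole list of pages straight to the buddy
lists while taking zone->lock once per run of same-zone pages.  It
assumes a 3.2-era mm/page_alloc.c, where __free_one_page() and the
migratetype stashed in page_private() are available in-file; the
free_pages_direct_batched() name is made up, and the usual per-page
preparation (free_pages_check(), vm event accounting and so on) is
assumed to have happened already.

/*
 * Illustration only: batch the zone->lock across a list of pages
 * instead of bouncing each page through the per-cpu lists.  Assumes
 * each page has already been through the normal freeing preparation
 * and has its migratetype stored in page_private().
 */
static void free_pages_direct_batched(struct list_head *list)
{
	struct zone *locked_zone = NULL;
	unsigned long flags = 0;
	struct page *page, *next;

	list_for_each_entry_safe(page, next, list, lru) {
		struct zone *zone = page_zone(page);

		/* Only cycle the lock when we cross a zone boundary */
		if (zone != locked_zone) {
			if (locked_zone)
				spin_unlock_irqrestore(&locked_zone->lock, flags);
			locked_zone = zone;
			spin_lock_irqsave(&locked_zone->lock, flags);
		}

		list_del(&page->lru);
		/* Order-0 free straight into the buddy lists */
		__free_one_page(page, zone, 0, page_private(page));
	}
	if (locked_zone)
		spin_unlock_irqrestore(&locked_zone->lock, flags);
}

free_pcppages_bulk() already frees batches to the buddy lists under
zone->lock on the drain side, so the locking pattern itself is nothing
new; the sketch just short-circuits the hop through the pcp lists.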