Re: [PATCH v5 7/8] mm: Only IPI CPUs to drain local pages if they exist

On Thu, 5 Jan 2012 22:31:06 +0000
Mel Gorman <mel@xxxxxxxxx> wrote:

> On Thu, Jan 05, 2012 at 02:06:45PM -0800, Andrew Morton wrote:
> > On Thu, 5 Jan 2012 16:17:39 +0000
> > Mel Gorman <mel@xxxxxxxxx> wrote:
> > 
> > > mm: page allocator: Guard against CPUs going offline while draining per-cpu page lists
> > > 
> > > While running a CPU hotplug stress test under memory pressure, I
> > > saw cases where, under enough stress, the machine would halt, although
> > > reproducing it required a machine with 8 cores and plenty of memory. I think the
> > > problems may be related.
> > 
> > When we first implemented them, the percpu pages in the page allocator
> > were of really really marginal benefit.  I didn't merge the patches at
> > all for several cycles, and it was eventually a 49/51 decision.
> > 
> > So I suggest that our approach to solving this particular problem
> > should be to nuke the whole thing, then see if that caused any
> > observable problems.  If it did, can we solve those problems by means
> > other than bringing the dang things back?
> > 
> 
> Sounds drastic.

Wrong thinking ;)

Simplifying the code should always be the initial proposal.  Adding
more complexity on top is the worst-case when-all-else-failed option. 
Yet we so often reach for that option first :(

> It would be less controversial to replace this patch
> with a version that calls get_online_cpus() in drain_all_pages() but
> removes the drain_all_pages() call from the page allocator on
> the grounds it is not safe against CPU hotplug and to hell with the
> slightly elevated allocation failure rates and stalls. That would avoid
> the try_get_online_cpus() crappiness and be less complex.

If we can come up with a reasonably simple patch which improves or even
fixes the problem then I suppose there is some value in that, as it
provides users of earlier kernels with something to backport if they
hit problems.
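
For the record, the sort of thing being described would presumably look
something like this (untested sketch against the current mm/page_alloc.c,
with the allocator-slowpath caller of drain_all_pages() simply removed):

	/*
	 * Hold the CPU hotplug read lock across the IPI so CPUs cannot go
	 * offline while drain_local_pages() is being broadcast.  This is
	 * only legal from sleepable context, hence dropping the call from
	 * the allocator slowpath rather than converting it.
	 */
	void drain_all_pages(void)
	{
		get_online_cpus();
		on_each_cpu(drain_local_pages, NULL, 1);
		put_online_cpus();
	}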

But the social downside of that is that everyone would shuffle off
towards other bright and shiny things and we'd be stuck with more
complexity piled on top of dubiously beneficial code.

> If you really want to consider deleting the per-cpu allocator, maybe
> it could be a LSF/MM topic?

eek, spare me.

Anyway, we couldn't discuss such a topic without data.  Such data would
be obtained by deleting the code and measuring the results.  Which is
what I just said ;)

> Personally I would be wary of deleting
> it but mostly because I lack regular access to the type of hardware
> to evaluate whether it was safe to remove or not. Minimally, removing
> the per-cpu allocator could make the zone lock very hot even though slub
> probably makes it very hot already.

Much of the testing of the initial code was done on mbligh's weirdass
NUMAq box: 32-way 386 NUMA which suffered really badly if there were
contention issues.  And even on that box, the code was marginal.  So
I'm hopeful that things will be similar on current machines.  Of
course, it's possible that calling patterns have changed in ways which
make the code more beneficial than it used to be.

But this all ties into my proposal yesterday to remove
mm/swap.c:lru_*_pvecs.  Most or all of the heavy one-page-at-a-time
code can pretty easily be converted to operate on batches of pages. 
Following on from that, it should be pretty simple to extend the
batching down into the page freeing.  Look at put_pages_list() and
weep.  And stuff like free_hot_cold_page_list() which could easily free
the pages directly while batching the locking.
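
To illustrate what I mean by "batching the locking" - a hypothetical
sketch only (made-up function name, it would have to live in
mm/page_alloc.c next to __free_one_page()), which assumes every page on
the list belongs to the same zone and that free_pages_prepare() and the
vmstat accounting have already been dealt with:

	static void free_pages_batched(struct zone *zone, struct list_head *list)
	{
		struct page *page, *next;
		unsigned long flags;

		/* One zone->lock round trip for the whole batch, not per page */
		spin_lock_irqsave(&zone->lock, flags);
		list_for_each_entry_safe(page, next, list, lru) {
			list_del(&page->lru);
			__free_one_page(page, zone, 0,
					get_pageblock_migratetype(page));
		}
		spin_unlock_irqrestore(&zone->lock, flags);
	}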

Page freeing should be relatively straightforward.  Batching page
allocation is hard in some cases (anonymous pagefaults).

Please do note that the above suggestions are only needed if removing
the pcp lists causes a problem!  It may not.


