On Thu, Jan 05, 2012 at 03:19:19PM -0800, Andrew Morton wrote:
> On Thu, 5 Jan 2012 22:31:06 +0000
> Mel Gorman <mel@xxxxxxxxx> wrote:
> 
> > On Thu, Jan 05, 2012 at 02:06:45PM -0800, Andrew Morton wrote:
> > > On Thu, 5 Jan 2012 16:17:39 +0000
> > > Mel Gorman <mel@xxxxxxxxx> wrote:
> > > 
> > > > mm: page allocator: Guard against CPUs going offline while draining per-cpu page lists
> > > > 
> > > > While running a CPU hotplug stress test under memory pressure, I
> > > > saw cases where, under enough stress, the machine would halt,
> > > > although it required a machine with 8 cores and plenty of memory.
> > > > I think the problems may be related.
> > > 
> > > When we first implemented them, the percpu pages in the page allocator
> > > were of really really marginal benefit. I didn't merge the patches at
> > > all for several cycles, and it was eventually a 49/51 decision.
> > > 
> > > So I suggest that our approach to solving this particular problem
> > > should be to nuke the whole thing, then see if that caused any
> > > observable problems. If it did, can we solve those problems by means
> > > other than bringing the dang things back?
> > > 
> > Sounds drastic.
> 
> Wrong thinking ;)
> 

:)

> Simplifying the code should always be the initial proposal. Adding
> more complexity on top is the worst-case when-all-else-failed option.
> Yet we so often reach for that option first :(
> 

Enngghh, I really want to agree with you, but reducing lock contention
has been such an important goal for so long that I am loath to just rip
it out and hope for the best.

> > It would be less controversial to replace this patch with a version
> > that calls get_online_cpus() in drain_all_pages() but removes the
> > drain_all_pages() call from the page allocator, on the grounds that it
> > is not safe against CPU hotplug, and to hell with the slightly
> > elevated allocation failure rates and stalls. That would avoid the
> > try_get_online_cpus() crappiness and be less complex.
> 
> If we can come up with a reasonably simple patch which improves or even
> fixes the problem then I suppose there is some value in that, as it
> provides users of earlier kernels with something to backport if they
> hit problems.
> 

I'm preparing a patch that is a simpler fix and does not send an IPI at
all. There is also a sysfs fix that is necessary for the tests to
complete successfully. The details will be in the series.
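To make that concrete, the sort of guard I had in mind above looks
roughly like the snippet below. This is an illustration only, not the
patch I'm preparing (which avoids the IPI entirely); it assumes the
current drain_local_pages()/on_each_cpu() implementation and that the
allocator-side drain_all_pages() call has been removed, so sleeping in
get_online_cpus() is acceptable:

#include <linux/cpu.h>		/* get_online_cpus(), put_online_cpus() */
#include <linux/smp.h>		/* on_each_cpu() */
#include <linux/gfp.h>		/* drain_local_pages() */

/*
 * Illustrative sketch only. Pin the set of online CPUs so none can go
 * offline between building the IPI mask and the per-cpu drain running.
 * get_online_cpus() can sleep, so this depends on drain_all_pages()
 * never being called from an atomic allocation path.
 */
void drain_all_pages(void)
{
	get_online_cpus();
	on_each_cpu(drain_local_pages, NULL, 1);
	put_online_cpus();
}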
> But the social downside of that is that everyone would shuffle off
> towards other bright and shiny things and we'd be stuck with more
> complexity piled on top of dubiously beneficial code.
> 
> > If you really want to consider deleting the per-cpu allocator, maybe
> > it could be a LSF/MM topic?
> 
> eek, spare me.
> 

It was worth a shot.

> Anyway, we couldn't discuss such a topic without data. Such data would
> be obtained by deleting the code and measuring the results. Which is
> what I just said ;)
> 

Crap, ok. I've added an item to my TODO list to implement a patch that
removes it. It is at a lower priority than removing lumpy reclaim
though - eventually this TODO list will start shrinking. I'll need to
put some thought into how it can be tested, but even then I am probably
not the best person to test it. I don't have regular access to a 2+
socket machine to test NUMA effects, for example.

> > Personally I would be wary of deleting it, but mostly because I lack
> > regular access to the type of hardware needed to evaluate whether it
> > is safe to remove or not. Minimally, removing the per-cpu allocator
> > could make the zone lock very hot, even though slub probably makes it
> > very hot already.
> 
> Much of the testing of the initial code was done on mbligh's weirdass
> NUMAq box: 32-way 386 NUMA which suffered really badly if there were
> contention issues. And even on that box, the code was marginal. So
> I'm hopeful that things will be similar on current machines. Of
> course, it's possible that calling patterns have changed in ways which
> make the code more beneficial than it used to be.
> 

Core counts are also higher, and some workloads might be more allocator
intensive than they used to be - netperf and network-related allocations
for socket receive might be a problem, for example.

> But this all ties into my proposal yesterday to remove
> mm/swap.c:lru_*_pvecs. Most or all of the heavy one-page-at-a-time
> code can pretty easily be converted to operate on batches of pages.
> 
> Following on from that, it should be pretty simple to extend the
> batching down into the page freeing. Look at put_pages_list() and
> weep. And stuff like free_hot_cold_page_list() which could easily free
> the pages directly while batching the locking.
> 
> Page freeing should be relatively straightforward. Batching page
> allocation is hard in some cases (anonymous pagefaults).
> 

Page faulting would certainly be hard to batch, but it would only be a
big problem if faults were intensive enough, and on enough CPUs, to
cause real zone lock contention.

> Please do note that the above suggestions are only needed if removing
> the pcp lists causes a problem! It may not.
> 

True.

-- 
Mel Gorman
SUSE Labs
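P.S. To be sure I follow the "free the pages directly while batching
the locking" idea, here is a very rough sketch of what I think you
mean. Same-zone, order-0 pages only; migratetype accounting, the free
page counters and the usual sanity checks are all waved away, and it
leans on __free_one_page() so it would have to live in mm/page_alloc.c:

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/spinlock.h>

/*
 * Rough sketch, not a patch: free a batch of order-0 pages back to the
 * buddy allocator while taking the zone lock once instead of once per
 * page. Assumes every page on @list belongs to @zone and skips the
 * per-cpu lists entirely.
 */
static void free_page_list_batched(struct zone *zone, struct list_head *list)
{
	struct page *page, *next;
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	list_for_each_entry_safe(page, next, list, lru) {
		list_del(&page->lru);
		/* __free_one_page() merges the page into the buddy lists */
		__free_one_page(page, zone, 0, get_pageblock_migratetype(page));
	}
	spin_unlock_irqrestore(&zone->lock, flags);
}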