Re: Potential race in TLB flush batching?

On Fri, Jul 14, 2017 at 05:00:41PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-07-11 at 15:07 -0700, Andy Lutomirski wrote:
> > On Tue, Jul 11, 2017 at 12:18 PM, Mel Gorman <mgorman@xxxxxxx> wrote:
> > 
> > I would change this slightly:
> > 
> > > +void flush_tlb_batched_pending(struct mm_struct *mm)
> > > +{
> > > +       if (mm->tlb_flush_batched) {
> > > +               flush_tlb_mm(mm);
> > 
> > How about making this a new helper arch_tlbbatch_flush_one_mm(mm);
> > The idea is that this could be implemented as flush_tlb_mm(mm), but
> > the actual semantics needed are weaker.  All that's really needed
> > AFAICS is to make sure that any arch_tlbbatch_add_mm() calls on this
> > mm that have already happened become effective by the time that
> > arch_tlbbatch_flush_one_mm() returns.
> 
> Jumping in ... I just discovered that 'new' batching stuff... is it
> documented anywhere ?
> 

This should be a new thread.

The original commit log has many of the details and the comments have
others. With Andy's work on top, it's clearer what the boundaries are and
what is needed from an architecture; right now that is easiest to see in
tip/x86/mm.
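
To give a rough idea of the shape of it: with that rework, the arch-facing
side boils down to a small batch structure plus two hooks, roughly like the
sketch below (this is from memory, not the exact tip/x86/mm code):

    /* Sketch only: an architecture opting into the batching provides a
     * batch structure and two hooks; generic reclaim code does the rest. */
    struct arch_tlbflush_unmap_batch {
            /* on x86: the set of CPUs that may hold stale TLB entries */
            struct cpumask cpumask;
    };

    /* Called as PTEs are cleared during reclaim: record that this mm
     * will need flushing, but send no IPIs yet. */
    static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
                                            struct mm_struct *mm)
    {
            cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
    }

    /* Called when reclaim has finished a batch of unmaps: make all the
     * recorded flushes effective, e.g. with a single round of IPIs,
     * before the pages are freed. */
    void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);

The requirement on an architecture is essentially the one Andy describes
above: by the time the flush hook returns, any mm previously passed to
arch_tlbbatch_add_mm() must have no stale entries left on any CPU.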

> We already had some form of batching via the mmu_gather, now there's a
> different, somewhat orthogonal one, and it's completely unclear what it's
> about and why we couldn't use what we already had. Also what
> assumptions it makes if I want to port it to my arch....
> 

The batching in this context is more about mm's than individual pages
and was done this way because the number of mm's to track was potentially
unbounded. At the time of implementation, tracking individual pages and the
extra bits for mmu_gather was overkill and fairly complex due to the need
to potentially restart when the gather structure filled.
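
Concretely, the per-task state is just a small fixed-size batch (on x86 a
cpumask) that gets OR'ed into as pages are unmapped, so it never fills and
never needs to be restarted no matter how many pages or mm's are involved.
A simplified sketch of the generic side, not the verbatim mm/rmap.c code:

    struct tlbflush_unmap_batch {
            struct arch_tlbflush_unmap_batch arch;  /* e.g. a cpumask on x86 */
            bool flush_required;
    };

    /* Called for each page as its PTE is cleared during reclaim. */
    static void set_tlb_ubc_flush_pending(struct mm_struct *mm)
    {
            struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

            arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
            tlb_ubc->flush_required = true;
    }

    /* Called once per reclaim batch, before the pages are freed. */
    void try_to_unmap_flush(void)
    {
            struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

            if (!tlb_ubc->flush_required)
                    return;

            arch_tlbbatch_flush(&tlb_ubc->arch);
            tlb_ubc->flush_required = false;
    }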

It may also only be a gain on a limited number of architectures, depending
on exactly how an architecture handles flushing. At the time, batching
this for x86 in the worst-case scenario, where all pages being reclaimed
were mapped from multiple threads, knocked 24.4% off elapsed run time and
29% off system CPU, but only on multi-socket NUMA machines. On UMA, it was
barely noticeable. For some workloads where only a few pages are mapped or
the mapped pages on the LRU are relatively sparse, it'll make no difference.

The worst-case situation was extremely IPI-intensive on x86, where many
IPIs were being sent for each unmap. It's only worth considering if
you see that the time spent sending IPIs for flushes is a large portion
of reclaim.

-- 
Mel Gorman
SUSE Labs
