* Liam Howlett <liam.howlett@xxxxxxxxxx> [210901 13:05]:
> * Vlastimil Babka <vbabka@xxxxxxx> [210901 06:52]:
> > On 8/31/21 22:40, Liam Howlett wrote:
> > >
> > > Hello,
> > >
> > > I've been trying to trace the mmap_lock calls using tracepoints and
> > > captured this behaviour which I cannot explain.  This was with the
> > > maple tree v2 patches running ebizzy with multiple threads in an
> > > x86_64 KVM using 8 virtual CPUs.
> > >
> > > AFAICT, there are zero callers that use the mmap_lock directly besides
> > > a prefetchw(&mm->mmap_lock);
> > >
> > >
> > > ebizzy-803 [000] .... 5376.655366: lock_release: 00000000de8cf25e lock
> > > ebizzy-803 [000] .... 5376.655366: lock_release: 00000000b5e38448 ptlock_ptr(page)
> > > ebizzy-803 [000] .... 5376.655367: lock_release: 000000006b581afd &mm->mmap_lock
> > > ebizzy-803 [000] .... 5376.655367: mmap_lock_released: mm=000000001de7b122 memcg_path= write=false
> > >
> > > ebizzy-803 [000] .... 5376.655369: mmap_lock_start_locking: mm=000000001de7b122 memcg_path= write=false
> > >
> > > ebizzy-803 [000] .... 5376.655369: mmap_lock_acquire_returned: mm=000000001de7b122 memcg_path= write=false success=false
> > >
> > > ebizzy-803 [000] .... 5376.655369: mmap_lock_start_locking: mm=000000001de7b122 memcg_path= write=false
> >
> > I was thinking khugepaged interference, but this seems to be system-wide
> > tracing (judging from <idle>-0) so you would see khugepaged too, right?
>
> Correct.  All calls will go through Michel's tracepoints from what I can
> tell.
>
> > On the other hand, it seems strange to have a long list of just cpu 0
> > here.  Are other CPUs missing, or is the interleaving just imperfect
> > because timestamps are not perfectly in sync between cpus, and in fact
> > another CPU took the lock?
>
> Yes, it could be the clock.  I had used the default clock, which is
> local to each CPU.  I was looking for documentation under 'time stamp'
> but should have looked for 'clock'.
> I've re-run the test with counter, which should remove the possibility
> of incorrect ordering.  I've also imported the trace into a spreadsheet
> and sorted by the counter to ensure there isn't interleaving of the
> printing causing issues.  This is why the output below has slightly
> different whitespace than the original.
>
> <dropping old log since it's not necessary>
>
> Please note that this is now next-20210811 directly.  No maple tree
> patches.
>
> As for the grouping of each thread, I don't really have an answer.
> There are places in the trace which do have different CPUs and threads
> running interleaved:
>
> ebizzy-1445 [000] .... 83693711 : lock_release: 0000000055bc357b &mm->mmap_lock
> ebizzy-1445 [000] .... 83693715 : mmap_lock_released: mm=00000000ce5bd903 memcg_path= write=false
> ebizzy-1437 [003] .... 83693727 : mmap_lock_start_locking: mm=00000000ce5bd903 memcg_path= write=false
> ebizzy-1437 [003] .... 83693730 : lock_acquire: 0000000055bc357b try read &mm->mmap_lock
> ebizzy-1437 [003] .... 83693733 : mmap_lock_acquire_returned: mm=00000000ce5bd903 memcg_path= write=false success=true
> ebizzy-1437 [003] .... 83693738 : lock_acquire: 000000005ada5d35 fs_reclaim
> ebizzy-1437 [003] .... 83693742 : lock_acquire: 000000009ff6ca04 mmu_notifier_invalidate_range_start
> ebizzy-1445 [000] .... 83693743 : mmap_lock_start_locking: mm=00000000ce5bd903 memcg_path= write=false
> ebizzy-1437 [003] .... 83693747 : lock_release: 000000009ff6ca04 mmu_notifier_invalidate_range_start
>
> > <removed old log>
> >
> > > The following tracepoints were enabled:
> > > events/mmap_lock/enable
> > > events/lock/enable
> > >
> > > My reading of the above trace is that the ebizzy thread dropped the
> > > lock and immediately attempted to reacquire it, detected it was
> > > contended, added itself to the wait list, and went to sleep, only to
> > > wake up and get the lock?
> >
> > Which path does this mmap_read_trylock() followed by an immediate
> > mmap_read_lock() anyway?
> > I mean, down_read() is implemented like this internally, but that
> > wouldn't generate these mmap_lock_ events.
>
> I think it's do_user_addr_fault()?
>
> > > Doesn't that mean the mmap_lock is _not_ contended?  The lack of
> > > contention makes sense since there is no tracepoint of an attempt to
> > > acquire the lock between the dropping of the lock and the same MM
> > > waking up to get the lock.
> > >
> > > What is even stranger is that if I change the value of
> > > _Q_PENDING_LOOPS from (1 << 9) to (1 << 10), the benchmark performs
> > > better.
> > >
> > > From the above trace and the effects of the _Q_PENDING_LOOPS change,
> > > it looks like the lock is somehow remaining in a state that causes a
> > > failure to acquire the lock even when it is not held or contended?

So the new log was taken with trace_clock set to counter, on
next-20210811, running ./ebizzy -m (only use mmap) in a KVM guest with
8 CPUs.

I was able to get to the bottom of what was happening with a lot of help
from Steven Rostedt.  The logs can be explained by the following:

- Sequence counters may be missed if a given CPU overruns its ring
  buffer.  These gaps are marked by "#### CPU N buffer started ####" or
  something similar.

- There may also be a jump in the sequence counter caused by interrupts.

- The mmap_lock logging was racy with other threads' logging of the
  mmap_lock.  I have sent out a patch [1] to fix the race.

- When downgrading the writer, the log will show the lock being acquired
  as a read lock (write=false) without a 'start locking' log.

1. https://lore.kernel.org/linux-mm/20210903013521.1802774-1-Liam.Howlett@xxxxxxxxxx/

Thanks everyone for the help in tracking down the odd behaviour.  It was
nice to find out it wasn't entirely user error :)

Cheers,
Liam
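[Editor's note: the tracing setup discussed in this thread (the
mmap_lock and lock event groups, plus the counter trace_clock) can be
reproduced roughly as below.  This is a sketch, not something quoted
from the thread: the tracefs mount point and the dependence of the lock
events on lockdep are assumptions about the reader's configuration.]

```shell
# Sketch of the ftrace setup described above.  The tracefs mount point
# is an assumption: newer kernels use /sys/kernel/tracing, older setups
# expose it at /sys/kernel/debug/tracing.
cd /sys/kernel/tracing

# Enable the tracepoint groups mentioned in the thread.  Note that the
# events/lock group (lock_acquire/lock_release) is only available on
# kernels built with CONFIG_LOCKDEP.
echo 1 > events/mmap_lock/enable
echo 1 > events/lock/enable

# Switch from the default per-CPU 'local' clock to the globally ordered
# 'counter' clock, so events from different CPUs can be ordered reliably.
echo counter > trace_clock

# Stream the trace.
cat trace_pipe
```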
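[Editor's note: the "import into a spreadsheet and sort by the counter"
step can also be done with a few lines of script.  Below is a minimal
sketch assuming the field layout visible in the excerpts above
(task, [cpu], flags, counter, event); the regex and the function name
`sort_by_counter` are illustrative, not anything used in the thread.]

```python
import re

# Matches lines like:
#   ebizzy-1437 [003] .... 83693730 : lock_acquire: 0000000055bc357b ...
# The field layout (task, [cpu], flags, counter, event) is inferred from
# the log excerpts quoted in this thread.
TRACE_RE = re.compile(
    r"^\s*(?P<task>\S+)\s+\[(?P<cpu>\d+)\]\s+\S+\s+"
    r"(?P<counter>\d+)\s*:\s*(?P<event>.*)$"
)

def sort_by_counter(lines):
    """Return trace lines ordered by the global 'counter' clock value.

    With trace_clock=counter each event carries a globally ordered
    sequence number, so sorting on it recovers the true event order even
    when the per-CPU ring buffers are printed out of order.
    Unparseable lines are dropped.
    """
    parsed = []
    for line in lines:
        m = TRACE_RE.match(line)
        if m:
            parsed.append((int(m.group("counter")), line))
    return [line for _, line in sorted(parsed)]

if __name__ == "__main__":
    log = [
        "ebizzy-1437 [003] .... 83693730 : lock_acquire: 0000000055bc357b try read &mm->mmap_lock",
        "ebizzy-1445 [000] .... 83693711 : lock_release: 0000000055bc357b &mm->mmap_lock",
        "ebizzy-1445 [000] .... 83693715 : mmap_lock_released: mm=00000000ce5bd903 memcg_path= write=false",
    ]
    for line in sort_by_counter(log):
        print(line)
```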