Re: [PATCH v1 for-next 06/16] IB/core: Implement support for MMU notifiers regarding on demand paging regions


 



On Sun, Sep 07, 2014 at 02:35:59PM +0000, Shachar Raindel wrote:
> Hi,
> 
> > -----Original Message-----
> > From: Jerome Glisse [mailto:j.glisse@xxxxxxxxx]
> > Sent: Thursday, September 04, 2014 11:25 PM
> > To: Haggai Eran; linux-rdma@xxxxxxxxxxxxxxx
> > Cc: Shachar Raindel; Sagi Grimberg
> > Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support for
> > MMU notifiers regarding on demand paging regions
> > 
> > > * Add an interval tree implementation for ODP umems. Create an interval
> > >   tree for each ucontext (including a count of the number of ODP MRs in
> > >   this context, mutex, etc.), and register ODP umems in the interval tree.
> > > * Add MMU notifiers handling functions, using the interval tree to notify
> > >   only the relevant umems and underlying MRs.
> > > * Register to receive MMU notifier events from the MM subsystem upon ODP
> > >   MR registration (and unregister accordingly).
> > > * Add a completion object to synchronize the destruction of ODP umems.
> > > * Add mechanism to abort page faults when there's a concurrent
> > >   invalidation.
> > >
> > > The way we synchronize between concurrent invalidations and page faults
> > > is by keeping a counter of currently running invalidations, and a
> > > sequence number that is incremented whenever an invalidation is caught.
> > > The page fault code checks the counter and also verifies that the
> > > sequence number hasn't progressed before it updates the umem's page
> > > tables. This is similar to what the kvm module does.
> > >
> > > There's currently a rare race in the code when registering a umem in the
> > > middle of an ongoing notifier. The proper fix is to either serialize the
> > > insertion to our umem tree with mm_lock_all or use a ucontext wide
> > > running notifiers count for retries decision. Either is ugly and can
> > > lead to some sort of starvation. The current workaround is ugly as well
> > > - now the user can end up with mapped addresses that are not in the
> > > user's address space (although it is highly unlikely).
> > 
> > I have been trying to wrap my head around this comment. I am totally
> > unfamiliar with RDMA code, but from a quick look at it, when registering a
> > umem you take the mmap_sem in read mode, so any munmap from userspace
> > would be serialized. Really the worst that can happen is a umem pointing
> > to a mmapped file that is concurrently truncated, but even then the
> > address is still valid; it should result in a SIGBUS, which here is
> > obviously harder to report (again, dunno how RDMA works).
> > 
> > So am I missing something?
> > 
> 
> Sadly, taking mmap_sem in read-only mode does not prevent all possible invalidations from happening.
> For example, a call to madvise requesting MADV_DONTNEED will lock the mmap_sem for reading only, allowing a notifier to run in parallel to the MR registration. As a result, the following sequence of events could happen:
> 
> Thread 1:                       |   Thread 2
> --------------------------------+-------------------------
> madvise                         |
> down_read(mmap_sem)             |
> notifier_start                  |
>                                 |   down_read(mmap_sem)
>                                 |   register_mr
> notifier_end                    |
> reduce_mr_notifiers_count       |
> 
> The end result of this sequence is an MR with a running-notifiers count of -1, which is bad.
> The current workaround is to avoid decreasing the notifiers count if it is zero, which can cause other issues.
> The proper fix would be to prevent notifiers from running in parallel to registration. For this, taking mmap_sem in write mode might be sufficient, but we are not sure about this.
> We will be happy to hear additional input on this subject, to make sure we got it covered properly.
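
For reference, the counter/sequence scheme the patch description compares to
kvm works roughly as follows (a minimal sketch; every name in it is made up
for illustration and none of it is the actual patch code):

/*
 * Minimal sketch of the kvm-style counter/sequence scheme.  All names here
 * (odp_sync, notifiers_count, notifiers_seq, odp_fault, ...) are
 * illustrative only.
 */
#include <linux/mutex.h>

struct odp_sync {
	struct mutex lock;
	int notifiers_count;		/* invalidations currently running */
	unsigned long notifiers_seq;	/* bumped on every invalidation */
};

/* Called from the mmu_notifier invalidate_range_start callback. */
static void odp_invalidate_start(struct odp_sync *s)
{
	mutex_lock(&s->lock);
	s->notifiers_count++;
	mutex_unlock(&s->lock);
}

/* Called from the mmu_notifier invalidate_range_end callback. */
static void odp_invalidate_end(struct odp_sync *s)
{
	mutex_lock(&s->lock);
	s->notifiers_seq++;
	/*
	 * In the Thread 1/Thread 2 interleaving above, the MR was registered
	 * after notifier_start, so it never saw the increment and this
	 * decrement underflows to -1.
	 */
	s->notifiers_count--;
	mutex_unlock(&s->lock);
}

/* Page-fault path: retry if an invalidation ran while we set up the pages. */
static int odp_fault(struct odp_sync *s, unsigned long addr)
{
	unsigned long seq;

retry:
	mutex_lock(&s->lock);
	seq = s->notifiers_seq;
	mutex_unlock(&s->lock);

	/* ... get_user_pages() for addr, prepare the new mapping ... */

	mutex_lock(&s->lock);
	if (s->notifiers_count || s->notifiers_seq != seq) {
		mutex_unlock(&s->lock);
		goto retry;		/* raced with an invalidation */
	}
	/* ... update the umem's page tables ... */
	mutex_unlock(&s->lock);
	return 0;
}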

So in HMM I solve this by having a struct allocated in the range_start
callback; the range_end callback just ignores things when it cannot find the
matching struct.
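
Roughly like this (a sketch with made-up names, not the actual HMM code):

/*
 * range_start allocates and publishes a small tracking struct; range_end
 * acts only if it finds the matching struct.  An "end" that never had a
 * corresponding "start" is simply ignored instead of driving a counter
 * negative.
 */
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct active_invalidation {
	struct list_head list;
	unsigned long start, end;
};

static LIST_HEAD(active_invalidations);
static DEFINE_SPINLOCK(active_lock);

static void demo_range_start(unsigned long start, unsigned long end)
{
	struct active_invalidation *inv = kmalloc(sizeof(*inv), GFP_KERNEL);

	if (!inv)
		return;			/* real code must handle failure */
	inv->start = start;
	inv->end = end;
	spin_lock(&active_lock);
	list_add(&inv->list, &active_invalidations);
	spin_unlock(&active_lock);
}

static void demo_range_end(unsigned long start, unsigned long end)
{
	struct active_invalidation *inv, *found = NULL;

	spin_lock(&active_lock);
	list_for_each_entry(inv, &active_invalidations, list) {
		if (inv->start == start && inv->end == end) {
			found = inv;
			list_del(&inv->list);
			break;
		}
	}
	spin_unlock(&active_lock);

	if (!found)
		return;			/* no matching start: ignore */

	/* ... do the end-of-invalidation work ... */
	kfree(found);
}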

That being said, when registering the mmu_notifier you need two things.
First, you need a pin on the mm (either the mm is current, i.e. current->mm,
or you took a reference on it). Second, you need to take the mmap semaphore
in write mode so that no concurrent mmap/munmap/madvise can happen. By doing
that you protect yourself from the concurrent range_start/range_end pairs
that could otherwise happen, and those are the ones that matter. The only
concurrent range_start/range_end that can still happen is through file
invalidation, which is fine because a subsequent page fault will go through
the file layer and bring the page back or return an error (if the file was
truncated, for instance).

So as long as you hold the mmap_sem in write mode you should not worry about
concurrent range_start/range_end (well, they might still happen, but only for
file-backed vmas).
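
On the registration path that would look roughly like this (again a sketch;
the structs and the insertion helper are hypothetical, not the names used in
the patch):

/*
 * Take mmap_sem in write mode around the insertion of the umem into the
 * per-ucontext interval tree, so that no mmap/munmap/madvise-driven
 * range_start/range_end pair can run concurrently with the registration.
 */
#include <linux/mm.h>
#include <linux/rbtree.h>
#include <linux/sched.h>

struct odp_umem {				/* hypothetical */
	unsigned long start, length;
};

struct odp_ucontext {				/* hypothetical */
	struct rb_root umem_tree;
};

/* hypothetical stand-in for the interval tree insertion */
void odp_interval_tree_insert(struct odp_umem *umem, struct rb_root *root);

static int odp_register_umem(struct odp_ucontext *ctx, struct odp_umem *umem)
{
	struct mm_struct *mm = current->mm;	/* pinned: it is our own mm */

	down_write(&mm->mmap_sem);
	odp_interval_tree_insert(umem, &ctx->umem_tree);
	up_write(&mm->mmap_sem);

	return 0;
}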

Given that you face the same issue as I have with range_start/range_end, I
will stitch up a patch to make it easier to track those.

Cheers,
Jérôme


> 
> Thanks,
> --Shachar



