Re: [PATCH v1 for-next 06/16] IB/core: Implement support for MMU notifiers regarding on demand paging regions

On Wed, Sep 10, 2014 at 09:00:36AM +0000, Shachar Raindel wrote:
> 
> 
> > -----Original Message-----
> > From: Jerome Glisse [mailto:j.glisse@xxxxxxxxx]
> > Sent: Tuesday, September 09, 2014 6:37 PM
> > To: Shachar Raindel
> > Cc: 1404377069-20585-1-git-send-email-haggaie@xxxxxxxxxxxx; Haggai Eran;
> > linux-rdma@xxxxxxxxxxxxxxx; Jerome Glisse; Sagi Grimberg
> > Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support for
> > MMU notifiers regarding on demand paging regions
> > 
> > On Sun, Sep 07, 2014 at 02:35:59PM +0000, Shachar Raindel wrote:
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: Jerome Glisse [mailto:j.glisse@xxxxxxxxx]
> > > > Sent: Thursday, September 04, 2014 11:25 PM
> > > > To: Haggai Eran; linux-rdma@xxxxxxxxxxxxxxx
> > > > Cc: Shachar Raindel; Sagi Grimberg
> > > > Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support
> > > >  for MMU notifiers regarding on demand paging regions
> > > >
> > > > > * Add an interval tree implementation for ODP umems. Create an
> > > > >   interval tree for each ucontext (including a count of the number
> > > > >   of ODP MRs in this context, mutex, etc.), and register ODP umems
> > > > >   in the interval tree.
> > > > > * Add MMU notifiers handling functions, using the interval tree to
> > > > >   notify only the relevant umems and underlying MRs.
> > > > > * Register to receive MMU notifier events from the MM subsystem
> > > > >   upon ODP MR registration (and unregister accordingly).
> > > > > * Add a completion object to synchronize the destruction of ODP
> > > > >   umems.
> > > > > * Add a mechanism to abort page faults when there's a concurrent
> > > > >   invalidation.
> > > > >
> > > > > The way we synchronize between concurrent invalidations and page
> > > > > faults is by keeping a counter of currently running invalidations,
> > > > > and a sequence number that is incremented whenever an invalidation
> > > > > is caught. The page fault code checks the counter and also
> > > > > verifies that the sequence number hasn't progressed before it
> > > > > updates the umem's page tables. This is similar to what the kvm
> > > > > module does.
> > > > >
> > > > > There's currently a rare race in the code when registering a umem
> > > > > in the middle of an ongoing notifier. The proper fix is to either
> > > > > serialize the insertion to our umem tree with mm_lock_all or use a
> > > > > ucontext-wide running notifiers count for the retry decision.
> > > > > Either is ugly and can lead to some sort of starvation. The
> > > > > current workaround is ugly as well - now the user can end up with
> > > > > mapped addresses that are not in the user's address space
> > > > > (although it is highly unlikely).
> > > >
> > > > I have been trying to wrap my head around this comment. I am
> > > > totally unfamiliar with RDMA code, but from a quick look at it,
> > > > when registering a umem you take the mmap_sem in read mode, so any
> > > > munmap from userspace would be serialized. Really, the worst that
> > > > can happen is that a umem points to a mmapped file that is
> > > > concurrently truncated, but even then the address is still valid;
> > > > it should result in a SIGBUS, which here is obviously harder to
> > > > report (again, I don't know how RDMA works).
> > > >
> > > > So am I missing something?
> > > >
> > >
> > > Sadly, taking mmap_sem in read-only mode does not prevent all
> > > possible invalidations from happening. For example, a call to
> > > madvise requesting MADV_DONTNEED will lock the mmap_sem for reading
> > > only, allowing a notifier to run in parallel to the MR registration.
> > > As a result, the following sequence of events could happen:
> > >
> > > Thread 1:                       |   Thread 2
> > > --------------------------------+-------------------------
> > > madvise                         |
> > > down_read(mmap_sem)             |
> > > notifier_start                  |
> > >                                 |   down_read(mmap_sem)
> > >                                 |   register_mr
> > > notifier_end                    |
> > > reduce_mr_notifiers_count       |
> > >
> > > The end result of this sequence is an MR with a running notifiers
> > > count of -1, which is bad. The current workaround is to avoid
> > > decreasing the notifiers count if it is zero, which can cause other
> > > issues. The proper fix would be to prevent notifiers from running in
> > > parallel to registration. For this, taking mmap_sem in write mode
> > > might be sufficient, but we are not sure about it. We will be happy
> > > to hear additional input on this subject, to make sure we got it
> > > covered properly.
> > 
> > So in HMM I solve this by having a struct allocated in the start range
> > callback, and the end range callback just ignores things when it
> > cannot find the matching struct.
> 
> This kind of mechanism sounds like it has a bigger risk of deadlocking
> the system, causing an OOM kill without a real need, or significantly
> slowing down the system.
> If you are doing non-atomic memory allocations, you can deadlock the
> system by requesting memory in the swapper flow.
> Even if you are doing atomic memory allocations, you need to handle the
> case of a failing allocation, the solution to which is unclear to me.
> If you are using a pre-allocated pool, what are you doing when you run
> out of available entries in the pool? If you are blocking until some
> entries free up, what guarantees that this will not cause a deadlock?

So I am using a fixed pool, and when it runs out it blocks in the start
callback until an entry is freed. But as I said, I have a patch to use
the stack instead, which will solve this and avoid a pool.
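
For reference, something like this minimal sketch (hypothetical names,
not the actual HMM code) is what I mean by a fixed pool that blocks in
the start callback:

  /* Fixed pool of range-tracking entries. pool_get() is called from
   * invalidate_range_start() and blocks until pool_put(), called from
   * invalidate_range_end(), frees an entry. */
  #include <linux/list.h>
  #include <linux/spinlock.h>
  #include <linux/wait.h>

  struct range_entry {
          struct list_head        list;
          unsigned long           start, end;
  };

  #define POOL_SIZE 64
  static struct range_entry pool_entries[POOL_SIZE];
  static LIST_HEAD(free_pool);
  static DEFINE_SPINLOCK(pool_lock);
  static DECLARE_WAIT_QUEUE_HEAD(pool_wait);

  static void pool_init(void)
  {
          int i;

          for (i = 0; i < POOL_SIZE; i++)
                  list_add(&pool_entries[i].list, &free_pool);
  }

  static struct range_entry *pool_get(void)
  {
          struct range_entry *e;

          spin_lock(&pool_lock);
          while (list_empty(&free_pool)) {
                  spin_unlock(&pool_lock);
                  /* Block until a range_end returns an entry. */
                  wait_event(pool_wait, !list_empty(&free_pool));
                  spin_lock(&pool_lock);
          }
          e = list_first_entry(&free_pool, struct range_entry, list);
          list_del(&e->list);
          spin_unlock(&pool_lock);
          return e;
  }

  static void pool_put(struct range_entry *e)
  {
          spin_lock(&pool_lock);
          list_add(&e->list, &free_pool);
          spin_unlock(&pool_lock);
          wake_up(&pool_wait);
  }

As you point out, the blocking in pool_get() is where the deadlock worry
comes from, which is why moving the struct to the caller's stack is the
attractive fix.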

> 
> > 
> > That being said, when registering the mmu_notifier you need two
> > things. First, you need a pin on the mm (either the mm is current,
> > i.e. current->mm, or you took a reference on it). Second, you need to
> > take the mmap semaphore in write mode so that no concurrent
> > mmap/munmap/madvise can happen. By doing that you protect yourself
> > from the concurrent range_start/range_end calls that can happen and
> > that do matter. The only concurrent range_start/end that can happen
> > is through file invalidation, which is fine because a subsequent page
> > fault will go through the file layer and bring back the page or
> > return an error (if the file was truncated, for instance).
> 
> Sadly, this is not sufficient for our use case. We register a single
> MMU notifier handler, and broadcast the notifications to all relevant
> listeners, which are stored in an interval tree.
>
> Each listener represents a section of the address space that has been
> exposed to the network. Such an implementation allows us to limit the
> impact of invalidations, and only block racing page faults to the
> affected areas.
>
> Each of the listeners maintains a counter of the number of
> invalidate_range notifications that are currently affecting it. The
> counter is increased for each invalidate_range_start callback received,
> and decreased for each invalidate_range_end callback received. If we
> add a listener to the interval tree after the invalidate_range_start
> callback happened, but before the invalidate_range_end callback
> happened, it will decrease the counter, reaching negative numbers and
> breaking the logic.
>
> The mmu_notifiers registration code avoids such issues by taking all
> relevant locks on the MM. This effectively blocks all possible
> notifiers from happening when registering a new notifier. Sadly, this
> function is not exported for modules to use.
> 
> Our options at the moment are:
> - Use a tracking mechanism similar to what HMM uses, alongside the
>   challenges involved in allocating memory from notifiers
> 
> - Use a per-process counter for invalidations, causing a possible
>   performance degradation. This can possibly be used as a fallback to the
>   first option (i.e. have a pool of X notifier identifiers, once it is
>   full, increase/decrease a per-MM counter)
> 
> - Export the mm_take_all_locks function for modules. This will allow us
>   to lock the MM when adding a new listener.
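
As an aside, the interval-tree dispatch you describe above would look
roughly like this minimal sketch (hypothetical names, using the kernel's
interval_tree helpers); the range_end side is exactly where the
negative-counter case bites:

  #include <linux/atomic.h>
  #include <linux/interval_tree.h>
  #include <linux/kernel.h>

  struct odp_listener {
          struct interval_tree_node node;   /* [start, last] exposed */
          atomic_t                  notifiers_count;
  };

  static void odp_invalidate_range_start(struct rb_root *root,
                                         unsigned long start,
                                         unsigned long last)
  {
          struct interval_tree_node *n;

          for (n = interval_tree_iter_first(root, start, last); n;
               n = interval_tree_iter_next(n, start, last)) {
                  struct odp_listener *l =
                          container_of(n, struct odp_listener, node);
                  /* Block new page faults on this listener's range. */
                  atomic_inc(&l->notifiers_count);
          }
  }

  static void odp_invalidate_range_end(struct rb_root *root,
                                       unsigned long start,
                                       unsigned long last)
  {
          struct interval_tree_node *n;

          for (n = interval_tree_iter_first(root, start, last); n;
               n = interval_tree_iter_next(n, start, last)) {
                  struct odp_listener *l =
                          container_of(n, struct odp_listener, node);
                  /* A listener inserted between range_start and
                   * range_end is found here but was never incremented:
                   * blindly decrementing would go negative. */
                  atomic_dec(&l->notifiers_count);
          }
  }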

I was not clear enough: you need to take the mmap_sem in write mode
across mmu_notifier_register(). This only partially solves your issue:
if an mmu_notifier is already registered for the mm you are registering
against, then there is a chance for you to be inside an active
range_start/range_end section, which would lead to an invalid counter
inside your tracking structure. But, sadly, taking mmap_sem in write
mode is not enough, as file invalidation might still happen
concurrently, so you will need to make sure your invalidation counters
do not go negative. From the page fault point of view, though, you will
be fine, because the page fault will synchronize through the pagecache.
So the scenario (A and B are two overlapping anonymous address ranges)
is:

  APP_TOTO_RDMA_THREAD           |  APP_TOTO_SOME_OTHER_THREAD
                                 |  mmu_notifier_invalidate_range_start(A)
  odp_register()                 |
    down_read(mmap_sem)          |
    mmu_notifier_register()      |
    up_read(mmap_sem)            |
  odp_add_new_region(B)          |
  odp_page_fault(B)              |
    down_read(mmap_sem)          |
    ...                          |
    up_read(mmap_sem)            |
                                 |  mmu_notifier_invalidate_range_end(A)

The odp_page_fault(B) might see an invalid cpu page table, but you have
no idea about it because you registered after the range_start(). If you
take the mmap_sem in write mode instead, then the only case where you
might still hit this scenario is if A and B are ranges of a file-backed
vma and the file is undergoing some change (most likely truncation).
But the file case is fine, because odp_page_fault() will go through the
pagecache, which is properly synchronized against the current range
invalidation.
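
Concretely, something like this minimal sketch (hypothetical
odp_register_notifier() wrapper; __mmu_notifier_register() is the
variant that expects the caller to hold mmap_sem in write mode):

  #include <linux/mm.h>
  #include <linux/mmu_notifier.h>

  static int odp_register_notifier(struct mmu_notifier *mn,
                                   struct mm_struct *mm)
  {
          int ret;

          /* Write mode excludes concurrent mmap/munmap/madvise, and
           * hence any anonymous-vma range_start/range_end, while we
           * insert ourselves into the notifier list. */
          down_write(&mm->mmap_sem);
          ret = __mmu_notifier_register(mn, mm);
          up_write(&mm->mmap_sem);
          return ret;
  }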


Now, for the general case outside of mmu_notifier_register(), HMM also
tracks active invalidation ranges to avoid page faulting into those
ranges, as we cannot trust the cpu page table for as long as the range
invalidation is ongoing.
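
The check itself is the same pattern your patch description compares to
kvm; a minimal sketch (hypothetical names):

  #include <linux/spinlock.h>

  struct odp_sync {
          spinlock_t      lock;
          int             active_invalidations;
          unsigned long   invalidate_seq;
  };

  static void sync_range_start(struct odp_sync *s)
  {
          spin_lock(&s->lock);
          s->active_invalidations++;
          s->invalidate_seq++;
          spin_unlock(&s->lock);
  }

  static void sync_range_end(struct odp_sync *s)
  {
          spin_lock(&s->lock);
          s->active_invalidations--;
          spin_unlock(&s->lock);
  }

  /* The fault path samples invalidate_seq before faulting pages in,
   * and only installs mappings if nothing ran in between. */
  static bool sync_fault_may_install(struct odp_sync *s,
                                     unsigned long seq_before)
  {
          bool ok;

          spin_lock(&s->lock);
          ok = !s->active_invalidations &&
               s->invalidate_seq == seq_before;
          spin_unlock(&s->lock);
          return ok;      /* if false, release the pages and retry */
  }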

> > 
> > So as long as you hold the mmap_sem in write mode you should not worry
> > about
> > concurrent range_start/range_end (well they might happen but only for
> > file
> > backed vma).
> > 
> 
> Sadly, the mmap_sem is not enough to protect us :(.

This is enough, as I explained above, but I am only talking about the
mmu notifier registration. For the general case, once you are
registered, you only need to take the mmap_sem in read mode during the
page fault.
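
That fault path is just the usual get_user_pages() dance; a minimal
sketch (assuming the current eight-argument get_user_pages() signature):

  #include <linux/mm.h>
  #include <linux/sched.h>

  static int odp_fault_one(struct mm_struct *mm, unsigned long addr,
                           int write, struct page **page)
  {
          int npages;

          down_read(&mm->mmap_sem);
          npages = get_user_pages(current, mm, addr, 1, write, 0,
                                  page, NULL);
          up_read(&mm->mmap_sem);

          /* Re-check the invalidation counters before installing the
           * mapping; drop the page and retry if they moved. */
          return npages == 1 ? 0 : -EFAULT;
  }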

> > Given that you face the same issue as I have with
> > range_start/range_end, I will stitch up a patch to make it easier to
> > track those.
> > 
> 
> That would be nice, especially if we could easily integrate it into our
> code and reduce the code size.

Yes, it is a "small modification" to the mmu_notifier API; I have been
sidetracked on other things, but I will have it soon.

> 
> > Cheers,
> > Jérôme
> > 
> > 