RE: [PATCH v1 for-next 06/16] IB/core: Implement support for MMU notifiers regarding on demand paging regions

> -----Original Message-----
> From: Jerome Glisse [mailto:j.glisse@xxxxxxxxx]
> Sent: Wednesday, September 10, 2014 11:15 PM
> To: Shachar Raindel
> Cc: Haggai Eran; linux-rdma@xxxxxxxxxxxxxxx; Sagi Grimberg
> Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support for
> MMU notifiers regarding on demand paging regions
> 
> On Wed, Sep 10, 2014 at 09:00:36AM +0000, Shachar Raindel wrote:
> >
> >
> > > -----Original Message-----
> > > From: Jerome Glisse [mailto:j.glisse@xxxxxxxxx]
> > > Sent: Tuesday, September 09, 2014 6:37 PM
> > > To: Shachar Raindel
> > > Cc: 1404377069-20585-1-git-send-email-haggaie@xxxxxxxxxxxx; Haggai Eran;
> > > linux-rdma@xxxxxxxxxxxxxxx; Jerome Glisse; Sagi Grimberg
> > > Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support for
> > > MMU notifiers regarding on demand paging regions
> > > On Sun, Sep 07, 2014 at 02:35:59PM +0000, Shachar Raindel wrote:
> > > > Hi,
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerome Glisse [mailto:j.glisse@xxxxxxxxx]
> > > > > Sent: Thursday, September 04, 2014 11:25 PM
> > > > > To: Haggai Eran; linux-rdma@xxxxxxxxxxxxxxx
> > > > > Cc: Shachar Raindel; Sagi Grimberg
> > > > > Subject: Re: [PATCH v1 for-next 06/16] IB/core: Implement support for
> > > > > MMU notifiers regarding on demand paging regions
> > > > >

<SNIP>

> > > >
> > > > Sadly, taking mmap_sem in read-only mode does not prevent all
> > > > possible invalidations from happening. For example, a call to
> > > > madvise requesting MADV_DONTNEED will lock the mmap_sem for
> > > > reading only, allowing a notifier to run in parallel to the MR
> > > > registration. As a result, the following sequence of events could
> > > > happen:
> > > >
> > > > Thread 1:                       |   Thread 2
> > > > --------------------------------+-------------------------
> > > > madvise                         |
> > > > down_read(mmap_sem)             |
> > > > notifier_start                  |
> > > >                                 |   down_read(mmap_sem)
> > > >                                 |   register_mr
> > > > notifier_end                    |
> > > > reduce_mr_notifiers_count       |
> > > >
> > > > The end result of this sequence is an MR with a running notifiers
> > > > count of -1, which is bad. The current workaround is to avoid
> > > > decreasing the notifiers count if it is zero, which can cause
> > > > other issues. The proper fix would be to prevent notifiers from
> > > > running in parallel to registration. For this, taking mmap_sem in
> > > > write mode might be sufficient, but we are not sure about this.
> > > > We will be happy to hear additional input on this subject, to make
> > > > sure we got it covered properly.
> > >
> > > So in HMM i solve this by having a struct allocated in the start
> > > range callback, and the end range callback just ignores things
> > > when it cannot find the matching struct.
> >
> > This kind of mechanism sounds like it has a bigger risk of
> > deadlocking the system, causing an OOM kill without a real need, or
> > significantly slowing down the system.
> > If you are doing non-atomic memory allocations, you can deadlock the
> > system by requesting memory in the swapper flow.
> > Even if you are doing atomic memory allocations, you need to handle
> > the case of a failing allocation, the solution to which is unclear
> > to me.
> > If you are using a pre-allocated pool, what are you doing when you
> > run out of available entries in the pool? If you are blocking until
> > some entries free up, what guarantees that this will not cause a
> > deadlock?
> 
> So i am using a fixed pool, and when it runs out it blocks in the
> start callback until one is freed.

This sounds scary. You are now creating a possible locking dependency
between two code flows which could previously run in parallel. This can
cause circular locking bugs in code which functioned properly until now.
For example, assume code with a single lock, and the following code
paths:

Code 1:
notify_start()
lock()
unlock()
notify_end()

Code 2:
lock()
notify_start()
... (no locking)
notify_end()
unlock()



This code can now create the following deadlock:

Thread 1:        | Thread 2:
-----------------+-----------------------------------
notify_start()   |
                 | lock()
lock() - blocking|
                 | notify_start() - blocking for slot




> But as i said i have a patch to use the stack that will solve this
> and avoid a pool.

How are you allocating from the stack an entry which you need to keep
alive until another function is called? You can't allocate the entry on
the notify_start stack, so you must do this at all of the call sites of
the mmu_notifiers. Given the notifier listener subscription pattern,
this seems like something which is not practical.

 
> 
> >
> > >
> > > That being said, when registering the mmu_notifier you need 2
> > > things. First, you need a pin on the mm (either the mm is current,
> > > ie current->mm, or you took a reference on it). Second, you need
> > > to take the mmap semaphore in write mode so that no concurrent
> > > mmap/munmap/madvise can happen. By doing that you protect yourself
> > > from concurrent range_start/range_end that can happen and that
> > > does matter. The only concurrent range_start/end that can happen
> > > is through file invalidation, which is fine because a subsequent
> > > page fault will go through the file layer and bring back the page
> > > or return an error (if the file was truncated, for instance).
> >
> > Sadly, this is not sufficient for our use case. We are registering
> > a single MMU notifier handler, and broadcast the notifications to
> > all relevant listeners, which are stored in an interval tree.
> >
> > Each listener represents a section of the address space that has
> > been exposed to the network. Such an implementation allows us to
> > limit the impact of invalidations, and only block racing page
> > faults to the affected areas.
> >
> > Each of the listeners maintains a counter of the number of
> > invalidate_range notifications that are currently affecting it. The
> > counter is increased for each invalidate_range_start callback
> > received, and decreased for each invalidate_range_end callback
> > received. If we add a listener to the interval tree after the
> > invalidate_range_start callback happened, but before the
> > invalidate_range_end callback happened, it will decrease the
> > counter, reaching negative numbers and breaking the logic.
> >
> > The mmu_notifiers registration code avoids such issues by taking
> > all relevant locks on the MM. This effectively blocks all possible
> > notifiers from happening while registering a new notifier. Sadly,
> > this function is not exported for modules to use.
> >
> > Our options at the moment are:
> > - Use a tracking mechanism similar to what HMM uses, alongside the
> >   challenges involved in allocating memory from notifiers
> >
> > - Use a per-process counter for invalidations, causing a possible
> >   performance degradation. This can possibly be used as a fallback
> >   to the first option (i.e. have a pool of X notifier identifiers;
> >   once it is full, increase/decrease a per-MM counter)
> >
> > - Export the mm_take_all_locks function for modules. This will
> >   allow us to lock the MM when adding a new listener.
> 
> I was not clear enough: you need to take the mmap_sem in write mode
> across mmu_notifier_register(). This only partially solves your
> issue: if a mmu_notifier is already registered for the mm you are
> registering against, then there is a chance for you to be inside an
> active range_start/range_end section, which would lead to an invalid
> counter inside your tracking structure. But, sadly, taking mmap_sem
> in write mode is not enough, as file invalidation might still happen
> concurrently, so you will need to make sure your invalidation
> counters do not go negative. But from the page fault point of view
> you will be fine, because the page fault will synchronize through
> the pagecache. So the scenario (A and B are two overlapping
> anonymous address ranges):
> 
>   APP_TOTO_RDMA_THREAD           |  APP_TOTO_SOME_OTHER_THREAD
>                                  |  mmu_notifier_invalidate_range_start(A)
>   odp_register()                 |
>     down_read(mmap_sem)          |
>     mmu_notifier_register()      |
>     up_read(mmap_sem)            |
>   odp_add_new_region(B)          |
>   odp_page_fault(B)              |
>     down_read(mmap_sem)          |
>     ...                          |
>     up_read(mmap_sem)            |
>                                  |  mmu_notifier_invalidate_range_end(A)
> 
> The odp_page_fault(B) might see an invalid cpu page table, but you
> have no idea about it because you registered after the range_start().
> But if you take the mmap_sem in write mode, then the only case where
> you might still have this scenario is if A and B are ranges of a
> file backed vma and the file is undergoing some change (most likely
> truncation). But the file case is fine, because the odp_page_fault()
> will go through the pagecache, which is properly synchronized
> against the current range invalidation.

Specifically, if you call mmu_notifier_register you are OK and the
above scenario will not happen. You are supposed to hold mmap_sem for
writing, and mmu_notifier_register calls mm_take_all_locks, which
guarantees no racing notifier during the registration step.

However, we want to dynamically add sub-notifiers in our code. Each
gets notified only about invalidations touching a specific sub-section
of the address space. To avoid delivering unneeded notifications, we
use an interval tree that filters only the needed ones.
When adding entries to the interval tree, we cannot lock the mm to
prevent racing invalidations. As such, we might end up in a case where
a newly registered memory region gets a "notify_end" call without the
relevant "notify_start". Even if we prevent the counter from dropping
below zero, this can still cause data corruption: for example, when
another notifier, due to a munmap, runs after the MR registers, but we
first get the notify_end of the previous notifier for which we never
saw the notify_start.

The solution we are coming up with now is a global counter of running
invalidations, used for newly allocated regions. Once the global
counter drops to zero, we can safely switch to the region's local
invalidation counter.


> 
> 
> Now for the general case outside of mmu_notifier_register(): HMM
> also tracks active invalidation ranges to avoid page faulting into
> those ranges, as we cannot trust the cpu page table for as long as
> the range invalidation is ongoing.
> 
> > >
> > > So as long as you hold the mmap_sem in write mode you should not
> > > worry about concurrent range_start/range_end (well, they might
> > > happen, but only for file backed vmas).
> > >
> >
> > Sadly, the mmap_sem is not enough to protect us :(.
> 
> This is enough, as i explained above, but i am only talking about the
> mmu notifier registration. For the general case, once you register,
> you only need to take the mmap_sem in read mode during page fault.
> 

I think we are not broadcasting on the same wavelength here. The issue
I'm worried about is adding a sub-area to our tracking system. It is
built quite differently from HMM: we define the areas to track
a-priori, and later on account for how many notifiers are blocking
page faults for each area. You keep track of the active notifiers, and
check each page fault against your notifier list. This difference
makes for different locking needs.

> > > Given that you face the same issue as i have with
> > > range_start/range_end, i will stitch up a patch to make it easier
> > > to track those.
> > >
> >
> > That would be nice, especially if we could easily integrate it into
> > our code and reduce the code size.
> 
> Yes, it's a "small modification" to the mmu_notifier api. i have been
> side tracked on other things, but i will have it soon.
> 

Being side tracked is a well-known professional risk in our line of work ;)


> >
> > > Cheers,
> > > Jérôme
> > >
> > >



