I haven't had enough time to fully understand the deferred logic in this
change. I spotted one problem; see my comments inline.

On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
>
> Of the 13 users of mmu_notifiers, 8 of them use only
> invalidate_range_start/end() and immediately intersect the
> mmu_notifier_range with some kind of internal list of VAs. 4 use an
> interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
> of some kind (scif_dma, vhost, gntdev, hmm)
>
> And the remaining 5 either don't use invalidate_range_start() or do some
> special thing with it.
>
> It turns out that building a correct scheme with an interval tree is
> pretty complicated, particularly if the use case is synchronizing against
> another thread doing get_user_pages(). Many of these implementations have
> various subtle and difficult to fix races.
>
> This approach puts the interval tree as common code at the top of the mmu
> notifier call tree and implements a shareable locking scheme.
>
> It includes:
>  - An interval tree tracking VA ranges, with per-range callbacks
>  - A read/write locking scheme for the interval tree that avoids
>    sleeping in the notifier path (for OOM killer)
>  - A sequence counter based collision-retry locking scheme to tell
>    device page fault that a VA range is being concurrently invalidated.
>
> This is based on various ideas:
> - hmm accumulates invalidated VA ranges and releases them when all
>   invalidates are done, via active_invalidate_ranges count.
>   This approach avoids having to intersect the interval tree twice (as
>   umem_odp does) at the potential cost of a longer device page fault.
>
> - kvm/umem_odp use a sequence counter to drive the collision retry,
>   via invalidate_seq
>
> - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
>   This makes adding/removing interval tree members more deterministic
>
> - seqlock, except this version makes the seqlock idea multi-holder on the
>   write side by protecting it with active_invalidate_ranges and a spinlock
>
> To minimize MM overhead when only the interval tree is being used, the
> entire SRCU and hlist overheads are dropped using some simple
> branches. Similarly the interval tree overhead is dropped when in hlist
> mode.
>
> The overhead from the mandatory spinlock is broadly the same as most of
> existing users which already had a lock (or two) of some sort on the
> invalidation path.
>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Acked-by: Christian König <christian.koenig@xxxxxxx>
> Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
> ---
>  include/linux/mmu_notifier.h |  98 +++++++
>  mm/Kconfig                   |   1 +
>  mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
>  3 files changed, 607 insertions(+), 25 deletions(-)
>
[snip]
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 367670cfd02b7b..d02d3c8c223eb7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
[snip]
>   * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
>   * can't go away from under us as exit_mmap holds an mm_count pin
>   * itself.
>   */
> -void __mmu_notifier_release(struct mm_struct *mm)
> +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> +			     struct mm_struct *mm)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
>
> +	if (mmn_mm->has_interval)
> +		mn_itree_release(mmn_mm, mm);
> +
> +	if (hlist_empty(&mmn_mm->list))
> +		return;

This seems to duplicate the conditions in __mmu_notifier_release(). See my
comments below; I think one of the two copies is wrong, and I suspect it is
this one, because __mmu_notifier_release() follows the same pattern as the
other notifiers.

> +
>  	/*
>  	 * SRCU here will block mmu_notifier_unregister until
>  	 * ->release returns.
>  	 */
>  	id = srcu_read_lock(&srcu);
> -	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> +	hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
>  		/*
>  		 * If ->release runs before mmu_notifier_unregister it must be
>  		 * handled, as it's the only way for the driver to flush all
> @@ -72,9 +313,9 @@ void __mmu_notifier_release(struct mm_struct *mm)
>  		if (mn->ops->release)
>  			mn->ops->release(mn, mm);
>
> -	spin_lock(&mm->mmu_notifier_mm->lock);
> -	while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> -		mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> +	spin_lock(&mmn_mm->lock);
> +	while (unlikely(!hlist_empty(&mmn_mm->list))) {
> +		mn = hlist_entry(mmn_mm->list.first,
>  				 struct mmu_notifier,
>  				 hlist);
>  		/*
> @@ -85,7 +326,7 @@ void __mmu_notifier_release(struct mm_struct *mm)
>  		 */
>  		hlist_del_init_rcu(&mn->hlist);
>  	}
> -	spin_unlock(&mm->mmu_notifier_mm->lock);
> +	spin_unlock(&mmn_mm->lock);
>  	srcu_read_unlock(&srcu, id);
>
>  	/*
> @@ -100,6 +341,17 @@ void __mmu_notifier_release(struct mm_struct *mm)
>  	synchronize_srcu(&srcu);
>  }
>
> +void __mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
> +
> +	if (mmn_mm->has_interval)
> +		mn_itree_release(mmn_mm, mm);

If mmn_mm->list is not empty, this will be done twice, because
mn_hlist_release() duplicates this check and call.

> +
> +	if (!hlist_empty(&mmn_mm->list))
> +		mn_hlist_release(mmn_mm, mm);

mn_hlist_release() checks the same condition itself.

> +}
> +
>  /*
>   * If no young bitflag is supported by the hardware, ->clear_flush_young can
>   * unmap the address and return 1 or 0 depending if the mapping previously
[snip]
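To make the double call concrete, here is a tiny user-space model of the
control flow in the two hunks above. The struct, the *_model functions and
the hlist_nonempty flag are stand-ins I made up for illustration, not the
real kernel types; only the shape of the checks is taken from the patch:

/*
 * Minimal user-space model of the release control flow in this patch:
 * when has_interval is set and the hlist is non-empty, the interval tree
 * release path runs twice per __mmu_notifier_release() call.
 */
#include <stdbool.h>
#include <stdio.h>

struct mmu_notifier_mm_model {
	bool has_interval;
	bool hlist_nonempty;	/* stands in for !hlist_empty(&mmn_mm->list) */
};

static int itree_release_calls;

static void mn_itree_release_model(struct mmu_notifier_mm_model *mmn_mm)
{
	itree_release_calls++;
}

static void mn_hlist_release_model(struct mmu_notifier_mm_model *mmn_mm)
{
	/* Same two checks as at the top of mn_hlist_release() in the patch. */
	if (mmn_mm->has_interval)
		mn_itree_release_model(mmn_mm);

	if (!mmn_mm->hlist_nonempty)
		return;

	/* The SRCU/hlist teardown would follow here. */
}

static void __mmu_notifier_release_model(struct mmu_notifier_mm_model *mmn_mm)
{
	/* Same two checks again, as in __mmu_notifier_release() in the patch. */
	if (mmn_mm->has_interval)
		mn_itree_release_model(mmn_mm);

	if (mmn_mm->hlist_nonempty)
		mn_hlist_release_model(mmn_mm);
}

int main(void)
{
	struct mmu_notifier_mm_model mmn_mm = {
		.has_interval = true,
		.hlist_nonempty = true,
	};

	__mmu_notifier_release_model(&mmn_mm);
	/* Prints 2, i.e. the interval tree release path runs twice. */
	printf("mn_itree_release() called %d time(s)\n", itree_release_calls);
	return 0;
}

I would have expected the has_interval / hlist_empty checks to live in only
one of the two functions, most likely __mmu_notifier_release(), with
mn_hlist_release() left to handle just the hlist users.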