Many users of the mmu_notifier invalidate_range callbacks maintain
locking/counters/etc on a paired basis and have long expected that
invalidate_range start/end are always paired.

The recent change to add non-blocking notifiers breaks this assumption
when multiple notifiers are present in the list, as an EAGAIN return
from a later notifier causes all earlier notifiers to get their
invalidate_range_end() skipped.

During the development of the non-blocking mode each user was audited
to be sure it can skip its invalidate_range_end() if its start returns
-EAGAIN, so the only place that has a problem is when there are
multiple subscriptions.

Due to the RCU locking we can't reliably generate a subset of the
linked list representing the notifiers already called, and so can't
generate the paired invalidate_range_end() calls.

Rather than design an elaborate fix, for now just fail non-blocking
requests early on if there are multiple subscriptions.

Fixes: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: "Jérôme Glisse" <jglisse@xxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Christoph Hellwig <hch@xxxxxx>
Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
---
 include/linux/mmu_notifier.h |  1 +
 mm/mmu_notifier.c            | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

HCH suggested making the locking common so we don't need an
invalidate_range_end() at all, but that is a longer journey. Here is a
simpler stop-gap for this bug.

What do you think, Michal? I don't have a good way to test this flow ..

This lightly clashes with the other mmu notifier series I just sent, so
it should go to either -rc or hmm.git.

Thanks,
Jason

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b6c004bd9f6ad9..170fa2c65d659c 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -53,6 +53,7 @@ struct mmu_notifier_mm {
 	struct hlist_head list;
 	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
+	bool multiple_subscriptions;
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index b5670620aea0fc..4e56f75c560242 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -171,6 +171,19 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	int ret = 0;
 	int id;
 
+	/*
+	 * If there is more than one notifier subscribed to this mm then we
+	 * cannot support the EAGAIN return. invalidate_range_start/end() must
+	 * always be paired unless start returns -EAGAIN. When we return
+	 * -EAGAIN from here the caller will skip all invalidate_range_end()
+	 * calls. However, if there is more than one notifier then some
+	 * notifiers may have had a successful invalidate_range_start() -
+	 * causing imbalance when the end is skipped.
+	 */
+	if (!mmu_notifier_range_blockable(range) &&
+	    range->mm->mmu_notifier_mm->multiple_subscriptions)
+		return -EAGAIN;
+
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
@@ -274,6 +287,8 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
 	 * thanks to mm_take_all_locks().
 	 */
 	spin_lock(&mm->mmu_notifier_mm->lock);
+	mm->mmu_notifier_mm->multiple_subscriptions =
+		!hlist_empty(&mm->mmu_notifier_mm->list);
 	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
 	spin_unlock(&mm->mmu_notifier_mm->lock);
 
-- 
2.22.0
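
P.S. For context on the pairing assumption in the commit message, here
is a minimal, hypothetical sketch of the kind of subscriber it
describes. None of this is from the patch: struct my_ctx,
my_would_block() and the counter are illustrative names only; it just
shows why a skipped invalidate_range_end() leaves a subscriber's state
imbalanced.

#include <linux/mmu_notifier.h>
#include <linux/atomic.h>

struct my_ctx {
	struct mmu_notifier mn;
	atomic_t active_invalidations;	/* balanced by the start/end pairing */
};

static bool my_would_block(struct my_ctx *ctx)
{
	/* Placeholder for a driver-specific "would this sleep?" check */
	return false;
}

static int my_range_start(struct mmu_notifier *mn,
			  const struct mmu_notifier_range *range)
{
	struct my_ctx *ctx = container_of(mn, struct my_ctx, mn);

	/*
	 * A non-blockable start may bail with -EAGAIN; the core then
	 * skips end(), so nothing may be left half-done here.
	 */
	if (!mmu_notifier_range_blockable(range) && my_would_block(ctx))
		return -EAGAIN;

	atomic_inc(&ctx->active_invalidations);
	return 0;
}

static void my_range_end(struct mmu_notifier *mn,
			 const struct mmu_notifier_range *range)
{
	struct my_ctx *ctx = container_of(mn, struct my_ctx, mn);

	/* Correct only if called exactly once per successful start() */
	atomic_dec(&ctx->active_invalidations);
}

static const struct mmu_notifier_ops my_ops = {
	.invalidate_range_start	= my_range_start,
	.invalidate_range_end	= my_range_end,
};

With two subscriptions on one mm, a later subscriber failing start()
with -EAGAIN would leave this one's counter incremented forever once
end() is skipped, which is exactly the imbalance the stop-gap above
avoids.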