On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> On 19/04/21 17:09, Sean Christopherson wrote:
> > > - this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
> > > own interval-tree-based filter is also using a similar mechanism that is
> > > likewise not fair, so it should be okay.
> >
> > The one concern I had with an unfair mechanism of this nature is that, in theory,
> > the memslot update could be blocked indefinitely.
>
> Yep, that's why I mentioned it.
>
> > > @@ -1333,9 +1351,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
> > >  	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
> > >  	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
> > > -	down_write(&kvm->mmu_notifier_slots_lock);
> > > +	/*
> > > +	 * This cannot be an rwsem because the MMU notifier must not run
> > > +	 * inside the critical section.  A sleeping rwsem cannot exclude
> > > +	 * that.
> >
> > How on earth did you decipher that from the splat?  I stared at it for a good
> > five minutes and was completely befuddled.
>
> Just scratch that, it makes no sense.  It's much simpler, but you have
> to look at include/linux/mmu_notifier.h to figure it out:

LOL, glad you could figure it out; I wasn't getting anywhere, mmu_notifier.h or
not.

> 	invalidate_range_start
> 	  take pseudo lock
> 	  down_read()           (*)
> 	  release pseudo lock
> 	invalidate_range_end
> 	  take pseudo lock      (**)
> 	  up_read()
> 	  release pseudo lock
>
> At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
> at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.
>
> This could cause a deadlock (ignoring for a second that the pseudo lock
> is not a lock):
>
> - invalidate_range_start waits on down_read(), because the rwsem is
> held by install_new_memslots
>
> - install_new_memslots waits on down_write(), because the rwsem is
> held till (another) invalidate_range_end finishes
>
> - invalidate_range_end waits on the pseudo lock, which is held by
> invalidate_range_start
>
> Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
> it would change the *shared* rwsem readers into *shared recursive*
> readers).  This also means that there's no need for a raw spinlock.

Ahh, thanks, this finally made things click.

> Given this simple explanation, I think it's okay to include this

LOL, "simple".

> patch in the merge window pull request, with the fix after my
> signature squashed in.  The fix actually undoes a lot of the
> changes to __kvm_handle_hva_range that this patch made, so the
> result is relatively simple.  You can already find the result
> in kvm/queue.

...
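(Aside, mostly for the archives: the "pseudo lock" above is the lockdep-only
map that include/linux/mmu_notifier.h wraps around the start/end callbacks.
Heavily trimmed and paraphrased from memory, i.e. a sketch of the shape rather
than the literal header, it's roughly:

	static inline void
	mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
	{
		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);	/* take pseudo lock */
		/* ... invokes ->invalidate_range_start(), where KVM does down_read()  (*) ... */
		lock_map_release(&__mmu_notifier_invalidate_range_start_map);	/* release pseudo lock */
	}

	static inline void
	mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
	{
		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);	/* take pseudo lock  (**) */
		/* ... invokes ->invalidate_range_end(), where KVM does up_read() ... */
		lock_map_release(&__mmu_notifier_invalidate_range_start_map);	/* release pseudo lock */
	}

So lockdep sees map -> rwsem at (*) and rwsem -> map at (**), and a fair,
queued down_write() in install_new_memslots() is what turns that into the
real three-way wait described above.)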
>  static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  						   const struct kvm_hva_range *range)
>  {
> @@ -515,10 +495,6 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  	idx = srcu_read_lock(&kvm->srcu);
> 
> -	if (range->must_lock &&
> -	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> -		goto out_unlock;
> -
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>  		slots = __kvm_memslots(kvm, i);
>  		kvm_for_each_memslot(slot, slots) {
> @@ -547,8 +523,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
>  			gfn_range.slot = slot;
> 
> -			if (kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> -				goto out_unlock;
> +			if (!locked) {
> +				locked = true;
> +				KVM_MMU_LOCK(kvm);
> +				if (!IS_KVM_NULL_FN(range->on_lock))
> +					range->on_lock(kvm, range->start, range->end);
> +				if (IS_KVM_NULL_FN(range->handler))
> +					break;

This can/should be "goto out_unlock"; "break" only takes us out of the memslot
walk, whereas we want to get out of the address space loop.  Not a functional
problem, but we might walk all SMM memslots unnecessarily.  (See the sketch
further down for what I have in mind.)

> +			}
> 
>  			ret |= range->handler(kvm, &gfn_range);
>  		}
> @@ -557,7 +539,6 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
>  		kvm_flush_remote_tlbs(kvm);
> 
> -out_unlock:
>  	if (locked)
>  		KVM_MMU_UNLOCK(kvm);
> 
> @@ -580,7 +561,6 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
>  		.pte		= pte,
>  		.handler	= handler,
>  		.on_lock	= (void *)kvm_null_fn,
> -		.must_lock	= false,
>  		.flush_on_ret	= true,
>  		.may_block	= false,
>  	};
> @@ -600,7 +580,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
>  		.pte		= __pte(0),
>  		.handler	= handler,
>  		.on_lock	= (void *)kvm_null_fn,
> -		.must_lock	= false,
>  		.flush_on_ret	= false,
>  		.may_block	= false,
>  	};
> @@ -620,13 +599,11 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	 * .change_pte() must be surrounded by .invalidate_range_{start,end}(),

While you're squashing, want to change the above comma to a period?

>  	 * If mmu_notifier_count is zero, then start() didn't find a relevant
>  	 * memslot and wasn't forced down the slow path; rechecking here is
> -	 * unnecessary.  This can only occur if memslot updates are blocked;
> -	 * otherwise, mmu_notifier_count is incremented unconditionally.
> +	 * unnecessary.
>  	 */
> -	if (!kvm->mmu_notifier_count) {
> -		lockdep_assert_held(&kvm->mmu_notifier_slots_lock);
> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> +	if (!kvm->mmu_notifier_count)
>  		return;
> -	}
> 
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }

...

> @@ -1333,9 +1315,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
>  	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
>  	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
> 
> -	down_write(&kvm->mmu_notifier_slots_lock);
> +	/*
> +	 * This cannot be an rwsem because the MMU notifier must not run
> +	 * inside the critical section, which cannot be excluded with a
> +	 * sleeping rwsem.

Any objection to replacing this comment with a rephrased version of your
statement about "shared" vs. "shared recursive" readers and breaking the
fairness cycle?  IIUC, it's not "running inside the critical section" that's
problematic, it's that sleeping in down_write() can cause a deadlock due to
blocking future readers.

Thanks much!
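Oh, and to be concrete about the "goto out_unlock" comment above, I'm thinking
something along these lines (untested, purely to illustrate the flow; the
out_unlock label from the pre-squash version would need to come back):

			if (!locked) {
				locked = true;
				KVM_MMU_LOCK(kvm);
				if (!IS_KVM_NULL_FN(range->on_lock))
					range->on_lock(kvm, range->start, range->end);
				/* Bail from the address space loop, not just the memslot walk. */
				if (IS_KVM_NULL_FN(range->handler))
					goto out_unlock;
			}

			ret |= range->handler(kvm, &gfn_range);
		}
	}

	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
		kvm_flush_remote_tlbs(kvm);

out_unlock:
	if (locked)
		KVM_MMU_UNLOCK(kvm);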
> +	 */
> +	spin_lock(&kvm->mn_invalidate_lock);
> +	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
> +	while (kvm->mn_active_invalidate_count) {
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		spin_unlock(&kvm->mn_invalidate_lock);
> +		schedule();
> +		spin_lock(&kvm->mn_invalidate_lock);
> +	}
> +	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
>  	rcu_assign_pointer(kvm->memslots[as_id], slots);
> -	up_write(&kvm->mmu_notifier_slots_lock);
> +	spin_unlock(&kvm->mn_invalidate_lock);
> 
>  	synchronize_srcu_expedited(&kvm->srcu);
> -- 
> 2.26.2
> 
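P.S. For anyone reading along without kvm/queue handy: IIUC, the notifier side
that pairs with the wait loop above looks roughly like the below, i.e.
invalidate_range_start() elevates mn_active_invalidate_count and
invalidate_range_end() drops it and wakes the (sole) waiter.  This is pieced
together from the diff and the discussion, not copied from kvm/queue, so the
details may differ.

	static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
						const struct mmu_notifier_range *range)
	{
		struct kvm *kvm = mmu_notifier_to_kvm(mn);

		/*
		 * Elevate the count before doing anything that
		 * install_new_memslots() must not race with.
		 */
		spin_lock(&kvm->mn_invalidate_lock);
		kvm->mn_active_invalidate_count++;
		spin_unlock(&kvm->mn_invalidate_lock);

		/* ... __kvm_handle_hva_range(), etc. ... */
		return 0;
	}

	static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
						const struct mmu_notifier_range *range)
	{
		struct kvm *kvm = mmu_notifier_to_kvm(mn);
		bool wake;

		/* ... __kvm_handle_hva_range(), etc. ... */

		spin_lock(&kvm->mn_invalidate_lock);
		wake = (--kvm->mn_active_invalidate_count == 0);
		spin_unlock(&kvm->mn_invalidate_lock);

		/*
		 * There is only ever one waiter (install_new_memslots()),
		 * hence an rcuwait instead of a full waitqueue.
		 */
		if (wake)
			rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
	}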