The kvm mmu_notifier_seq is used to signal to code which is constructing a
dependent page mapping that the underlying mapping has changed. For
example, when constructing a mapping for a nested guest it is used to
detect that the guest mapping has changed, which would render the nested
guest mapping invalid.

When running nested guests it is important that the rc (referenced and
changed) bits are kept in sync between the two ptes in which they exist on
the host: the pte for the guest and the pte for the nested guest. This is
done when inserting the nested pte in __kvmhv_nested_page_fault_radix() by
restricting the rc bits set in the nested pte to those already set in the
guest pte. When setting the bits in the nested pte in response to an
interrupt in kvmhv_handle_nested_set_rc_radix(), the same bits are also
set in the guest pte, and the bits are left clear in the nested pte if
setting them in the guest pte fails. When the host wants to clear rc bits
in kvm_radix_test_clear_dirty(), it first removes them from the guest pte
and then from any corresponding nested ptes which map the same guest page.

This means there is a window in which the rc bits can get out of sync
between the two ptes: the bits might be observed as set in the guest pte
and set in the nested pte on that basis, while the host is in the process
of clearing them, leading to an inconsistency.

In kvm_radix_test_clear_dirty() the mmu_lock spin lock is held across
clearing the rc bits in the guest and nested ptes, and the same is done
across setting the rc bits in the guest and nested ptes in
kvmhv_handle_nested_set_rc_radix(), so there is no window for them to get
out of sync in those cases. However, when constructing the nested pte in
__kvmhv_nested_page_fault_radix() the mmu_lock is dropped between reading
the guest pte and inserting the nested pte, which opens a window for them
to get out of sync: the rc bits could be observed as set in the guest pte
and set in the nested pte accordingly, but in the meantime the rc bits in
the guest pte could be cleared, and since the nested pte has not yet been
inserted there is no way for kvm_radix_test_clear_dirty() to clear them
there, so an inconsistency can arise.

To avoid the possibility of the rc bits getting out of sync, increment the
mmu_notifier_seq in kvm_radix_test_clear_dirty() under the mmu_lock when
clearing rc bits. This means that when inserting the nested pte in
__kvmhv_nested_page_fault_radix() we will bail out and retry when we see
that the mmu_seq differs, indicating that the guest pte has changed.

Fixes: ae59a7e1945b ("KVM: PPC: Book3S HV: Keep rc bits in shadow pgtable in sync with host")
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@xxxxxxxxx>
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..310d8dde9a48 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1044,6 +1044,8 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 		kvmhv_update_nest_rmap_rc_list(kvm, rmapp, _PAGE_DIRTY, 0,
 					       old & PTE_RPN_MASK,
 					       1UL << shift);
+		/* Notify anyone trying to map the page that it has changed */
+		kvm->mmu_notifier_seq++;
 		spin_unlock(&kvm->mmu_lock);
 	}
 	return ret;
--
2.13.6
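
For reference, below is a minimal sketch of the retry pattern this fix
relies on, assuming the nested fault path follows the usual KVM
mmu_notifier_seq convention (sample the sequence, drop the lock, re-check
with mmu_notifier_retry() under the lock before inserting the pte). It is
illustrative only and not the actual body of
__kvmhv_nested_page_fault_radix():

	unsigned long mmu_seq;

	/* Sample the notifier sequence before walking the guest pte. */
	mmu_seq = kvm->mmu_notifier_seq;
	smp_rmb();

	/*
	 * ... read the guest pte and work out which rc bits may be set in
	 * the nested pte, with kvm->mmu_lock dropped ...
	 */

	spin_lock(&kvm->mmu_lock);
	if (mmu_notifier_retry(kvm, mmu_seq)) {
		/*
		 * kvm_radix_test_clear_dirty() (or another invalidation)
		 * bumped the sequence, so the rc bits read above may be
		 * stale; bail out and let the fault be taken again.
		 */
		spin_unlock(&kvm->mmu_lock);
		return -EAGAIN;		/* caller retries the fault */
	}
	/* ... safe to insert the nested pte with the rc bits read above ... */
	spin_unlock(&kvm->mmu_lock);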