Hi Laurent, On Wed, May 06, 2020 at 10:24:48AM +0200, Laurent Vivier wrote: > On 06/05/2020 07:12, Paul Mackerras wrote: > > On Tue, Apr 28, 2020 at 05:57:59PM +0200, Laurent Vivier wrote: > >> On 12/12/2018 05:17, Paul Mackerras wrote: > >>> + if (change == KVM_MR_FLAGS_ONLY && kvm_is_radix(kvm) && > >>> + ((new->flags ^ old->flags) & KVM_MEM_LOG_DIRTY_PAGES)) > >>> + kvmppc_radix_flush_memslot(kvm, old); > >>> } > >> > >> Hi, > >> > >> this part introduces a regression in QEMU test suite. > >> > >> We have the following warning in the host kernel log: > >> > > [snip] > >> > >> The problem is detected with the "migration-test" test program in qemu, > >> on a POWER9 host in radix mode with THP. It happens only the first time > >> the test program is run after loading kvm_hv. The warning is an explicit > >> detection [1] of a problem: > > > > Yes, we found a valid PTE where we didn't expect there to be one. I have now managed to reproduce the problem locally, and I have an explanation for what's going on. QEMU turns on dirty tracking for the memslot for the VM's RAM, and KVM flushes the mappings from the partition-scoped page table, and then proceeds to fill it up with 64k page mappings due to page faults as the VM runs. Then QEMU turns dirty tracking off, while the VM is still running. The new memslot, with the dirty tracking bit clear, becomes visible to page faults before we get to the kvmppc_core_commit_memory_region_hv() function. Thus, page faults occurring during the interval between the calls to install_new_memslots() and kvm_arch_commit_memory_region() will decide to install a 2MB page if there is a THP on the host side. This will hit the case in kvmppc_create_pte() where it wants to install a 2MB leaf PTE but there is a page table pointer there already. It calls kvmppc_unmap_free_pmd_entry_table(), which calls kvmppc_unmap_free_pte(), which finds the existing 64k PTEs and generates the warning. Now, the code in kvmppc_unmap_free_pte() does the right thing, in that it calls kvmppc_unmap_pte to deal with the PTE it found. The warning was just an attempt to catch code bugs rather than an indication of any immediate and obvious problem. Since we now know of one situation where this can legitimately happen, I think we should just take out the WARN_ON_ONCE, along with the scary comment. If people want to be more paranoid than that, we could add a check that the existing PTEs all map sub-pages of the 2MB page that we're about to map. There is another race which is possible, which is that when turning on dirty page tracking, a parallel page fault reads the memslot flags before the new memslots are installed, and inserts its PTE after kvmppc_core_commit_memory_region_hv has flushed the memslot. In that case, we would have a 2MB PTE left over, which would result in coarser dirty tracking granularity for that 2MB page. I think this would be very hard to hit in practice, and that having coarser dirty tracking granularity for one 2MB page of the VM would not cause any practical problem. That race could be closed by bumping the kvm->mmu_notifier_seq while the kvm->mmu_lock is held in kvmppc_radix_flush_memslot(). Thoughts? Paul.