Re: [PATCH 4/4] KVM: PPC: Book3S HV: Flush guest mappings when turning dirty tracking on/off

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Laurent,

On Wed, May 06, 2020 at 10:24:48AM +0200, Laurent Vivier wrote:
> On 06/05/2020 07:12, Paul Mackerras wrote:
> > On Tue, Apr 28, 2020 at 05:57:59PM +0200, Laurent Vivier wrote:
> >> On 12/12/2018 05:17, Paul Mackerras wrote:
> >>> +	if (change == KVM_MR_FLAGS_ONLY && kvm_is_radix(kvm) &&
> >>> +	    ((new->flags ^ old->flags) & KVM_MEM_LOG_DIRTY_PAGES))
> >>> +		kvmppc_radix_flush_memslot(kvm, old);
> >>>  }
> >>
> >> Hi,
> >>
> >> this part introduces a regression in QEMU test suite.
> >>
> >> We have the following warning in the host kernel log:
> >>
> > [snip]
> >>
> >> The problem is detected with the "migration-test" test program in qemu,
> >> on a POWER9 host in radix mode with THP. It happens only the first time
> >> the test program is run after loading kvm_hv. The warning is an explicit
> >> detection [1] of a problem:
> > 
> > Yes, we found a valid PTE where we didn't expect there to be one.

I have now managed to reproduce the problem locally, and I have an
explanation for what's going on.  QEMU turns on dirty tracking for the
memslot for the VM's RAM, and KVM flushes the mappings from the
partition-scoped page table, and then proceeds to fill it up with 64k
page mappings due to page faults as the VM runs.

Then QEMU turns dirty tracking off, while the VM is still running.
The new memslot, with the dirty tracking bit clear, becomes visible to
page faults before we get to the kvmppc_core_commit_memory_region_hv()
function.  Thus, page faults occurring during the interval between the
calls to install_new_memslots() and kvm_arch_commit_memory_region()
will decide to install a 2MB page if there is a THP on the host side.
This will hit the case in kvmppc_create_pte() where it wants to
install a 2MB leaf PTE but there is a page table pointer there
already.  It calls kvmppc_unmap_free_pmd_entry_table(), which calls
kvmppc_unmap_free_pte(), which finds the existing 64k PTEs and
generates the warning.

Now, the code in kvmppc_unmap_free_pte() does the right thing, in that
it calls kvmppc_unmap_pte to deal with the PTE it found.  The warning
was just an attempt to catch code bugs rather than an indication of
any immediate and obvious problem.  Since we now know of one situation
where this can legitimately happen, I think we should just take out
the WARN_ON_ONCE, along with the scary comment.  If people want to be
more paranoid than that, we could add a check that the existing PTEs
all map sub-pages of the 2MB page that we're about to map.

There is another race which is possible, which is that when turning on
dirty page tracking, a parallel page fault reads the memslot flags
before the new memslots are installed, and inserts its PTE after
kvmppc_core_commit_memory_region_hv has flushed the memslot.  In that
case, we would have a 2MB PTE left over, which would result in coarser
dirty tracking granularity for that 2MB page.  I think this would be
very hard to hit in practice, and that having coarser dirty tracking
granularity for one 2MB page of the VM would not cause any practical
problem.

That race could be closed by bumping the kvm->mmu_notifier_seq while
the kvm->mmu_lock is held in kvmppc_radix_flush_memslot().

Thoughts?

Paul.



[Index of Archives]     [KVM Development]     [KVM ARM]     [KVM ia64]     [Linux Virtualization]     [Linux USB Devel]     [Linux Video]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [Big List of Linux Books]

  Powered by Linux