On 12/12/2018 05:17, Paul Mackerras wrote:
> This adds code to flush the partition-scoped page tables for a radix
> guest when dirty tracking is turned on or off for a memslot. Only the
> guest real addresses covered by the memslot are flushed. The reason
> for this is to get rid of any 2M PTEs in the partition-scoped page
> tables that correspond to host transparent huge pages, so that page
> dirtiness is tracked at a system page (4k or 64k) granularity rather
> than a 2M granularity. The page tables are also flushed when turning
> dirty tracking off so that the memslot's address space can be
> repopulated with THPs if possible.
>
> To do this, we add a new function kvmppc_radix_flush_memslot(). Since
> this does what's needed for kvmppc_core_flush_memslot_hv() on a radix
> guest, we now make kvmppc_core_flush_memslot_hv() call the new
> kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix()
> for each page in the memslot. This has the effect of fixing a bug in
> that kvmppc_core_flush_memslot_hv() was previously calling
> kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which
> is required to be held.
>
> Signed-off-by: Paul Mackerras <paulus@xxxxxxxxxx>
> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@xxxxxxxxx>
> Reviewed-by: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx>
> ---
>  arch/powerpc/include/asm/kvm_book3s.h  |  2 ++
>  arch/powerpc/kvm/book3s_64_mmu_hv.c    |  9 +++++----
>  arch/powerpc/kvm/book3s_64_mmu_radix.c | 20 ++++++++++++++++++++
>  arch/powerpc/kvm/book3s_hv.c           | 17 +++++++++++++++++
>  4 files changed, 44 insertions(+), 4 deletions(-)
> ...
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index f4fbb7b5..074ff5b 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4384,6 +4384,23 @@ static void kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
>  	 */
>  	if (npages)
>  		atomic64_inc(&kvm->arch.mmio_update);
> +
> +	/*
> +	 * For change == KVM_MR_MOVE or KVM_MR_DELETE, higher levels
> +	 * have already called kvm_arch_flush_shadow_memslot() to
> +	 * flush shadow mappings. For KVM_MR_CREATE we have no
> +	 * previous mappings. So the only case to handle is
> +	 * KVM_MR_FLAGS_ONLY when the KVM_MEM_LOG_DIRTY_PAGES bit
> +	 * has been changed.
> +	 * For radix guests, we flush on setting KVM_MEM_LOG_DIRTY_PAGES
> +	 * to get rid of any THP PTEs in the partition-scoped page tables
> +	 * so we can track dirtiness at the page level; we flush when
> +	 * clearing KVM_MEM_LOG_DIRTY_PAGES so that we can go back to
> +	 * using THP PTEs.
> +	 */
> +	if (change == KVM_MR_FLAGS_ONLY && kvm_is_radix(kvm) &&
> +	    ((new->flags ^ old->flags) & KVM_MEM_LOG_DIRTY_PAGES))
> +		kvmppc_radix_flush_memslot(kvm, old);
>  }

Hi,

this part introduces a regression in the QEMU test suite.
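For anyone reading along without the full patch (the book3s_64_mmu_radix.c
hunk is elided above): the new kvmppc_radix_flush_memslot() basically walks
the memslot's guest real address range under kvm->mmu_lock and unmaps
whatever it finds there, so that later faults repopulate the range at the
wanted page size. Roughly (a sketch from memory, not the exact patch code):

void kvmppc_radix_flush_memslot(struct kvm *kvm,
                                const struct kvm_memory_slot *memslot)
{
        unsigned long n, gpa;
        unsigned int shift;
        pte_t *ptep;

        gpa = memslot->base_gfn << PAGE_SHIFT;
        spin_lock(&kvm->mmu_lock);
        for (n = memslot->npages; n; --n) {
                /* look up the partition-scoped PTE (any level) covering gpa */
                ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
                if (ptep && pte_present(*ptep))
                        kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
                                         kvm->arch.lpid);
                gpa += PAGE_SIZE;
        }
        spin_unlock(&kvm->mmu_lock);
}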
We have the following warning in the host kernel log:

[   96.631336] ------------[ cut here ]------------
[   96.631372] WARNING: CPU: 32 PID: 4095 at arch/powerpc/kvm/book3s_64_mmu_radix.c:436 kvmppc_unmap_free_pte+0x84/0x150 [kvm_hv]
[   96.631394] Modules linked in: xt_CHECKSUM nft_chain_nat xt_MASQUERADE nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 nft_counter nft_compat nf_tables nfnetlink tun bridge stp llc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rfkill kvm_hv kvm i2c_dev sunrpc ext4 mbcache jbd2 xts ofpart ses enclosure vmx_crypto scsi_transport_sas at24 ipmi_powernv powernv_flash ipmi_devintf opal_prd mtd ipmi_msghandler uio_pdrv_genirq uio ibmpowernv ip_tables xfs libcrc32c ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm sd_mod i40e t10_pi sg aacraid drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
[   96.631553] CPU: 32 PID: 4095 Comm: CPU 0/KVM Not tainted 5.7.0-rc3 #24
[   96.631571] NIP:  c008000007745d1c LR: c008000007746c20 CTR: c000000000098aa0
[   96.631598] REGS: c0000003b842f440 TRAP: 0700   Not tainted  (5.7.0-rc3)
[   96.631615] MSR:  900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 24222448  XER: 20040000
[   96.631648] CFAR: c008000007746c1c IRQMASK: 0
[   96.631648] GPR00: c008000007746c20 c0000003b842f6d0 c00800000775b300 c000000008a60000
[   96.631648] GPR04: c00000039b835200 c0000003aae00387 0000000000000001 000000009b835205
[   96.631648] GPR08: 0552839b03000080 0000000000000020 0000000000000005 c00800000774dd88
[   96.631648] GPR12: c000000000098aa0 c0000007fffdb000 0000000000000000 0000000000000000
[   96.631648] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   96.631648] GPR20: 0000000000000000 0000000000000001 8703e0aa030000c0 c00000000183d9d8
[   96.631648] GPR24: c00000000183d9e0 c00000039b835200 0000000000000001 c000000008a60000
[   96.631648] GPR28: 0000000000000001 c00000000183d9e8 0000000000000000 c00000039b835200
[   96.631784] NIP [c008000007745d1c] kvmppc_unmap_free_pte+0x84/0x150 [kvm_hv]
[   96.631813] LR [c008000007746c20] kvmppc_create_pte+0x948/0xa90 [kvm_hv]
[   96.631830] Call Trace:
[   96.631845] [c0000003b842f6d0] [00007fff60740000] 0x7fff60740000 (unreliable)
[   96.631866] [c0000003b842f730] [c008000007746c20] kvmppc_create_pte+0x948/0xa90 [kvm_hv]
[   96.631886] [c0000003b842f7d0] [c0080000077470f8] kvmppc_book3s_instantiate_page+0x270/0x5c0 [kvm_hv]
[   96.631920] [c0000003b842f8c0] [c008000007747618] kvmppc_book3s_radix_page_fault+0x1d0/0x300 [kvm_hv]
[   96.631951] [c0000003b842f970] [c008000007741f7c] kvmppc_book3s_hv_page_fault+0x1f4/0xd50 [kvm_hv]
[   96.631981] [c0000003b842fa60] [c00800000773da34] kvmppc_vcpu_run_hv+0x9bc/0xba0 [kvm_hv]
[   96.632006] [c0000003b842fb30] [c008000007c6aabc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[   96.632030] [c0000003b842fb50] [c008000007c6686c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[   96.632061] [c0000003b842fbe0] [c008000007c57fb8] kvm_vcpu_ioctl+0x340/0x7d0 [kvm]
[   96.632082] [c0000003b842fd50] [c0000000004c9598] ksys_ioctl+0xf8/0x150
[   96.632100] [c0000003b842fda0] [c0000000004c9618] sys_ioctl+0x28/0x80
[   96.632118] [c0000003b842fdc0] [c000000000034b34] system_call_exception+0x104/0x1d0
[   96.632146] [c0000003b842fe20] [c00000000000c970] system_call_common+0xf0/0x278
[   96.632165] Instruction dump:
[   96.632181] 7cda3378 7c9f2378 3bc00000 3b800001 e95d0000 7d295030 2f890000 419e0054
[   96.632211] 60000000 7ca0fc28 2fa50000 419e0028 <0fe00000>
78a55924 7f48d378 7fe4fb78
[   96.632241] ---[ end trace 4f8aa0280f1215fb ]---

The problem is detected with the "migration-test" test program in QEMU,
on a POWER9 host in radix mode with THP. It happens only the first time
the test program is run after loading kvm_hv.

The warning is an explicit detection [1] of a problem:

arch/powerpc/kvm/book3s_64_mmu_radix.c:

414 /*
415  * kvmppc_free_p?d are used to free existing page tables, and recursively
416  * descend and clear and free children.
417  * Callers are responsible for flushing the PWC.
418  *
419  * When page tables are being unmapped/freed as part of page fault path
420  * (full == false), ptes are not expected. There is code to unmap them
421  * and emit a warning if encountered, but there may already be data
422  * corruption due to the unexpected mappings.
423  */
424 static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t *pte, bool full,
425                                   unsigned int lpid)
426 {
427         if (full) {
428                 memset(pte, 0, sizeof(long) << RADIX_PTE_INDEX_SIZE);
429         } else {
430                 pte_t *p = pte;
431                 unsigned long it;
432
433                 for (it = 0; it < PTRS_PER_PTE; ++it, ++p) {
434                         if (pte_val(*p) == 0)
435                                 continue;
436                         WARN_ON_ONCE(1);
437                         kvmppc_unmap_pte(kvm, p,
438                                          pte_pfn(*p) << PAGE_SHIFT,
439                                          PAGE_SHIFT, NULL, lpid);
440                 }
441         }
442
443         kvmppc_pte_free(pte);
444 }

I can reproduce the problem in the QEMU 4.2 build directory with:

  sudo rmmod kvm_hv
  sudo modprobe kvm_hv
  make check-qtest-ppc64

or, once the test binary is built, with:

  sudo rmmod kvm_hv
  sudo modprobe kvm_hv
  export QTEST_QEMU_BINARY=ppc64-softmmu/qemu-system-ppc64
  tests/qtest/migration-test

The sub-test is "/migration/validate_uuid_error": it generates some memory
stress with a SLOF program in the source guest and fails because of a
mismatch with the destination guest, so the source guest continues to
execute the stress program while the migration is aborted.

Another way to reproduce the problem:

Source guest:

  sudo rmmod kvm-hv
  sudo modprobe kvm-hv
  qemu-system-ppc64 -display none -accel kvm -name source -m 256M \
      -nodefaults -prom-env use-nvramrc?=true \
      -prom-env 'nvramrc=hex ." _" begin 6400000 100000 \
      do i c@ 1 + i c! 1000 +loop ." B" 0 until' -monitor \
      stdio

Destination guest (on the same host):

  qemu-system-ppc64 -display none -accel kvm -name source -m 512M \
      -nodefaults -monitor stdio -incoming tcp:0:4444

Then, in the source guest monitor:

  (qemu) migrate tcp:localhost:4444

The migration intentionally fails because of a memory size mismatch, and the
warning shows up in the host kernel log. It was not detected earlier because
it only triggers with a test added in qemu-4.2.0 [2], run with THP enabled,
that starts a migration and lets it fail, so the memory stress program keeps
running in the source guest while the dirty log bitmap is released.
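To be explicit about what "releasing the dirty log bitmap" means on the KVM
side: QEMU clears KVM_MEM_LOG_DIRTY_PAGES on the RAM memslot with
KVM_SET_USER_MEMORY_REGION while the vCPU thread is still in KVM_RUN,
roughly like this (illustrative sketch only; stop_dirty_logging() and vm_fd
are made-up names, the slot parameters are the ones from the trace below):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Clear KVM_MEM_LOG_DIRTY_PAGES on slot 0 (256M of guest RAM at gpa 0).
 * This KVM_MR_FLAGS_ONLY update is what now makes the host kernel call
 * kvmppc_radix_flush_memslot(), concurrently with the radix page fault
 * path of the still-running vCPU.
 */
static int stop_dirty_logging(int vm_fd)
{
        struct kvm_userspace_memory_region region = {
                .slot            = 0,
                .flags           = 0,            /* was KVM_MEM_LOG_DIRTY_PAGES */
                .guest_phys_addr = 0x0,
                .memory_size     = 0x10000000,   /* 256M */
                .userspace_addr  = 0x7fff8fe00000ULL,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}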
The problem cannot be shown with qemu-5.0 because a change in the migration
speed has been added [3].

The problem seems to happen because QEMU modifies the memory map while a
vCPU is running:

(qemu) migrate tcp:localhost:4444
108264@1578573121.242431:kvm_run_exit cpu_index 0, reason 19
108264@1578573121.242454:kvm_vcpu_ioctl cpu_index 0, type 0x2000ae80, arg (nil)
108264@1578573121.248110:kvm_run_exit cpu_index 0, reason 19
108281@1578573121.248710:kvm_set_user_memory Slot#0 flags=0x1 gpa=0x0 size=0x10000000 ua=0x7fff8fe00000 ret=0
-> 108264@1578573121.249335:kvm_vcpu_ioctl cpu_index 0, type 0x2000ae80, arg (nil)
-> 108259@1578573121.253111:kvm_set_user_memory Slot#0 flags=0x0 gpa=0x0 size=0x10000000 ua=0x7fff8fe00000 ret=0
-> 108264@1578573121.256593:kvm_run_exit cpu_index 0, reason 19

This is part of the cleanup path after the migration has failed:

  migration_iteration_finish
    migrate_fd_cleanup_schedule()
      migrate_fd_cleanup_bh
        ...
          migrate_fd_cleanup
            qemu_savevm_state_cleanup
              ram_save_cleanup
                memory_global_dirty_log_stop
                  memory_global_dirty_log_do_stop
                    ..

If I pause the VM in qemu_savevm_state_cleanup(), while the save_cleanup()
handlers (ram_save_cleanup()) are being called, the warning disappears.

Any idea?

Thanks,
Laurent

[1] warning introduced by a5704e83aa3d ("KVM: PPC: Book3S HV: Recursively
    unmap all page table entries when unmapping")
[2] 3af31a3469 ("tests/migration: Add a test for validate-uuid capability")
[3] 97e1e06780 ("migration: Rate limit inside host pages")