With this patch applied, we are observing soft lockup and RCU stall issues on SNP guests with 128 vCPUs assigned and >=10GB guest memory allocations.

From the call stack dumps, it looks like migrate_pages() gets invoked for hugepages and triggers the invalidate_range_start() MMU notifiers; kvm_mmu_notifier_invalidate_range_start() in turn invokes sev_guest_memory_reclaimed(), which internally does wbinvd_on_all_cpus(). This can cause long delays, especially on systems with a large physical CPU count (the one we are testing on has 500 CPUs), which delays guest re-entry and causes soft lockups and RCU stalls in the guest.

Here are the kstack dumps of the vCPU thread(s) invoking the invalidate_range_start() MMU notifiers:

#1:
[ 913.377780] CPU: 79 PID: 6538 Comm: qemu-system-x86 Not tainted 5.19.0-rc5-next-20220706-sev-es-snp+ #380
[ 913.377783] Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ1002C 09/15/2022
[ 913.377785] Call Trace:
[ 913.377788]  <TASK>
[ 913.304300]  sev_guest_memory_reclaimed.cold+0x18/0x22
[ 913.304303]  kvm_arch_guest_memory_reclaimed+0x12/0x20
[ 913.304309]  kvm_mmu_notifier_invalidate_range_start+0x2af/0x2e0
[ 913.304312]  ? kvm_mmu_notifier_invalidate_range_end+0x101/0x1c0
[ 913.304314]  __mmu_notifier_invalidate_range_start+0x83/0x190
[ 913.304320]  try_to_migrate_one+0xba9/0xd80
[ 913.304326]  rmap_walk_anon+0x166/0x360
[ 913.304329]  rmap_walk+0x28/0x40
[ 913.304331]  try_to_migrate+0x92/0xd0
[ 913.304334]  ? try_to_unmap_one+0xe60/0xe60
[ 913.304336]  ? anon_vma_ctor+0x50/0x50
[ 913.304339]  ? page_get_anon_vma+0x80/0x80
[ 913.304341]  ? invalid_mkclean_vma+0x20/0x20
[ 913.304343]  migrate_pages+0x1276/0x1720
[ 913.304346]  ? do_pages_stat+0x310/0x310
[ 913.304348]  migrate_misplaced_page+0x5d0/0x820
[ 913.304351]  do_huge_pmd_numa_page+0x1f7/0x4b0
[ 913.304354]  __handle_mm_fault+0x66a/0x1040
[ 913.304358]  handle_mm_fault+0xe4/0x2d0
[ 913.304361]  __get_user_pages+0x1ea/0x710
[ 913.304363]  get_user_pages_unlocked+0xd0/0x340
[ 913.304365]  hva_to_pfn+0xf7/0x440
[ 913.304367]  __gfn_to_pfn_memslot+0x7f/0xc0
[ 913.304369]  kvm_faultin_pfn+0x95/0x280
[ 913.304373]  direct_page_fault+0x201/0x800
[ 913.304375]  kvm_tdp_page_fault+0x72/0x80
[ 913.304377]  kvm_mmu_page_fault+0x136/0x710
[ 913.304379]  ? kvm_complete_insn_gp+0x37/0x40
[ 913.304382]  ? svm_complete_emulated_msr+0x52/0x60
[ 913.304384]  ? kvm_emulate_wrmsr+0x6c/0x160
[ 913.304387]  ? sev_handle_vmgexit+0x115a/0x1600
[ 913.304390]  npf_interception+0x50/0xd0
[ 913.304391]  svm_invoke_exit_handler+0xf5/0x130
[ 913.304394]  svm_handle_exit+0x11c/0x230
[ 913.304396]  vcpu_enter_guest+0x832/0x12e0
[ 913.304396]  ? kvm_apic_local_deliver+0x6a/0x70
[ 913.304401]  ? kvm_inject_apic_timer_irqs+0x2c/0x70
[ 913.304403]  kvm_arch_vcpu_ioctl_run+0x105/0x680

#2:
[ 913.378680] CPU: 79 PID: 6538 Comm: qemu-system-x86 Not tainted 5.19.0-rc5-next-20220706-sev-es-snp+ #380
[ 913.378683] Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ1002C 09/15/2022
[ 913.378685] Call Trace:
[ 913.378687]  <TASK>
[ 913.378699]  sev_guest_memory_reclaimed.cold+0x18/0x22
[ 913.378702]  kvm_arch_guest_memory_reclaimed+0x12/0x20
[ 913.378707]  kvm_mmu_notifier_invalidate_range_start+0x2af/0x2e0
[ 913.378711]  __mmu_notifier_invalidate_range_start+0x83/0x190
[ 913.378715]  change_protection+0x11ec/0x1420
[ 913.378720]  ? kvm_release_pfn_clean+0x2f/0x40
[ 913.378722]  change_prot_numa+0x66/0xb0
[ 913.378724]  task_numa_work+0x22c/0x3b0
[ 913.378729]  task_work_run+0x72/0xb0
[ 913.378732]  xfer_to_guest_mode_handle_work+0xfc/0x100
[ 913.378738]  kvm_arch_vcpu_ioctl_run+0x422/0x680
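For reference, the reclaim hook at the top of both dumps is essentially just a broadcast WBINVD for SEV guests. Roughly (from my reading of arch/x86/kvm/svm/sev.c, so treat the sketch as approximate):

void sev_guest_memory_reclaimed(struct kvm *kvm)
{
	/* Nothing to do for non-SEV guests. */
	if (!sev_guest(kvm))
		return;

	/*
	 * Flush caches on every physical CPU. With hundreds of CPUs this
	 * takes a while, and it runs from
	 * kvm_mmu_notifier_invalidate_range_start(), i.e. while the
	 * faulting/protection-change path above still holds mm->mmap_lock.
	 */
	wbinvd_on_all_cpus();
}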
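Also, the task_numa_work() frame in dump #2 suggests this is being driven by automatic NUMA balancing. As far as I can tell, that scan selects ranges purely by VMA-level attributes and has no notion of per-page pins, so the pinned guest pages still get the PROT_NONE hinting protection and the subsequent hinting faults go down the do_huge_pmd_numa_page() -> migrate_misplaced_page() path seen in dump #1. A simplified sketch of my understanding (not the literal kernel code):

#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/mempolicy.h>

/*
 * Simplified sketch of the NUMA-hinting scan (task_numa_work() ->
 * change_prot_numa()) as I understand it; not the literal kernel code.
 */
static void numa_hinting_scan(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/*
		 * Only VMA-level attributes are checked here; nothing in
		 * this path knows that the pages backing a SEV/SNP guest
		 * are pinned.
		 */
		if (!vma_migratable(vma) || is_vm_hugetlb_page(vma) ||
		    (vma->vm_flags & VM_MIXEDMAP))
			continue;

		/*
		 * Installs PROT_NONE hinting protections (dump #2). The
		 * later hinting fault then tries migrate_misplaced_page()
		 * (dump #1), which fires invalidate_range_start() (and
		 * therefore the wbinvd) even though the migration itself
		 * presumably fails later because of the extra references
		 * held on the pinned pages.
		 */
		change_prot_numa(vma, vma->vm_start, vma->vm_end);
	}
}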
Additionally, this causes other vCPU threads handling #NPF to block, as the above code path(s) are holding mm->mmap_lock. Following are the kstack dumps of the blocked vCPU threads:

[ 316.969254] task:qemu-system-x86 state:D stack: 0 pid: 6939 ppid: 6908 flags:0x00000000
[ 316.969256] Call Trace:
[ 316.969257]  <TASK>
[ 316.969258]  __schedule+0x350/0x900
[ 316.969262]  schedule+0x52/0xb0
[ 316.969265]  rwsem_down_read_slowpath+0x271/0x4b0
[ 316.969267]  down_read+0x47/0xa0
[ 316.969269]  get_user_pages_unlocked+0x6b/0x340
[ 316.969273]  hva_to_pfn+0xf7/0x440
[ 316.969277]  __gfn_to_pfn_memslot+0x7f/0xc0
[ 316.969279]  kvm_faultin_pfn+0x95/0x280
[ 316.969283]  ? kvm_apic_send_ipi+0x9c/0x100
[ 316.969287]  direct_page_fault+0x201/0x800
[ 316.969290]  kvm_tdp_page_fault+0x72/0x80
[ 316.969293]  kvm_mmu_page_fault+0x136/0x710
[ 316.969296]  ? xas_load+0x35/0x40
[ 316.969299]  ? xas_find+0x187/0x1c0
[ 316.969301]  ? xa_find_after+0xf1/0x110
[ 316.969304]  ? kvm_pmu_trigger_event+0x5e/0x1e0
[ 316.969307]  ? sysvec_call_function+0x52/0x90
[ 316.969310]  npf_interception+0x50/0xd0

The invocation of migrate_pages() via the following code path does not seem right:

do_huge_pmd_numa_page
  migrate_misplaced_page
    migrate_pages

as all the guest memory for SEV/SNP VMs will be pinned/locked, so why is the page migration code path getting invoked at all?

Thanks,
Ashish