Re: [PATCH 4/8] KVM: gmem: protect kvm_mmu_invalidate_end()

"Kalra, Ashish" <ashish.kalra@xxxxxxx> · Mon, 21 Aug 2023 16:44:37 -0500

Hello Mingwei & Sean,

On 8/18/2023 9:08 PM, Mingwei Zhang wrote:
+Jacky Li

On Fri, Aug 18, 2023 at 3:45 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:

+Mingwei to correct me if I'm wrong

On Fri, Aug 18, 2023, Ashish Kalra wrote:

On 8/18/2023 12:55 PM, Sean Christopherson wrote:
On Tue, Aug 15, 2023, isaku.yamahata@xxxxxxxxx wrote:
From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>

kvm_mmu_invalidate_end() updates struct kvm::mmu_invalidate_in_progress
and it's protected by kvm::mmu_lock.  call kvm_mmu_invalidate_end() before
unlocking it. Not after the unlock.

Fixes: 8e9009ca6d14 ("KVM: Introduce per-page memory attributes")

This fixes is wrong.  It won't matter in the long run, but it makes my life that
much harder.

Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
---
   virt/kvm/kvm_main.c | 15 ++++++++++++++-
   1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8bfeb615fc4d..49380cd62367 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -535,6 +535,7 @@ struct kvm_mmu_notifier_range {
           } arg;
           gfn_handler_t handler;
           on_lock_fn_t on_lock;
+ on_unlock_fn_t before_unlock;
           on_unlock_fn_t on_unlock;

Ugh, shame on my past me.  Having on_lock and on_unlock be asymmetrical with respect
to the lock is nasty.

I would much rather we either (a) be explicit, e.g. before_(un)lock and after_(un)lock,
or (b) have just on_(un)lock, make them symetrical, and handle the SEV mess a
different way.

The SEV hook doesn't actually care about running immediately after unlock, it just
wants to know if there was an overlapping memslot.  It can run after SRCU is dropped,
because even if we make the behavior more precise (right now it blasts WBINVD),
just having a reference to memslots isn't sufficient, the code needs to guarantee
memslots are *stable*.  And that is already guaranteed by the notifier code, i.e.
the SEV code could just reacquire SRCU.

On a separate note here, the SEV hook blasting WBINVD is still causing
serious performance degradation issues with SNP triggered via
AutoNUMA/numad/KSM, etc. With reference to previous discussions related to
it, we have plans to replace WBINVD with CLFLUSHOPT.

Isn't the flush unnecessary when freeing shared memory?  My recollection is that
the problematic scenario is when encrypted memory is freed back to the host,
because KVM already flushes when potentially encrypted mapping memory into the
guest.

With SNP+guest_memfd, private/encrypted memory should be unreachabled via the
hva-based mmu_notifiers.  gmem should have full control of the page lifecycles,
i.e. can get the kernel virtual address as appropriated, and so it SNP shouldn't
need the nuclear option.

E.g. something like this?

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 07756b7348ae..1c6828ae391d 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)

  void sev_guest_memory_reclaimed(struct kvm *kvm)
  {
-       if (!sev_guest(kvm))
+       if (!sev_guest(kvm) || sev_snp_guest(kvm))
                 return;

         wbinvd_on_all_cpus();

I hope this is the final solution :)

So, short answer: no.

SNP+guest_memfd prevent untrusted host user space from directly
modifying the data, this is good enough for CVE-2022-0171, but there
is no such guarantee that the host kernel in some scenarios could
access the data and generate dirty caches. In fact, AFAIC, SNP VM does
not track whether each page is previously shared, isn't it? If a page
was previously shared and was written by the host kernel or devices
before it was changed to private. No one tracks it and dirty caches
are there!

So, to avoid any corner case situations like the above, it seems
currently we have to retain the property: flushing the cache when the
guest memory mapping leaves KVM NPT.

Of course, this is fundamentally because SME_COHERENT only applies to
CPU cores, but not DMA. If SME_COHERENT is complete, flushing is no
longer needed. Alternatively, we need extra bookkeeping for KVM to
know whether each page has dirty cache lines. Another alternative is
to filter mmu_notifier reasons, which is the part that I am planning
to take. thoughts?


Now running SNP+guest_memfd with discard=both option enabled:

# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C

@[
    sev_guest_memory_reclaimed+5
    kvm_mmu_notifier_release+60
    __mmu_notifier_release+128
    exit_mmap+657
    __mmput+72
    mmput+49
    do_exit+752
    do_group_exit+57
    get_signal+2486
    arch_do_signal_or_restart+51
    exit_to_user_mode_prepare+257
    syscall_exit_to_user_mode+42
    do_syscall_64+109
    entry_SYSCALL_64_after_hwframe+114
]: 1
@[
    sev_guest_memory_reclaimed+5
    kvm_mmu_notifier_invalidate_range_start+869
    __mmu_notifier_invalidate_range_start+152
    change_protection+4628
    change_prot_numa+93
    task_numa_work+588
    task_work_run+108
    exit_to_user_mode_prepare+337
    syscall_exit_to_user_mode+42
    do_syscall_64+109
    entry_SYSCALL_64_after_hwframe+114
]: 2
@[
    sev_guest_memory_reclaimed+5
    kvm_mmu_notifier_invalidate_range_start+869
    __mmu_notifier_invalidate_range_start+152
    change_protection+4628
    change_prot_numa+93
    task_numa_work+588
    task_work_run+108
    xfer_to_guest_mode_handle_work+228
    kvm_arch_vcpu_ioctl_run+1572
    kvm_vcpu_ioctl+671
    __x64_sys_ioctl+153
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+114
]: 2
@[
    sev_guest_memory_reclaimed+5
    kvm_set_memslot+740
    __kvm_set_memory_region.part.0+411
    kvm_set_memory_region+89
    kvm_vm_ioctl+1482
    __x64_sys_ioctl+153
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+114
]: 104
@[
    sev_guest_memory_reclaimed+5
    kvm_mmu_notifier_invalidate_range_start+869
    __mmu_notifier_invalidate_range_start+152
    zap_page_range_single+384
    unmap_mapping_range+279
    shmem_fallocate+932
    vfs_fallocate+345
    __x64_sys_fallocate+71
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+114
]: 5465
@[
    sev_guest_memory_reclaimed+5
    kvm_mmu_notifier_invalidate_range_start+869
    __mmu_notifier_invalidate_range_start+152
    zap_page_range_single+384
    madvise_vma_behavior+1967
    madvise_walk_vmas+190
    do_madvise.part.0+264
    __x64_sys_madvise+98
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+114
]: 69677

The maximum hits are seen with shmem_fallocate and madvise, which we 
believe are response to shared<->private
GHCB page-state-chage requests. discard=both handles discard both for
private and shared memory, so freeing shared memory
via fallocate(shared_memfd, FALLOC_FL_PUNCH_HOLE, ...) would trigger the
notifiers when freeing shared pages after guest converts a GPA to
private.

Now, as with SNP+guest_memfd, guest private memory is not mapped in host 
anymore, so i added a generic fix (instead of Sean's proposed patch of 
checking for SNP guest inside sev_guest_memory_reclaimed()):

--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -593,6 +593,9 @@ static __always_inline int 
__kvm_handle_hva_range(struct kvm *kvm,
                        unsigned long hva_start, hva_end;

                        slot = container_of(node, struct 
kvm_memory_slot, hva_node[slots->node_idx]);
+                       if (kvm_slot_can_be_private(slot)) {
+                               continue;
+                       }
                        hva_start = max(range->start, 
slot->userspace_addr);
                        hva_end = min(range->end, slot->userspace_addr +
                                                  (slot->npages << 
PAGE_SHIFT));

With this fix added, the traces are as follows:

# bpftrace -e 'kprobe:sev_guest_memory_reclaimed {@[kstack]=count()}'
Attaching 1 probe...
^C

@[
    sev_guest_memory_reclaimed+5
    kvm_mmu_notifier_invalidate_range_start+812
    __mmu_notifier_invalidate_range_start+152
    change_protection+4628
    change_prot_numa+93
    task_numa_work+588
    task_work_run+108
    exit_to_user_mode_prepare+337
    syscall_exit_to_user_mode+42
    do_syscall_64+109
    entry_SYSCALL_64_after_hwframe+114
]: 1
@[
    sev_guest_memory_reclaimed+5
    kvm_mmu_notifier_release+60
    __mmu_notifier_release+128
    exit_mmap+657
    __mmput+72
    mmput+49
    do_exit+752
    do_group_exit+57
    get_signal+2486
    arch_do_signal_or_restart+51
    exit_to_user_mode_prepare+257
    syscall_exit_to_user_mode+42
    do_syscall_64+109
    entry_SYSCALL_64_after_hwframe+114
]: 1
@[
    sev_guest_memory_reclaimed+5
    kvm_mmu_notifier_invalidate_range_start+812
    __mmu_notifier_invalidate_range_start+152
    change_protection+4628
    change_prot_numa+93
    task_numa_work+588
    task_work_run+108
    xfer_to_guest_mode_handle_work+228
    kvm_arch_vcpu_ioctl_run+1572
    kvm_vcpu_ioctl+671
    __x64_sys_ioctl+153
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+114
]:
@[
    sev_guest_memory_reclaimed+5
    kvm_set_memslot+740
    __kvm_set_memory_region.part.0+411
    kvm_set_memory_region+89
    kvm_vm_ioctl+1482
    __x64_sys_ioctl+153
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+114
]: 104
#

As expected, the SEV hook is not invoked for the guest private memory 
pages (no more invalidation from shmem_fallocate() + madvise()).

Isn't it better to skip invoking the KVM MMU invalidation notifier when 
the invalidated range belongs to guest private memory ?

> In fact, AFAIC, SNP VM does
> not track whether each page is previously shared, isn't it? If a page
> was previously shared and was written by the host kernel or devices
> before it was changed to private. No one tracks it and dirty caches
> are there!

The skipped invalidation here covered the case Mingwei mentioned above, 
where the pages are changed from private->shared and subsequent freeing 
of shared pages triggered the invalidation.

But, then why are we concerned about this, i thought we have concerns 
about the case where the dirty cache lines contain encrypted guest data ?

Thanks,
Ashish