On Tue, Dec 3, 2024 at 4:27 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Apr 11, 2024, Zheyun Shen wrote:
> >
> > + * that are running a SEV guest and be used in memory reclamation.
> > + *
> > + * Migrating vCPUs between pCPUs is tricky. We cannot clear
> > + * this mask each time reclamation finishes and record it again
> > + * before VMRUN, because we cannot guarantee the pCPU will exit
> > + * to VMM before the next reclamation happens.
>
> Migration is easy enough to solve (I think; famous last words). KVM is already
> forcing an exit to service the IPI, just set the associated pCPU's bit if it has
> a running vCPU loaded.
>
> However, to play nice with multiple flushers, we'd need something like
> kvm_recalculate_apic_map() to ensure subsequent flushers wait for previous
> flushers to finish before reading the cpumask. Maybe a simple mutex would
> suffice? Contention should be extremely rare for well-behaved setups.
>
> Kevin, since I believe your use case cares about vCPU migration, is this
> something you'd be interested in tackling? It can go on top, i.e. I don't think
> this base series needs to be held up for fancier migration handling; it's a clear
> improvement over blasting WBINVD to all CPUs.

I'm happy to take a look, and I agree with your initial assessment of what
needs to be done. I also agree that we don't need to hold up this series for
it, nor should we hold up [0] ("KVM: SEV: Prefer WBNOINVD over WBINVD for
cache maintenance efficiency"), assuming that patch series checks out, of
course.

[0] https://lore.kernel.org/lkml/20241203005921.1119116-1-kevinloughlin@xxxxxxxxxx/
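To make sure I'm reading the proposal correctly, here's the rough shape of
what I'd look into (completely untested, and the names wbinvd_lock and
wbinvd_dirty_mask are placeholders of mine rather than whatever this series
ends up using):

	static void sev_wbinvd_ipi(void *dirty_mask)
	{
		wbinvd();

		/*
		 * If this pCPU still has a vCPU loaded, it can re-dirty
		 * the cache as soon as it re-enters the guest, so re-set
		 * its bit for the next flusher.
		 */
		if (kvm_get_running_vcpu())
			cpumask_set_cpu(smp_processor_id(), dirty_mask);
	}

	static void sev_do_wbinvd(struct kvm *kvm)
	{
		struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
		cpumask_var_t snapshot;

		/* Allocation-failure handling elided for brevity. */
		if (!zalloc_cpumask_var(&snapshot, GFP_KERNEL))
			return;

		/*
		 * The mutex makes subsequent flushers wait for in-flight
		 * flushes to finish before reading the cpumask, per your
		 * suggestion; contention should be rare for well-behaved
		 * setups.
		 */
		mutex_lock(&sev->wbinvd_lock);
		cpumask_copy(snapshot, sev->wbinvd_dirty_mask);
		cpumask_clear(sev->wbinvd_dirty_mask);
		on_each_cpu_mask(snapshot, sev_wbinvd_ipi,
				 sev->wbinvd_dirty_mask, 1);
		mutex_unlock(&sev->wbinvd_lock);

		free_cpumask_var(snapshot);
	}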
> > @@ -2048,7 +2087,7 @@ int sev_mem_enc_unregister_region(struct kvm *kvm,
> >  	 * releasing the pages back to the system for use. CLFLUSH will
> >  	 * not do this, so issue a WBINVD.
> >  	 */
> > -	wbinvd_on_all_cpus();
> > +	sev_do_wbinvd(kvm);
>
> Hmm, I am not convinced that optimizing sev_mem_enc_unregister_region() is worth
> doing. Nothing here prevents a vCPU from racing with unregistering the region.
> That said, this path isn't exactly safe as it is, because KVM essentially relies
> on userspace to do the right thing. And userspace can only corrupt itself,
> because the memory hasn't actually been freed, just unpinned. If userspace hasn't
> ensured the guest can't access the memory, it's already hosed, so I suppose we
> might as well, because why not.

Yeah, I see your point, but I likewise vote "might as well" for now. There's
some additional (arguably orthogonal) cleanup that could be done here, which
I believe is best handled in a separate series (discussed below).

> > @@ -2152,7 +2191,7 @@ void sev_vm_destroy(struct kvm *kvm)
> >  	 * releasing the pages back to the system for use. CLFLUSH will
> >  	 * not do this, so issue a WBINVD.
> >  	 */
> > -	wbinvd_on_all_cpus();
> > +	sev_do_wbinvd(kvm);
>
> I am 99% certain this wbinvd_on_all_cpus() can simply be dropped. sev_vm_destroy()
> is called after KVM's mmu_notifier has been unregistered, which means it's called
> after kvm_mmu_notifier_release() => kvm_arch_guest_memory_reclaimed().

I think we need a bit of rework before we can drop it (which I propose we do
in a separate series), but let me know if there's a mistake in my reasoning
here...

Right now, sev_guest_memory_reclaimed() issues writebacks for SEV and SEV-ES
guests but does *not* issue writebacks for SEV-SNP guests. Thus, I believe
it's possible for a SEV-SNP guest to reach sev_vm_destroy() with dirty
encrypted lines still in processor caches.

Because SME_COHERENT doesn't guarantee coherence across CPU-DMA interactions
(see commit d45829b351ee ("KVM: SVM: Flush when freeing encrypted pages even
on SME_COHERENT CPUs")), it seems possible that the memory then gets
re-allocated for DMA, written with (unencrypted) DMA data, and subsequently
corrupted when the stale dirty encrypted lines are written back over it,
right? And potentially the same reasoning explains why we can't yet drop the
writeback in sev_flush_encrypted_page() without a bit of rework?

It's true that the SNP firmware requires a WBINVD before SNP_DF_FLUSH [1],
but I believe we currently only do that when an ASID is recycled, *not* when
an ASID is deactivated.

[1] https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/specifications/56860.pdf
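For reference, my understanding is that the WBINVD+DF_FLUSH pairing only
happens today on the ASID recycling path, which looks roughly like the
following (from memory, so treat this as a sketch rather than the exact
upstream code):

	/* Called when KVM runs out of free ASIDs and must recycle them. */
	static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid)
	{
		int ret, error = 0;

		/*
		 * Write back (possibly dirty, encrypted) cache lines
		 * before asking firmware to flush, as required ahead of
		 * DF_FLUSH/SNP_DF_FLUSH.
		 */
		wbinvd_on_all_cpus();

		/* Ask the PSP firmware to flush the data fabric. */
		ret = sev_guest_df_flush(&error);
		if (ret)
			pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n",
			       ret, error);

		return ret;
	}

Nothing equivalent runs when an ASID is merely deactivated, hence my
hesitation about dropping the writeback in sev_vm_destroy() as-is.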