On 19.04.21 20:09, Sean Christopherson wrote:
On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
On Mon, Apr 19, 2021 at 04:01:46PM +0000, Sean Christopherson wrote:
But fundamentally the private pages are, well, private. They can't be shared
across processes, so I think we could (should?) require the VMA to always be
MAP_PRIVATE. Does that buy us enough to rely on the VMA alone? I.e. is that
enough to prevent userspace and unaware kernel code from acquiring a reference
to the underlying page?
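
Concretely, that requirement would boil down to something like the sketch
below when the memslot backing guest private memory is registered. The hook
point and the helper name are made up; it's purely to illustrate the
MAP_PRIVATE check:

#include <linux/kvm_host.h>
#include <linux/mm.h>

/*
 * Hypothetical check for a memslot that is meant to back guest private
 * memory: require a private (non-MAP_SHARED) VMA covering the range, so
 * that no other process can legitimately map the same pages.
 */
static int kvm_check_private_memslot_vma(struct kvm *kvm,
                                         unsigned long hva,
                                         unsigned long size)
{
        struct vm_area_struct *vma;
        int ret = 0;

        mmap_read_lock(kvm->mm);
        vma = find_vma(kvm->mm, hva);
        if (!vma || hva < vma->vm_start || hva + size > vma->vm_end)
                ret = -EINVAL;
        else if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE))
                ret = -EINVAL;  /* must be MAP_PRIVATE */
        mmap_read_unlock(kvm->mm);

        return ret;
}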
Shared pages should be fine too (you folks wanted tmpfs support).
Is that a conflict though? If the private->shared conversion request is kicked
out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?
Allowing MAP_SHARED for guest private memory feels wrong. The data can't be
shared, and dirty data can't be written back to the file.
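
To make the loophole concrete: nothing would stop the VMM from doing roughly
this with whatever fd backs the guest memory (illustrative userspace snippet
only):

#include <sys/mman.h>

/*
 * Illustrative only: once the conversion is punted to userspace, the VMM
 * can simply map the same backing file MAP_SHARED again.
 */
static void *remap_as_shared(int fd, size_t size)
{
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}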
The poisoned pages must be useless outside of the process with the blessed
struct kvm. See kvm_pfn_map in the patch.
The big requirement for kernel TDX support is that the pages are useless in the
host. Regarding the guest, for TDX, the TDX Module guarantees that at most a
single KVM guest can have access to a page at any given time. I believe the RMP
provides the same guarantees for SEV-SNP.
SEV/SEV-ES could still end up with corruption if multiple guests map the same
private page, but that's obviously not the end of the world since it's the status
quo today. Living with that shortcoming might be a worthy tradeoff if punting
mutual exclusion between guests to firmware/hardware allows us to simplify the
kernel implementation.
- Add a new GUP flag to retrieve such pages from the userspace mapping.
Used only for private mapping population.
- Shared gfn ranges managed by userspace, based on hypercalls from the
guest.
- Shared mappings get populated via normal VMA. Any poisoned pages here
would lead to SIGBUS.
So far it looks pretty straight-forward.
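
In that scheme the private population path would look roughly like the
sketch below (FOLL_GUEST is just a placeholder name for the new GUP flag,
and the helper is heavily simplified):

#include <linux/kvm_host.h>
#include <linux/mm.h>

/*
 * Rough sketch: populate a private mapping by pinning the page through
 * the userspace VMA with the new GUP flag (FOLL_GUEST is a placeholder
 * name here), so that only this path takes a reference to the page.
 */
static kvm_pfn_t private_hva_to_pfn(unsigned long hva)
{
        struct page *page;
        long npinned;

        npinned = pin_user_pages_unlocked(hva, 1, &page,
                                          FOLL_WRITE | FOLL_GUEST);
        if (npinned != 1)
                return KVM_PFN_ERR_FAULT;

        return page_to_pfn(page);
}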
The only thing that I don't understand is at what point the page gets tied
to the KVM instance. Currently we do it just before populating shadow
entries, but that would not work with the new scheme: as we poison pages
on fault, they may never get inserted into shadow entries. That's not
good, as we rely on that info to unpoison the page on free.
Can you elaborate on what you mean by "unpoison"? If the page is never actually
mapped into the guest, then its poisoned status is nothing more than a software
flag, i.e. nothing extra needs to be done on free.
Normally, the poisoned flag is preserved for freed pages as it usually
indicates a hardware issue. In this case we need to return the page to
normal circulation. So we need a way to differentiate the two kinds of
page poison. The current patch does this by adding the page's pfn to
kvm_pfn_map. But this will not work if we uncouple poisoning from
adding to the shadow PTE.
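
Concretely, on the free path that means something like the sketch below
(the lookup helper is made up; only kvm_pfn_map is from the patch):

#include <linux/mm.h>

/*
 * Sketch of differentiating the two poison flavors on free: only clear
 * PG_hwpoison if the pfn was poisoned for guest-private use (tracked via
 * kvm_pfn_map in the patch); otherwise it is a real memory error and the
 * page must stay out of circulation.  kvm_pfn_map_contains() is a
 * made-up lookup helper.
 */
static bool kvm_unpoison_on_free(struct page *page)
{
        if (!PageHWPoison(page))
                return false;

        if (!kvm_pfn_map_contains(page_to_pfn(page)))
                return false;   /* genuine hardware poison, keep it */

        ClearPageHWPoison(page);
        return true;
}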
Why use PG_hwpoison then?
I already raised that reusing PG_hwpoison is not what we want. And I
repeat, to me this all looks like a big hack; some things you (Sean)
propose sound cleaner, at least to me.
--
Thanks,
David / dhildenb