Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

Quentin Perret <qperret@xxxxxxxxxx> · Mon, 4 Apr 2022 15:01:21 +0000

On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote:
> > On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:
> 
> 
> > To answer your original question about memory 'conversion', the key
> > thing is that the pKVM hypervisor controls the stage-2 page-tables for
> > everyone in the system, all guests as well as the host. As such, a page
> > 'conversion' is nothing more than a permission change in the relevant
> > page-tables.
> >
> 
> So I can see two different ways to approach this.
> 
> One is that you split the whole address space in half and, just like SEV and TDX, allocate one bit to indicate the shared/private status of a page.  This makes it work a lot like SEV and TDX.
>
> The other is to have shared and private pages be distinguished only by their hypercall history and the (protected) page tables.  This saves some address space and some page table allocations, but it opens some cans of worms too.  In particular, the guest and the hypervisor need to coordinate, in a way that the guest can trust, to ensure that the guest's idea of which pages are private match the host's.  This model seems a bit harder to support nicely with the private memory fd model, but not necessarily impossible.

Right. Perhaps one thing I should clarify as well: pKVM (as opposed to
TDX) has only _one_ page-table per guest, and it is controllex by the
hypervisor only. So the hypervisor needs to be involved for both shared
and private mappings. As such, shared pages have relatively similar
constraints when it comes to host mm stuff --  we can't migrate shared
pages or swap them out without getting the hypervisor involved.

> Also, what are you trying to accomplish by having the host userspace mmap private pages?

What I would really like to have is non-destructive in-place conversions
of pages. mmap-ing the pages that have been shared back felt like a good
fit for the private=>shared conversion, but in fact I'm not all that
opinionated about the API as long as the behaviour and the performance
are there. Happy to look into alternatives.

FWIW, there are a couple of reasons why I'd like to have in-place
conversions:

 - one goal of pKVM is to migrate some things away from the Arm
   Trustzone environment (e.g. DRM and the likes) and into protected VMs
   instead. This will give Linux a fighting chance to defend itself
   against these things -- they currently have access to _all_ memory.
   And transitioning pages between Linux and Trustzone (donations and
   shares) is fast and non-destructive, so we really do not want pKVM to
   regress by requiring the hypervisor to memcpy things;

 - it can be very useful for protected VMs to do shared=>private
   conversions. Think of a VM receiving some data from the host in a
   shared buffer, and then it wants to operate on that buffer without
   risking to leak confidential informations in a transient state. In
   that case the most logical thing to do is to convert the buffer back
   to private, do whatever needs to be done on that buffer (decrypting a
   frame, ...), and then share it back with the host to consume it;

 - similar to the previous point, a protected VM might want to
   temporarily turn a buffer private to avoid ToCToU issues;

 - once we're able to do device assignment to protected VMs, this might
   allow DMA-ing to a private buffer, and make it shared later w/o
   bouncing.

And there is probably more.

IIUC, the private fd proposal as it stands requires shared and private
pages to come from entirely distinct places. So it's not entirely clear
to me how any of the above could be supported without having the
hypervisor memcpy the data during conversions, which I really don't want
to do for performance reasons.

> Is the idea that multiple guest could share the same page until such time as one of them tries to write to it?

That would certainly be possible to implement in the pKVM
environment with the right tracking, so I think it is worth considering
as a future goal.

Thanks,
Quentin