On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote:
> > On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:
> >
> > To answer your original question about memory 'conversion', the key
> > thing is that the pKVM hypervisor controls the stage-2 page-tables for
> > everyone in the system, all guests as well as the host. As such, a page
> > 'conversion' is nothing more than a permission change in the relevant
> > page-tables.
> >
>
> So I can see two different ways to approach this.
>
> One is that you split the whole address space in half and, just like
> SEV and TDX, allocate one bit to indicate the shared/private status of
> a page. This makes it work a lot like SEV and TDX.
>
> The other is to have shared and private pages be distinguished only by
> their hypercall history and the (protected) page tables. This saves
> some address space and some page table allocations, but it opens some
> cans of worms too. In particular, the guest and the hypervisor need to
> coordinate, in a way that the guest can trust, to ensure that the
> guest's idea of which pages are private match the host's. This model
> seems a bit harder to support nicely with the private memory fd model,
> but not necessarily impossible.

Right. Perhaps one thing I should clarify as well: pKVM (as opposed to
TDX) has only _one_ page-table per guest, and it is controlled by the
hypervisor only. So the hypervisor needs to be involved for both shared
and private mappings. As such, shared pages have relatively similar
constraints when it comes to host mm stuff -- we can't migrate shared
pages or swap them out without getting the hypervisor involved.

> Also, what are you trying to accomplish by having the host userspace
> mmap private pages?

What I would really like to have is non-destructive in-place
conversions of pages.
mmap-ing the pages that have been shared back felt like a good fit for
the private=>shared conversion, but in fact I'm not all that opinionated
about the API as long as the behaviour and the performance are there.
Happy to look into alternatives.

FWIW, there are a couple of reasons why I'd like to have in-place
conversions:

 - one goal of pKVM is to migrate some things away from the Arm
   TrustZone environment (e.g. DRM and the like) and into protected VMs
   instead. This will give Linux a fighting chance to defend itself
   against these things -- they currently have access to _all_ memory.
   And transitioning pages between Linux and TrustZone (donations and
   shares) is fast and non-destructive, so we really do not want pKVM
   to regress by requiring the hypervisor to memcpy things;

 - it can be very useful for protected VMs to do shared=>private
   conversions. Think of a VM receiving some data from the host in a
   shared buffer, which it then wants to operate on without the risk of
   leaking confidential information in a transient state. In that case
   the most logical thing to do is to convert the buffer back to
   private, do whatever needs to be done on that buffer (decrypting a
   frame, ...), and then share it back with the host to consume it;

 - similar to the previous point, a protected VM might want to
   temporarily turn a buffer private to avoid ToCToU issues;

 - once we're able to do device assignment to protected VMs, this might
   allow DMA-ing to a private buffer, and make it shared later w/o
   bouncing.

And there is probably more.

IIUC, the private fd proposal as it stands requires shared and private
pages to come from entirely distinct places. So it's not entirely clear
to me how any of the above could be supported without having the
hypervisor memcpy the data during conversions, which I really don't want
to do for performance reasons.

> Is the idea that multiple guest could share the same page until such
> time as one of them tries to write to it?

That would certainly be possible to implement in the pKVM environment
with the right tracking, so I think it is worth considering as a future
goal.

Thanks,
Quentin
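
As a rough illustration of the shared=>private=>shared flow discussed
above, here is a toy user-space model of an in-place conversion. It is
only a sketch and not pKVM code: struct toy_page, the page_state values
and every function name below are hypothetical, and the "permission
change" is reduced to flipping a state field. The point it tries to show
is that a conversion touches permissions only, so the page contents
survive without any memcpy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical per-page state; not pKVM's actual tracking structures. */
enum page_state {
	PAGE_GUEST_PRIVATE,	/* accessible to the guest only */
	PAGE_GUEST_SHARED,	/* accessible to both guest and host */
};

struct toy_page {
	enum page_state state;
	uint8_t data[TOY_PAGE_SIZE];
};

/* Stand-in for the host's stage-2 permission check. */
static bool host_can_access(const struct toy_page *p)
{
	return p->state == PAGE_GUEST_SHARED;
}

/* private=>shared: only the state changes, the data is left in place. */
static int guest_share_page(struct toy_page *p)
{
	if (p->state != PAGE_GUEST_PRIVATE)
		return -1;
	p->state = PAGE_GUEST_SHARED;
	return 0;
}

/* shared=>private: again a pure permission change, no copy. */
static int guest_unshare_page(struct toy_page *p)
{
	if (p->state != PAGE_GUEST_SHARED)
		return -1;
	p->state = PAGE_GUEST_PRIVATE;
	return 0;
}

/* Host read attempt; fails while the page is private to the guest. */
static int host_read(const struct toy_page *p, size_t off, uint8_t *out)
{
	if (!host_can_access(p) || off >= TOY_PAGE_SIZE)
		return -1;
	*out = p->data[off];
	return 0;
}
```

In this model, a guest that receives data in a shared page can call
guest_unshare_page(), transform the buffer while host_read() fails, and
then call guest_share_page() so the host sees the result in the same
page.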