Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private

Peter Xu <peterx@xxxxxxxxxx> · Wed, 16 Oct 2024 16:16:17 -0400

On Wed, Oct 16, 2024 at 10:45:43AM +0200, David Hildenbrand wrote:
> On 16.10.24 01:42, Ackerley Tng wrote:
> > Peter Xu <peterx@xxxxxxxxxx> writes:
> > 
> > > On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
> > > > Peter Xu <peterx@xxxxxxxxxx> writes:
> > > > 
> > > > > On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
> > > > > > The faultability xarray is stored on the inode since faultability is a
> > > > > > property of the guest_memfd's memory contents.
> > > > > > 
> > > > > > In this RFC, presence of an entry in the xarray indicates faultable,
> > > > > > but this could be flipped so that presence indicates unfaultable. For
> > > > > > flexibility, a special value "FAULT" is used instead of a simple
> > > > > > boolean.
> > > > > > 
> > > > > > However, at some stages of a VM's lifecycle there could be more
> > > > > > private pages, and at other stages there could be more shared pages.
> > > > > > 
> > > > > > This is likely to be replaced by a better data structure in a future
> > > > > > revision to better support ranges.
> > > > > > 
> > > > > > Also store struct kvm_gmem_hugetlb in struct kvm_gmem_private as a
> > > > > > pointer. inode->i_mapping->i_private_data.
> > > > > 
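
[Illustration] The "presence of an entry means faultable" convention quoted
above can be modeled in a few lines of userspace C. This is only a sketch
under stated assumptions: a fixed-size array of slots stands in for the
kernel xarray, and every name here (gmem_model, FAULT_ENTRY, model_*) is
hypothetical and not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 16

/* Sentinel standing in for the special "FAULT" value (hypothetical). */
static int fault_sentinel;
#define FAULT_ENTRY ((void *)&fault_sentinel)

/* Toy stand-in for the per-inode faultability xarray. */
struct gmem_model {
	void *slots[MODEL_NR_PAGES];	/* NULL: no entry, i.e. not faultable */
};

/* Store or erase the entry for @index. Returns 0, or -1 if out of range. */
static int model_set_faultable(struct gmem_model *m, size_t index,
			       bool faultable)
{
	if (index >= MODEL_NR_PAGES)
		return -1;
	m->slots[index] = faultable ? FAULT_ENTRY : NULL;
	return 0;
}

/* A page is faultable only if an entry is present at its index. */
static bool model_is_faultable(const struct gmem_model *m, size_t index)
{
	return index < MODEL_NR_PAGES && m->slots[index] == FAULT_ENTRY;
}
```

Using a sentinel pointer rather than a bool per slot mirrors the stated
motivation: the meaning of "entry present" could be flipped, or more values
added, without changing the storage.
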
> > > > > Could you help explain the difference between faultability v.s. the
> > > > > existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
> > > > > confused, otherwise might be good to enrich the commit message.
> > > > 
> > > > Thank you for this question, I'll add this to the commit message to the
> > > > next revision if Fuad's patch set [1] doesn't make it first.
> > > > 
> > > > Reason (a): To elaborate on the explanation in [1],
> > > > KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
> > > > private or shared, and faultability is whether the page is allowed to be
> > > > faulted in by userspace.
> > > > 
> > > > These two are similar but may not be the same thing. In pKVM, pKVM
> > > > cannot trust userspace's configuration of private/shared, and other
> > > > information will go into determining the private/shared setting in
> > > > faultability.
> > > 
> > > It makes sense to me that the kernel has the right to decide which page is
> > > shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
> > > case is most / all pages are private, until some requests to share them for
> > > special purposes (like DMA).  But that'll need to be initiated as a request
> > > from the guest not the userspace hypervisor.
> > 
> > For TDX, the plan is that the guest will request the page to be remapped
> > as shared or private, and the handler for that request will exit to
> > the userspace VMM.
> > 
> > The userspace VMM will then do any necessary coordination (e.g. for a
> > shared to private conversion it may need to unpin pages from DMA), and
> > then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
> > the guest's requested conversion. This is where
> > KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.
> > 
> > Patch 38 [1] updates
> > tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
> > demonstrate the usage flow for x86.
> > 
> > Fuad will be in a better position to explain the flow for pKVM.
> > 
> > > I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
> > > planned to be used in the future. Currently it's always set at least in
> > > QEMU if gmemfd is enabled, so it doesn't yet tell me anything..
> > > 
> > > If it's driven by the userspace side of the hypervisor, I wonder when
> > > should the user app request a value different from the current one, if the
> > > kernel already has an answer in this case.  It made me even more confused,
> > > as we have this in the API doc:
> > > 
> > >          Note, there is no "get" API.  Userspace is responsible for
> > >          explicitly tracking the state of a gfn/page as needed.
> > > 
> > > And I do wonder whether we will still need some API just to query whether
> > > the kernel allows the page to be mapped or not (aka, the "real" shared /
> > > private status of a guest page).  I guess that's not directly relevant to
> > > the faultability to be introduced here, but if you or anyone know please
> > > kindly share, I'd love to learn about it.
> > 
> > The userspace VMM will track the initial shared/private state, in the
> > sense that when the VM is created, the mem_attr_array is initialized
> > such that the guest pages are all shared.
> > 
> > Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
> > it should record all changes so it knows what the state is in the
> > kernel.
> > 
> > Even if userspace VMM doesn't record the state properly, if the
> > KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
> > (e.g. setting an already private page to private), it will just be a
> > no-op in the kernel.
> > 
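
[Illustration] Since there is no "get" API, the bookkeeping described above
has to live in the VMM. A minimal sketch of such a tracker follows, assuming
a small fixed gfn space; the names (attr_tracker, tracker_*) are hypothetical
and the actual ioctl call is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TRACK_NR_GFNS 64

/* Userspace shadow of the kernel's per-gfn private/shared state. */
struct attr_tracker {
	bool is_private[TRACK_NR_GFNS];
};

/* All pages start out shared, mirroring the initial state described above. */
static void tracker_init(struct attr_tracker *t)
{
	memset(t->is_private, 0, sizeof(t->is_private));
}

/*
 * Record a conversion of gfns [start, start + nr).
 * Returns the number of gfns whose state actually changed, or -1 if the
 * range is out of bounds. A real VMM would issue KVM_SET_MEMORY_ATTRIBUTES
 * first and only update the tracker if that succeeds (assumption).
 */
static int tracker_set(struct attr_tracker *t, uint64_t start, uint64_t nr,
		       bool make_private)
{
	uint64_t i;
	int changed = 0;

	if (start > TRACK_NR_GFNS || nr > TRACK_NR_GFNS - start)
		return -1;

	for (i = start; i < start + nr; i++) {
		if (t->is_private[i] != make_private) {
			t->is_private[i] = make_private;
			changed++;
		}
	}
	return changed;
}
```

A return value of 0 corresponds to the no-op case mentioned above, e.g.
setting an already private range to private again.
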
> > > > 
> > > > Perhaps Fuad can elaborate more here.
> > > > 
> > > > Reason (b): In this patch series (mostly focused on x86 first), we're
> > > > using faultability to prevent any future faults before checking that
> > > > there are no mappings.
> > > > 
> > > > Having a different xarray from mem_attr_array allows us to disable
> > > > faulting before committing to changing mem_attr_array. Please see
> > > > `kvm_gmem_should_set_attributes_private()` in this patch [2].
> > > > 
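
[Illustration] The ordering described in reason (b), i.e. disable faulting
first, then check for mappings, then commit, can be sketched as a toy
userspace model. Everything here is an assumption for illustration: plain
arrays stand in for the faultability xarray and mem_attr_array, mapcount is
a simple counter, locking is omitted, and the rollback on failure is my
guess at a sensible policy rather than something confirmed from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define CONV_NR_PAGES 8

struct conv_model {
	bool faultable[CONV_NR_PAGES];	/* stand-in for the faultability xarray */
	bool is_private[CONV_NR_PAGES];	/* stand-in for mem_attr_array */
	int mapcount[CONV_NR_PAGES];	/* userspace mappings per page */
};

/* Userspace fault: only succeeds while the page is faultable. */
static int conv_fault(struct conv_model *m, size_t index)
{
	if (index >= CONV_NR_PAGES || !m->faultable[index])
		return -1;
	m->mapcount[index]++;
	return 0;
}

/* Returns 0 on success, -1 if the range is invalid or still mapped. */
static int conv_to_private(struct conv_model *m, size_t start, size_t nr)
{
	bool was_faultable[CONV_NR_PAGES];
	size_t i;

	if (start > CONV_NR_PAGES || nr > CONV_NR_PAGES - start)
		return -1;

	/* Step 1: block new faults before looking at existing mappings. */
	for (i = start; i < start + nr; i++) {
		was_faultable[i] = m->faultable[i];
		m->faultable[i] = false;
	}

	/* Step 2: with faults blocked, any mapping found predates step 1. */
	for (i = start; i < start + nr; i++) {
		if (m->mapcount[i] > 0)
			goto rollback;
	}

	/* Step 3: nothing is mapped, so commit the attribute change. */
	for (i = start; i < start + nr; i++)
		m->is_private[i] = true;
	return 0;

rollback:
	/* Restore the previous faultability and leave attributes untouched. */
	for (i = start; i < start + nr; i++)
		m->faultable[i] = was_faultable[i];
	return -1;
}
```

Keeping the two arrays separate is what lets step 1 happen without
touching the attribute state until step 3.
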
> > > > We're not completely sure about the effectiveness of using faultability
> > > > to block off future faults here; in future revisions we may be using a
> > > > different approach. The folio_lock() is probably important if we need to
> > > > check mapcount. Please let me know if you have any ideas!
> > > > 
> > > > The starting point of having a different xarray was pKVM's requirement
> > > > of having separate xarrays, and we later realized that the xarray could
> > > > be used for reason (b). For x86 we could perhaps eventually remove the
> > > > second xarray? Not sure as of now.
> > > 
> > > Just had a quick look at patch 27:
> > > 
> > > https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@xxxxxxxxxx/
> > > 
> > > I'm not yet sure what's protecting from faultability being modified against
> > > a concurrent fault().
> > > 
> > > I wonder whether one can use the folio lock to serialize that, so that one
> > > needs to take the folio lock to modify/lookup the folio's faultability,
> > > then it may naturally match with the fault() handler design, where
> > > kvm_gmem_get_folio() needs to lock the page first.
> > > 
> > > But then kvm_gmem_is_faultable() will need to also be called only after the
> > > folio is locked to avoid races.
> > 
> > My bad. In our rush to get this series out before LPC, the patch series
> > was not organized very well. Patch 39 [2] adds the
> > lock. filemap_invalidate_lock_shared() should make sure that faulting
> > doesn't race with faultability updates.
> > 
> > > > > The latter is per-slot, so one level higher, however I don't think it's a
> > > > > common use case for mapping the same gmemfd in multiple slots anyway for
> > > > > KVM (besides corner cases like live upgrade).  So perhaps this is not about
> > > > > layering but something else?  For example, any use case where PRIVATE and
> > > > > FAULTABLE can be reported with different values.
> > > > > 
> > > > > Another higher level question is, is there any plan to support non-CoCo
> > > > > context for 1G?
> > > > 
> > > > I believe guest_memfd users are generally in favor of eventually using
> > > > guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
> > > > in the case of CoCo) page support.
> > > > 
> > > > However, core-mm's fault path does not support mapping at anything
> > > > higher than the PMD level (other than hugetlb_fault(), which the
> > > > community wants to move away from), so core-mm wouldn't be able to map
> > > > 1G pages taken from HugeTLB.
> > > 
> > > Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
> > > referring to some other challenges?
> > > 
> > 
> > IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
> > PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
> > core-mm using vm_operations_struct.huge_fault().
> 
> 
> Just to clarify a bit for Peter: as has been discussed previously, there are
> rather big differences between CoCo and non-CoCo VMs.
> 
> In CoCo VMs, the primary portion of all pages are private, and they are not
> mapped into user space. Only a handful of pages are commonly shared and
> mapped into user space.
> 
> In non-CoCo VMs, all pages are shared and (for the time being) all pages are
> mapped into user space from where KVM will consume them.
> 
> 
> Installing pmd/pud mappings into user space (recall: shared memory only) is
> currently not really a requirement for CoCo VMs, and therefore not the focus
> of this work.
> 
> Further, it's currently considered to be incompatible with getting in-place
> private<->shared conversion on *page* granularity right, as we will be
> exposing huge/gigantic folios via individual small folios to core-MM.
> Mapping a PMD/PUD that is composed of multiple folios into core-mm is not
> going to fly, unless using a PFNMAP, which has been briefly discussed as
> well, but disregarded so far (no page pinning support).
> 
> So in the context of this work here, huge faults and PUD/PMD *user space
> page tables* do not apply.
> 
> For non-CoCo VMs there is no in-place conversion problem. One could use the
> same CoCo implementation, but without user space pud/pmd mappings. KVM and
> VFIO would have to consume this memory via the guest_memfd in memslots
> instead of via the user space mappings to more easily get PMD/PUD mappings
> into the secondary MMU. And the downsides would be sacrificing the vmemmap

Is there a chance that, once !CoCo is supported, external modules
(e.g. VFIO) can reuse the old user mappings, just like before gmemfd?

To support CoCo, I understand gmem+offset is required all over the place.
However, in a non-CoCo context, I wonder whether the other modules are
required to stick with gmem+offset, or whether they can reuse the old
VA-based ways, because how it works can fundamentally be the same as before,
except that the folios will now be managed by gmemfd.

I think the good thing about such an approach is that, when developing CoCo
support for all these modules, there are fewer constraints / concerns about
staying compatible with the non-CoCo use case. It would also make gmemfd
easier to use in production before all CoCo facilities are ready, as most of
the infrastructure is already around and has been used for years, provided
the VA can be mapped and GUPed like before.

Thanks,

> optimization and PMD/PUD user space mappings, while at the same time
> benefiting from being able to easily map only parts of a huge/gigantic page
> into user space.
> 
> 
> So I consider pmd/pud user space mappings for non-CoCo an independent work
> item, not something that is part of the current effort of huge/gigantic
> pages with in-place conversion at page granularity for CoCo VMs.
> 
> 
> More information is available in the bi-weekly upstream MM meeting (that was
> recorded) and the LPC talks, where most of that has been discussed.
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

-- 
Peter Xu