On Wed, Aug 07, 2024 at 11:57:35AM +0100, Patrick Roy wrote:
> On Wed, 2024-08-07 at 07:48 +0100, Patrick Roy wrote:
> > On Tue, 2024-08-06 at 21:13 +0100, Elliot Berman wrote:
> >> On Tue, Aug 06, 2024 at 04:39:24PM +0100, Patrick Roy wrote:
> >>> On the other hand, as Paolo pointed out in my patches [1], just using a
> >>> page flag to track direct map presence for gmem is not enough. We
> >>> actually need to keep a refcount in folio->private to keep track of how
> >>> many different actors request a folio's direct map presence (in the
> >>> specific case in my patch series, it was different gfn_to_pfn_caches for
> >>> the kvm-clock structures of different vcpus, which the guest can place
> >>> into the same gfn). While this might not be a concern for the
> >>> pKVM/Gunyah case, where the guest dictates memory state, it's required
> >>> for the non-CoCo case where KVM/userspace can set arbitrary guest gfns
> >>> to shared if it needs/wants to access them for whatever reason. So for
> >>> this we'd need to have PG_private=1 mean "direct map entry restored"
> >>> (since if PG_private=0, there is no folio->private).
> >>>
> >>> [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@xxxxxxxxxxxx/T/#m0608c4b6a069b3953d7ee97f48577d32688a3315
> >>>
> >>
> >> I wonder if we can use the folio refcount itself, assuming we can rely
> >> on refcount == 1 meaning we can do the shared->private conversion.
> >>
> >> In gpc_map_gmem, we convert private->shared. There's no problem here in
> >> the non-CoCo case.
> >>
> >> In gpc_unmap, we *try* to convert back from shared->private. If
> >> refcount > 2, then the conversion would fail. The last gpc_unmap would
> >> be able to successfully convert back to private.
> >>
> >> Do you see any concerns with this approach?
> >
> > The gfn_to_pfn_cache does not keep an elevated refcount on the cached
> > page, and instead responds to MMU notifiers to detect whether the cached
> > translation has been invalidated, iirc. So the folio refcount will
> > not reflect the number of gpcs holding that folio.
>

Ah, fair enough. This is kinda like a GUP pin which would prevent us
from making the page private, but without the pin part.
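To make the counting scheme described above a bit more concrete, a
minimal sketch could look like the following. This is not code from
either series: the helper names are made up, only order-0 folios are
handled, TLB flushing is omitted, and the caller is assumed to hold the
folio lock.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/set_memory.h>

/*
 * Sketch only: PG_private set means "direct map entries currently
 * restored", and folio->private counts how many users asked for that.
 */
static int gmem_folio_share(struct folio *folio)
{
	unsigned long refs;
	int r;

	if (!folio_test_private(folio)) {
		r = set_direct_map_default_noflush(folio_page(folio, 0));
		if (r)
			return r;
		/* sets PG_private and stores the initial count of 1 */
		folio_attach_private(folio, (void *)1UL);
		return 0;
	}

	refs = (unsigned long)folio_get_private(folio);
	folio_change_private(folio, (void *)(refs + 1));
	return 0;
}

static int gmem_folio_unshare(struct folio *folio)
{
	unsigned long refs = (unsigned long)folio_get_private(folio);

	if (refs > 1) {
		folio_change_private(folio, (void *)(refs - 1));
		return 0;
	}

	/* last user gone: clear PG_private and zap the direct map again */
	folio_detach_private(folio);
	return set_direct_map_invalid_noflush(folio_page(folio, 0));
}

The gpc map/unmap paths could then bump and drop this count instead of
relying on the folio refcount.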
[...]

> >>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
> >>>> {
> >>>> +	unsigned long gmem_flags = (unsigned long)file->private_data;
> >>>>  	struct inode *inode = file_inode(file);
> >>>>  	struct guest_memfd_operations *ops = inode->i_private;
> >>>>  	struct folio *folio;
> >>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
> >>>>  			goto out_err;
> >>>>  	}
> >>>>
> >>>> +	if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
> >>>> +		r = guest_memfd_folio_private(folio);
> >>>> +		if (r)
> >>>> +			goto out_err;
> >>>> +	}
> >>>> +
> >>>
> >>> How does a caller of guest_memfd_grab_folio know whether a folio needs
> >>> to be removed from the direct map? E.g. how can a caller know ahead of
> >>> time whether guest_memfd_grab_folio will return a freshly allocated
> >>> folio (which thus needs to be removed from the direct map), vs a folio
> >>> that already exists and has been removed from the direct map (probably
> >>> fine to remove from direct map again), vs a folio that already exists
> >>> and is currently re-inserted into the direct map for whatever reason
> >>> (must not remove these from the direct map, as other parts of
> >>> KVM/userspace probably don't expect the direct map entries to disappear
> >>> from underneath them). I couldn't figure this one out for my series,
> >>> which is why I went with hooking into the PG_uptodate logic to always
> >>> remove direct map entries on freshly allocated folios.
> >>>
> >>
> >> gmem_flags come from the owner. If the caller (in non-CoCo case) wants
>
> Ah, oops, I got it mixed up with the new `flags` parameter.
>
> >> to restore the direct map right away, it'd have to be a direct
> >> operation. As an optimization, we could add an option that asks for the
> >> page in "shared" state. If allocating a new page, we can return it right
> >> away without removing it from the direct map. If grabbing an existing
> >> folio, it would try to do the private->shared conversion.
>
> My concern is more with the implicit shared->private conversion that
> happens on every call to guest_memfd_grab_folio (and thus
> kvm_gmem_get_pfn) when grabbing existing folios. If something else
> marked the folio as shared, then we cannot punch it out of the direct
> map again until that something is done using the folio (when working on
> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were
> temporarily marked shared, as I was seeing panics because of this). And
> if the folio is currently private, there's nothing to do. So either way,
> guest_memfd_grab_folio shouldn't touch the direct map entry for existing
> folios.
>

What I did could be documented/commented better. If ops->accessible()
is *not* provided, all guest_memfd allocations are immediately removed
from the direct map and treated as guest-private (the goal is to match
what KVM does today on tip). If ops->accessible() is provided, then
guest_memfd allocations start as "shared", and KVM/Gunyah need to do
the shared->private conversion when they want the folio to become
private. "Shared" is the default because that is effectively a no-op.

For the non-CoCo case you're interested in, we'd have ops->accessible()
provided, and we wouldn't pull out the direct map from the gpc.

Thanks,
Elliot
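To illustrate the allocation-time policy described above, a rough,
untested sketch follows. gmem_init_new_folio() is a made-up name, while
guest_memfd_operations, ->accessible() and guest_memfd_folio_private()
are the names already used in this thread.

/*
 * Sketch: without ->accessible(), a freshly allocated folio is
 * immediately made guest-private by removing its direct map entries;
 * with ->accessible(), it starts out shared, which is a no-op here.
 */
static int gmem_init_new_folio(const struct guest_memfd_operations *ops,
			       struct folio *folio)
{
	/* owner implements ->accessible(): start shared, convert later */
	if (ops->accessible)
		return 0;

	/* no ->accessible(): match today's KVM behaviour, start private */
	return guest_memfd_folio_private(folio);
}

The explicit shared->private conversion for the ->accessible() case
would then happen outside guest_memfd_grab_folio(), so grabbing an
existing folio never touches its direct map entries.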