On Thu, 2024-08-08 at 23:16 +0100, Elliot Berman wrote:
> On Thu, Aug 08, 2024 at 02:05:55PM +0100, Patrick Roy wrote:
>> On Wed, 2024-08-07 at 20:06 +0100, Elliot Berman wrote:
>>>>>>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
>>>>>>>> {
>>>>>>>> +        unsigned long gmem_flags = (unsigned long)file->private_data;
>>>>>>>>          struct inode *inode = file_inode(file);
>>>>>>>>          struct guest_memfd_operations *ops = inode->i_private;
>>>>>>>>          struct folio *folio;
>>>>>>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
>>>>>>>>                  goto out_err;
>>>>>>>>          }
>>>>>>>>
>>>>>>>> +        if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
>>>>>>>> +                r = guest_memfd_folio_private(folio);
>>>>>>>> +                if (r)
>>>>>>>> +                        goto out_err;
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>
>>>>>>> How does a caller of guest_memfd_grab_folio know whether a folio needs
>>>>>>> to be removed from the direct map? E.g. how can a caller know ahead of
>>>>>>> time whether guest_memfd_grab_folio will return a freshly allocated
>>>>>>> folio (which thus needs to be removed from the direct map), vs. a folio
>>>>>>> that already exists and has been removed from the direct map (probably
>>>>>>> fine to remove from the direct map again), vs. a folio that already
>>>>>>> exists and is currently re-inserted into the direct map for whatever
>>>>>>> reason (these must not be removed from the direct map, as other parts
>>>>>>> of KVM/userspace probably don't expect the direct map entries to
>>>>>>> disappear from underneath them). I couldn't figure this one out for my
>>>>>>> series, which is why I went with hooking into the PG_uptodate logic to
>>>>>>> always remove direct map entries on freshly allocated folios.
>>>>>>>
>>>>>>
>>>>>> gmem_flags come from the owner. If the caller (in the non-CoCo case) wants
>>>>
>>>> Ah, oops, I got it mixed up with the new `flags` parameter.
>>>>
>>>>>> to restore the direct map right away, it'd have to be a direct
>>>>>> operation. As an optimization, we could add an option that asks for the
>>>>>> page in "shared" state. If allocating a new page, we can return it
>>>>>> right away without removing it from the direct map. If grabbing an
>>>>>> existing folio, it would try to do the private->shared conversion.
>>>>
>>>> My concern is more with the implicit shared->private conversion that
>>>> happens on every call to guest_memfd_grab_folio (and thus
>>>> kvm_gmem_get_pfn) when grabbing existing folios. If something else
>>>> marked the folio as shared, then we cannot punch it out of the direct
>>>> map again until that something is done using the folio (when working on
>>>> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were
>>>> temporarily marked shared; I was seeing panics because of this). And
>>>> if the folio is currently private, there's nothing to do. So either way,
>>>> guest_memfd_grab_folio shouldn't touch the direct map entry for existing
>>>> folios.
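
(To make the "hooking into the PG_uptodate logic" part above concrete: in
my RFC the grab path does something like the below - only a sketch,
assuming order-0 folios and borrowing the gmem_flags check from your
diff, not the exact code:)

        if (!folio_test_uptodate(folio)) {
                /* Freshly allocated folio: zero it before anything else. */
                clear_highpage(folio_page(folio, 0));

                /*
                 * Only here, while the folio is known to be fresh, do we
                 * punch out the direct map entry. Folios that are already
                 * uptodate are never touched, whatever their current
                 * shared/private state.
                 */
                if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
                        r = set_direct_map_invalid_noflush(folio_page(folio, 0));
                        if (r)
                                goto out_err;
                }

                folio_mark_uptodate(folio);
        }

That way the "existing folio that is currently shared" case never even
reaches the direct map code.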
>>>
>>> What I did could be documented/commented better.
>>
>> No worries, thanks for taking the time to walk me through understanding
>> it!
>>
>>> If ops->accessible() is *not* provided, all guest_memfd allocations
>>> will immediately be removed from the direct map and treated like guest
>>> private (the goal is to match what KVM does today on tip).
>>
>> Ah, so if ops->accessible() is not provided, then there will never be
>> any shared memory inside gmem (like today, where gmem doesn't support
>> shared memory altogether), and thus there are no problems with just
>> unconditionally doing set_direct_map_invalid_noflush in
>> guest_memfd_grab_folio, because all existing folios already have their
>> direct map entries removed. Got it!
>>
>>> If ops->accessible() is provided, then guest_memfd allocations start
>>> as "shared" and KVM/Gunyah need to do the shared->private conversion
>>> when they want the folio to become private. "Shared" is the default
>>> because that is effectively a no-op.
>>> For the non-CoCo case you're interested in, we'd have the
>>> ops->accessible() provided and we wouldn't pull out the direct map
>>> from the gpc.
>>
>> So in pKVM/Gunyah's case, guest memory starts as shared, and at some
>> point the guest will issue a hypercall (or similar) to flip it to
>> private, at which point it'll get removed from the direct map?
>>
>> That isn't really what we want for our case. We consider the folios
>> private straight away, as we do not let the guest control their state
>> at all. Everything is always "accessible" to both KVM and userspace in
>> the sense that they can just flip gfns to shared as they please without
>> the guest having any say in it.
>>
>> I think we should untangle the behavior of guest_memfd_grab_folio from
>> the presence of ops->accessible. E.g. instead of direct map removal
>> being dependent on ops->accessible, we should have some
>> GRAB_FOLIO_RETURN_SHARED flag in gmem_flags, which is set for y'all and
>> not set for us. (I don't think we should have a "call
>> set_direct_map_invalid_noflush unconditionally in
>> guest_memfd_grab_folio" mode at all: if sharing gmem is supported, that
>> is broken, and if sharing gmem is not supported, then only removing
>> direct map entries for freshly allocated folios gets us the same result
>> of "all folios never in the direct map" while avoiding some no-op
>> direct map operations.)
>>
>> We would still use ->accessible, albeit for us that would be more for
>> bookkeeping along the lines of "which gfns does userspace currently
>> require to be in the direct map?". I haven't completely thought it
>> through, but what I could see working for us would be a pair of ioctls
>> for marking ranges accessible/inaccessible, with "accessibility" stored
>> in some xarray (somewhat like Fuad's patches, I guess? [1]).
>>
>> In a world where we have a "sharing refcount", the "make accessible"
>> ioctl reinserts the folios into the direct map (if needed), lifts the
>> sharing refcount for each folio in the given gfn range, and marks the
>> range as accessible. The "make inaccessible" ioctl would first check
>> that userspace has unmapped all those gfns again, and if yes, mark them
>> as inaccessible, drop the sharing refcount by 1 for each, and remove
>> them from the direct map again if it held the last reference (if
>> userspace still has some gfns mapped, the ioctl would just fail).
>>
>
> I am warming up to the sharing refcount idea. How does the sharing
> refcount look for kvm gpc?

I've come up with the below rough draft (written as a new commit on top
of my RFC series [1], with some bits from your patch copied in). With
this, I was able to actually boot a Firecracker VM with multiple vCPUs
(which previously didn't work, because different vCPUs put their
kvm-clock structures into the same guest page).
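
To give the gist before the patch: the sharing refcount ends up as a
per-folio count, along the lines of the below (a hand-wavy sketch with
made-up helper names, locking elided, and the count assumed to be
stashed in folio->private when the folio is first grabbed; the actual
draft differs in the details):

static int gmem_folio_share(struct folio *folio)
{
        refcount_t *sharing_count = folio_get_private(folio);

        /* 0 -> 1 transition: reinsert the folio into the direct map. */
        if (!refcount_inc_not_zero(sharing_count)) {
                int r = set_direct_map_default_noflush(folio_page(folio, 0));

                if (r)
                        return r;
                refcount_set(sharing_count, 1);
        }

        return 0;
}

static int gmem_folio_unshare(struct folio *folio)
{
        refcount_t *sharing_count = folio_get_private(folio);

        /* 1 -> 0 transition: last sharer gone, remove from the direct map. */
        if (refcount_dec_and_test(sharing_count))
                return set_direct_map_invalid_noflush(folio_page(folio, 0));

        return 0;
}

For the gpc side, the idea would be that the gpc map path calls
gmem_folio_share() on gmem pfns and the unmap/invalidate path calls
gmem_folio_unshare(), with the accessible/inaccessible ioctls using the
same pair, so that direct map entries only ever flip on the 0<->1
transitions.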
Best,
Patrick

[1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@xxxxxxxxxxxx/T/#ma44793da6bc000a2c22b1ffe37292b9615881838

---