On Thu, Aug 08, 2024 at 11:55:15PM +0200, David Hildenbrand wrote: > On 08.08.24 23:41, Elliot Berman wrote: > > On Wed, Aug 07, 2024 at 06:12:00PM +0200, David Hildenbrand wrote: > > > On 06.08.24 19:14, Elliot Berman wrote: > > > > On Tue, Aug 06, 2024 at 03:51:22PM +0200, David Hildenbrand wrote: > > > > > > - if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { > > > > > > + if (!ops->accessible && (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)) { > > > > > > r = guest_memfd_folio_private(folio); > > > > > > if (r) > > > > > > goto out_err; > > > > > > @@ -107,6 +109,82 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags > > > > > > } > > > > > > EXPORT_SYMBOL_GPL(guest_memfd_grab_folio); > > > > > > +int guest_memfd_make_inaccessible(struct file *file, struct folio *folio) > > > > > > +{ > > > > > > + unsigned long gmem_flags = (unsigned long)file->private_data; > > > > > > + unsigned long i; > > > > > > + int r; > > > > > > + > > > > > > + unmap_mapping_folio(folio); > > > > > > + > > > > > > + /** > > > > > > + * We can't use the refcount. It might be elevated due to > > > > > > + * guest/vcpu trying to access same folio as another vcpu > > > > > > + * or because userspace is trying to access folio for same reason > > > > > > > > > > As discussed, that's insufficient. We really have to drive the refcount to 1 > > > > > -- the single reference we expect. > > > > > > > > > > What is the exact problem you are running into here? Who can just grab a > > > > > reference and maybe do nasty things with it? > > > > > > > > > > > > > Right, I remember we had discussed it. The problem I faced was if 2 > > > > vcpus fault on same page, they would race to look up the folio in > > > > filemap, increment refcount, then try to lock the folio. One of the > > > > vcpus wins the lock, while the other waits. The vcpu that gets the > > > > lock vcpu will see the elevated refcount. > > > > > > > > I was in middle of writing an explanation why I think this is best > > > > approach and realized I think it should be possible to do > > > > shared->private conversion and actually have single reference. There > > > > would be some cost to walk through the allocated folios and convert them > > > > to private before any vcpu runs. The approach I had gone with was to > > > > do conversions as late as possible. > > > > > > We certainly have to support conversion while the VCPUs are running. > > > > > > The VCPUs might be able to avoid grabbing a folio reference for the > > > conversion and only do the folio_lock(): as long as we have a guarantee that > > > we will disallow freeing the folio in gmem, for example, by syncing against > > > FALLOC_FL_PUNCH_HOLE. > > > > > > So if we can rely on the "gmem" reference to the folio that cannot go away > > > while we do what we do, we should be fine. > > > > > > <random though> > > > > > > Meanwhile, I was thinking if we would want to track the references we > > > hand out to "safe" users differently. > > > > > > Safe references would only be references that would survive a > > > private<->shared conversion, like KVM MMU mappings maybe? > > > > > > KVM would then have to be thought to return these gmem references > > > differently. > > > > > > The idea would be to track these "safe" references differently > > > (page->private?) and only allow dropping *our* guest_memfd reference if all > > > these "safe" references are gone. That is, FALLOC_FL_PUNCH_HOLE would also > > > fail if there are any "safe" reference remaining. > > > > > > <\random though> > > > > > > > I didn't find a path in filemap where we can grab folio without > > increasing its refcount. I liked the idea of keeping track of a "safe" > > refcount, but I believe there is a small window to race comparing the > > main folio refcount and the "safe" refcount. > > There are various possible models. To detect unexpected references, we could > either use > > folio_ref_count(folio) == gmem_folio_safe_ref_count(folio) + 1 > > [we increment both ref counter] > > or > > folio_ref_count(folio) == 1 > > [we only increment the safe refcount and let other magic handle it as > described] > > A vcpu could have > > incremented the main folio refcount and on the way to increment the safe > > refcount. Before that happens, another thread does the comparison and > > sees a mismatch. > > Likely there won't be a way around coming up with code that is able to deal > with such temporary, "speculative" folio references. > > In the simplest case, these references will be obtained from our gmem code > only, and we'll have to detect that it happened and retry (a seqcount would > be a naive solution). > > In the complex case, these references are temporarily obtained from other > core-mm code -- using folio_try_get(). We can minimize some of them > (speculative references from GUP or the pagecache), and try optimizing > others (PFN walkers like page migration). > > But likely we'll need some retry magic, at least initially. > I thought retry magic would not fly. I'll try this out. Thanks, Elliot