On Sat, Aug 17, 2024 at 12:03:50AM +0200, David Hildenbrand wrote: > On 16.08.24 23:52, Ackerley Tng wrote: > > David Hildenbrand <david@xxxxxxxxxx> writes: > > > > > On 16.08.24 19:45, Ackerley Tng wrote: > > > > > > > > <snip> > > > > > > > > IIUC folio_lock() isn't a prerequisite for taking a refcount on the > > > > folio. > > > > > > Right, to do folio_lock() you only have to guarantee that the folio > > > cannot get freed concurrently. So you piggyback on another reference > > > (you hold indirectly). > > > > > > > > > > > Even if we are able to figure out a "safe" refcount, and check that the > > > > current refcount == "safe" refcount before removing from direct map, > > > > what's stopping some other part of the kernel from taking a refcount > > > > just after the check happens and causing trouble with the folio's > > > > removal from direct map? > > > > > > Once the page was unmapped from user space, and there were no additional > > > references (e.g., GUP, whatever), any new references can only be > > > (should, unless BUG :) ) temporary speculative references that should > > > not try accessing page content, and that should back off if the folio is > > > not deemed interesting or cannot be locked. (e.g., page > > > migration/compaction/offlining). > > > > I thought about it again - I think the vmsplice() cases are taken care > > of once we check that the folios are not mapped into userspace, since > > vmsplice() reads from a mapping. > > > > splice() reads from the fd directly, but that's taken care since > > guest_memfd doesn't have a .splice_read() handler. > > > > Reading /proc/pid/mem also requires the pages to first be mapped, IIUC, > > otherwise the pages won't show up, so checking that there are no more > > mappings to userspace takes care of this. > > You have a misconception. > > You can map pages to user space, GUP them, and then unmap them from user > space. A GUP reference can outlive your user space mappings, easily. > > So once there is a raised refcount, it could as well just be from vmsplice, > or a pending reference from /proc/pid/mem, O_DIRECT, ... > > > > > > > > > Of course, there are some corner cases (kgdb, hibernation, /proc/kcore), > > > but most of these can be dealt with in one way or the other (make these > > > back off and not read/write page content, similar to how we handled it > > > for secretmem). > > > > Does that really leave us with these corner cases? And so perhaps we > > could get away with just taking the folio_lock() to keep away the > > speculative references? So something like > > > > 1. Check that the folio is not mapped and not pinned. > > To do that, you have to lookup the folio first. That currently requires a > refcount increment, even if only temporarily. Maybe we could avoid that, if > we can guarantee that we are the only one modifying the pageache here, and > we sync against that ourselves. > > > 2. folio_lock() all the folios about to be removed from direct map > > -- With the lock, all other accesses should be speculative -- > > 3. Check that the refcount == "safe" refcount > > 3a. Unlock and return to userspace with -EAGAIN > > 4. Remove from direct map > > 5. folio_unlock() all those folios > > > > Perhaps a very naive question: can the "safe" refcount be statically > > determined by walking through the code and counting where refcount is > > expected to be incremented? > > > Depends on how we design it. But if you hand out "safe" references to KVM > etc, you'd have to track that -- and how often -- somehow. At which point we > are at "increment/decrement" safe reference to track that for you. > Just a status update: I've gotten the "safe" reference counter implementation working for Gunyah now. It feels a bit flimsy because we're juggling 3 reference counters*, but it seems like the right thing to do after all the discussions here. It's passing all the Gunyah unit tests I have which have so far been pretty good at finding issues. I need to clean up the patches now and I'm aiming to have it out for RFC next week. * folio refcount, "accessible" refcount, and "safe" refcount Thanks, Elliot