On 31.01.22 18:23, Nadav Amit wrote: > >> On Jan 31, 2022, at 2:42 AM, Mike Rapoport <rppt@xxxxxxxxxx> wrote: >> >> Hi Nadav, >> >> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: >>> Using userfautlfd and looking at the kernel code, I encountered a usability >>> issue that complicates userspace UFFD-monitor implementation. I obviosuly >>> might be wrong, so I would appreciate a (polite?) feedback. I do have a >>> userspace workaround, but I thought it is worthy to share and to hear your >>> opinion, as well as feedback from other UFFD users. >>> >>> The issue I encountered regards the ordering of UFFD events tbat might not >>> reflect the actual order in which events took place. >>> >>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against >>> themselves [*]. The mm-lock is dropped before notifying the userspace >>> UFFD-monitor, and therefore there is no guarantee as to whether the order of >>> the events actually reflects the order in which the events took place. >>> This can prevent a UFFD-monitor from using the events to track which >>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a >>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can >>> be reordered, if the events are triggered by two different threads. In >>> this case the UFFD-monitor cannot figure from the events whether the >>> child process has the unmapped memory range still mapped (because fork >>> happened first) or not. >> >> Yeah, it seems that something like this is possible: >> >> >> fork() munmap() >> mmap_write_unlock(); >> mmap_write_lock_killable(); >> do_things(); >> mmap_{read,write}_unlock(); >> userfaultfd_unmap_complete(); >> dup_userfaultfd_complete(); >> >> A solution could be to split uffd_*_complete() to two parts: one that >> queues up the event message and the second one that waits for it to be read >> by the monitor. The first part then can run befor mm-lock is released. >> >> If you can think of something nicer, it'll be really great! > > Thanks for the quick response. Your solution is possible, but then the > order between events and page-faults is certainly not kept - as David > mentioned: regardless of mm-lock that is not always taken for write, > events and page-faults are on two separate lists, and queued page-faults > are reported before events. Of course, for the issue I brought up (if it's a real issue), the question is if we could "adjust the documentation" to state that there are no ordering guarantees. IMHO at least the fork()+munmap() needs a proper fix, because otherwise, we might really end up with an API that's partially useless -- as you correctly state. > > I am also not sure how simple/performant it is, since it would require > an additional refcount for userfaultfd_wait_queue to prevent it from > disappearing between the time it is enqueued to the time it blocks. > > Another option is to associate some “generation” or “sequence number” > with every event and change the PAI to include it. It still leaves the > problem of ordering MADV_DONTNEED and page-faults though. > My first thought was to include a timestamp. But requiring user space to restore the order based on a timestamp might be really ... weird. -- Thanks, David / dhildenb