> On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> Thanks for explaining.
>
> I also digged out the discussion threads between you and Mike and that's a good
> one too summarizing the problems:
>
> https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@xxxxxxxxx/
>
> Scenario 4 is kind of special imho along all those, because that's the only one
> that can be workarounded by user application by only copying pages one by one.
> I know you were even leveraging iouring in your local tree, so that's probably
> not a solution at all for you. But I'm just trying to start thinking without
> that scenario for now.
>
> Per my understanding, a major issue regarding the rest of the scenarios is
> ordering of uffd messages may not match with how things are happening. This
> actually contains two problems.
>
> First of all, mmap_sem is mostly held read for all page faults and most of the
> mm changes except e.g. fork, then we can never serialize them. Not to mention
> uffd events releases mmap_sem within prep and completion. Let's call it
> problem 1.
>
> The other problem 2 is we can never serialize faults against events.
>
> For problem 1, I do sense something that mmap_sem is just not suitable for uffd
> scenario. Say, we grant concurrent with most of the events like dontneed and
> mremap, but when uffd ordering is a concern we may not want to grant that
> concurrency. I'm wondering whether it means uffd may need its own semaphore to
> achieve this. So for all events that uffd cares we take write lock on a new
> uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of
> events, not until completion (the message is read). It'll slow down uffd
> tracked systems but guarantees ordering.

Peter,

Thanks for finding the time and looking into the issues that I encountered.

Your approach sounds possible, but acquiring uffd_sem after mmap_lock seems
unsafe to me, since it might cause deadlocks (e.g., if a process uses events to
manage its own memory).

>
> At the meantime, I'm wildly thinking whether we can tackle with the other
> problem by merging the page fault queue with the event queue, aka, event_wqh
> and fault_pending_wqh. Obviously we'll need to identify the messages when
> read() and conditionally move then into fault_wqh only if they come from page
> faults, but that seems doable?

This, I guess, would be needed in addition to your aforementioned semaphore
proposal; together they could do the trick.

While I have your attention, let me share some other challenges I encountered
while using userfaultfd. They might be unrelated, but perhaps you can keep them
in the back of your mind. Nobody should suffer as I did ;-)

1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
it should be, especially when using io-uring as I wish to do. I think it is not
too hard to address by changing the API. For instance, if the uffd-ctx had a
uffd-generation that increased on each event, the user could provide an
ioctl-generation as part of the copy/zero/etc ioctls, and the kernel would fail
the operation only if the uffd-generation is greater than the one the user
provided.
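To make the idea a bit more concrete, here is a rough sketch of what I have in
mind. The struct, field, and helper names are made up for illustration; nothing
like this exists in the uapi or in fs/userfaultfd.c today:

	/*
	 * Hypothetical sketch only.  The kernel would bump a per-ctx
	 * generation on every event (fork, remap, remove, ...); userspace
	 * echoes back the last generation it observed, and the ioctl fails
	 * with -EAGAIN only when the context has moved past that generation,
	 * instead of failing unconditionally while mmap_changing is set.
	 */
	struct uffdio_copy_gen {		/* hypothetical uffdio_copy extension */
		__u64 dst;
		__u64 src;
		__u64 len;
		__u64 mode;
		__u64 generation;	/* last uffd generation observed by the caller */
		__s64 copy;		/* bytes copied, or negative errno */
	};

	/* Hypothetical kernel-side check replacing the mmap_changing test. */
	static bool uffd_generation_stale(struct userfaultfd_ctx *ctx, __u64 seen)
	{
		return atomic64_read(&ctx->generation) > seen;
	}

With something along these lines, an io-uring submission could carry the
generation the monitor last saw and simply be resubmitted if it raced with an
event, instead of every copy/zero failing with -EAGAIN whenever any event is
pending.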
2. userfaultfd is separate from the other tracing/instrumentation mechanisms in
the kernel. I, for instance, also wanted to track mmap events (let's put aside
for a second why). Tracking these events can be done with ptrace or
perf_event_open(), but then it is hard to correlate them with the userfaultfd
ones. It would have been easier for users, I think, if userfaultfd
notifications were also provided through the ptrace/tracepoints mechanisms.

3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
concurrently. This seems like a general problem that I believe ptrace suffers
from too. I know it might seem far-fetched to have two monitors at the moment,
but I think that any tracking/instrumentation mechanism (e.g., ptrace,
software-dirty, not to mention hardware virtualization) should be designed from
the beginning with such support, as adding it at a later stage can be tricky.

4. Missing state. It would be useful to provide the TID of the faulting thread.
I will send a patch for this one once I get the necessary internal approvals.

Thanks again,
Nadav