On Wed, Sep 15, 2021 at 8:45 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Sep 14, 2021 at 11:40:01AM -0700, Andy Lutomirski wrote:
> >
> > On Tue, Sep 14, 2021, at 11:11 AM, Peter Zijlstra wrote:
> > > On Tue, Sep 14, 2021 at 09:52:08AM -0700, Andy Lutomirski wrote:
> > > > With a custom mapping, you don’t need to pin pages at all, I think.
> > > > As long as you can reconstruct the contents of the shared page and
> > > > you’re willing to do some slightly careful synchronization, you can
> > > > detect that the page is missing when you try to update it and skip
> > > > the update. The vm_ops->fault handler can repopulate the page the
> > > > next time it’s accessed.
> > >
> > > The point is that the moment we know we need to do this user-poke is
> > > schedule(), which could be called while holding mmap_sem (it being a
> > > preemptible lock). Which means we cannot go and do faults.
> >
> > That’s fine. The page would be in one of two states: present and
> > writable by the kernel, or completely gone. If it’s present, the
> > scheduler writes it. If it’s gone, the scheduler skips the write and
> > the next fault fills it in.
>
> That's non-deterministic, and as such not suitable.

What's the precise problem?  The code would be roughly:

    if (try_pin_the_page) {
        write it;
        unpin;
    } else {
        do nothing -- .fault will fill in the correct contents.
    }

The time this takes is nondeterministic, but it's bounded and short.

> > > > All that being said, I feel like I’m missing something. The point
> > > > of this is to send what the old M:N folks called “scheduler
> > > > activations”, right? Wouldn’t it be more efficient to explicitly
> > > > wake something blockable/pollable and write the message into a more
> > > > efficient data structure? Polling one page per task from userspace
> > > > seems like it will have inherently high latency due to the polling
> > > > interval and will also have very poor locality. Or am I missing
> > > > something?
> > > The idea was to link the user structures together in a (single)
> > > linked list. The server structure gets a list of all the blocked
> > > tasks. This avoids having to do a full N iteration (like Java --
> > > they're talking stupid numbers of N).
> > >
> > > Polling should not happen; once we run out of runnable tasks, the
> > > server task gets run again and it can instantly pick up all the
> > > blocked notifications.
> >
> > How does the server task know when to read the linked list? And
> > what's wrong with a ring buffer or a syscall?
>
> Same problem: a ring buffer has the case where it's full and events get
> dropped, at which point you've completely lost state. If it is at all
> possible to recover from that, doing so is non-deterministic.
>
> I really want this stuff to work for realtime workloads too.

A ring buffer would have a bounded size -- one word (of whatever size)
per user thread.