On Wed, Sep 15, 2021 at 8:45 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Sep 14, 2021 at 11:40:01AM -0700, Andy Lutomirski wrote:
> >
> > On Tue, Sep 14, 2021, at 11:11 AM, Peter Zijlstra wrote:
> > > On Tue, Sep 14, 2021 at 09:52:08AM -0700, Andy Lutomirski wrote:
> > > > With a custom mapping, you don’t need to pin pages at all, I think.
> > > > As long as you can reconstruct the contents of the shared page and
> > > > you’re willing to do some slightly careful synchronization, you can
> > > > detect that the page is missing when you try to update it and skip
> > > > the update. The vm_ops->fault handler can repopulate the page the
> > > > next time it’s accessed.
> > >
> > > The point is that the moment we know we need to do this user-poke is
> > > schedule(), which could be called while holding mmap_sem (it being a
> > > preemptible lock). Which means we cannot go and do faults.
> >
> > That’s fine. The page would be in one of two states: present and
> > writable by the kernel, or completely gone. If it’s present, the
> > scheduler writes it. If it’s gone, the scheduler skips the write and
> > the next fault fills it in.
>
> That's non-deterministic, and as such not suitable.

What's the precise problem?  The code would be roughly:

    if (try_pin_the_page) {
        write it;
        unpin;
    } else {
        do nothing -- .fault will fill in the correct contents.
    }

The time this takes is nondeterministic, but it's bounded and short.

> > > > All that being said, I feel like I’m missing something. The point
> > > > of this is to send what the old M:N folks called “scheduler
> > > > activations”, right? Wouldn’t it be more efficient to explicitly
> > > > wake something blockable/pollable and write the message into a more
> > > > efficient data structure? Polling one page per task from userspace
> > > > seems like it will have inherently high latency due to the polling
> > > > interval and will also have very poor locality. Or am I missing
> > > > something?
> > > The idea was to link the user structures together in a (single)
> > > linked list. The server structure gets a list of all the blocked
> > > tasks. This avoids having to do a full N iteration (like Java --
> > > they're talking stupid numbers of N).
> > >
> > > Polling should not happen; once we run out of runnable tasks, the
> > > server task gets run again and it can instantly pick up all the
> > > blocked notifications.
> >
> > How does the server task know when to read the linked list? And
> > what's wrong with a ring buffer or a syscall?
>
> Same problem: a ring buffer has the case where it's full and events get
> dropped, at which point you've completely lost state. If it is at all
> possible to recover from that, doing so is non-deterministic.
>
> I really want this stuff to work for realtime workloads too.

A ring buffer would have a bounded size -- one word (of whatever size)
per user thread.