Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

"Andy Lutomirski" <luto@xxxxxxxxxx> · Tue, 14 Sep 2021 11:40:01 -0700

On Tue, Sep 14, 2021, at 11:11 AM, Peter Zijlstra wrote:
> On Tue, Sep 14, 2021 at 09:52:08AM -0700, Andy Lutomirski wrote:
> > With a custom mapping, you don’t need to pin pages at all, I think.
> > As long as you can reconstruct the contents of the shared page and
> > you’re willing to do some slightly careful synchronization, you can
> > detect that the page is missing when you try to update it and skip the
> > update. The vm_ops->fault handler can repopulate the page the next
> > time it’s accessed.
> 
> The point is that the moment we know we need to do this user-poke, is
> schedule(), which could be called while holding mmap_sem (it being a
> preemptable lock). Which means we cannot go and do faults.

That’s fine. The page would be in one of two states: present and writable by the kernel, or completely gone. If it’s present, the scheduler writes it. If it’s gone, the scheduler skips the write and the next fault fills it in.
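
To make the two-state idea concrete, here is a rough userspace model in C11. It is only a sketch: struct shared_page, struct task_slot, sched_poke(), reclaim_page() and fault_in_page() are all hypothetical names, not anything from the patch set, and the model deliberately omits the synchronization that real code would need between an in-flight write and reclaim or refill.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the page shared with userspace. */
struct shared_page {
    _Atomic uint64_t state;
};

/* Hypothetical per-task bookkeeping. page == NULL means "completely gone". */
struct task_slot {
    struct shared_page *_Atomic page;
    uint64_t last_state;    /* authoritative copy used to rebuild the page */
};

/*
 * Scheduler side: must not fault or sleep. Always records the new state,
 * then writes the page only if it is present. Returns false if the write
 * was skipped.
 */
static bool sched_poke(struct task_slot *slot, uint64_t new_state)
{
    struct shared_page *page;

    slot->last_state = new_state;
    page = atomic_load_explicit(&slot->page, memory_order_acquire);
    if (!page)
        return false;   /* page is gone: skip, the next fault fills it in */
    atomic_store_explicit(&page->state, new_state, memory_order_release);
    return true;
}

/* Reclaim side: detach the page so later pokes see it as gone. */
static struct shared_page *reclaim_page(struct task_slot *slot)
{
    return atomic_exchange_explicit(&slot->page,
                                    (struct shared_page *)NULL,
                                    memory_order_acq_rel);
}

/*
 * Fault side (models vm_ops->fault): reconstruct the contents from the
 * authoritative copy, then publish the fresh page.
 */
static void fault_in_page(struct task_slot *slot, struct shared_page *fresh)
{
    assert(fresh != NULL);
    atomic_store_explicit(&fresh->state, slot->last_state,
                          memory_order_relaxed);
    atomic_store_explicit(&slot->page, fresh, memory_order_release);
}
```

The property that matters is that sched_poke() has no path that can fault: it either finds a page and stores to it, or sees NULL and returns. Everything that can sleep lives in fault_in_page().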

> 
> > All that being said, I feel like I’m missing something. The point of
> > this is to send what the old M:N folks called “scheduler activations”,
> > right?  Wouldn’t it be more efficient to explicitly wake something
> > blockable/pollable and write the message into a more efficient data
> > structure?  Polling one page per task from userspace seems like it
> > will have inherently high latency due to the polling interval and will
> > also have very poor locality.  Or am I missing something?
> 
> The idea was to link the user structures together in a (single) linked
> list. The server structure gets a list of all the blocked tasks. This
> avoids having to do a full N iteration (like Java, they're talking stupid
> number of N).
> 
> Polling should not happen; once we run out of runnable tasks, the server
> task gets run again and it can instantly pick up all the blocked
> notifications.
> 

How does the server task know when to read the linked list?  And what’s wrong with a ring buffer or a syscall?
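
For reference, this is roughly how I read the list scheme, modeled in userspace C11. It is a sketch under my own assumptions: struct blocked_node, struct server_sketch, push_blocked() and drain_blocked() are hypothetical names, not the structures from the patch.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-task node linked into the server's blocked list. */
struct blocked_node {
    int tid;
    struct blocked_node *next;
};

/* Hypothetical server structure holding the head of the blocked list. */
struct server_sketch {
    struct blocked_node *_Atomic blocked_head;
};

/* Producer side: push one blocked task onto the list (lock-free LIFO). */
static void push_blocked(struct server_sketch *srv, struct blocked_node *t)
{
    struct blocked_node *old =
        atomic_load_explicit(&srv->blocked_head, memory_order_relaxed);

    do {
        t->next = old;  /* on CAS failure, old is reloaded with the new head */
    } while (!atomic_compare_exchange_weak_explicit(&srv->blocked_head,
                                                    &old, t,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/*
 * Server side: take the whole list with one exchange. The cost is
 * proportional to the number of blocked tasks, not to N.
 */
static struct blocked_node *drain_blocked(struct server_sketch *srv)
{
    return atomic_exchange_explicit(&srv->blocked_head,
                                    (struct blocked_node *)NULL,
                                    memory_order_acquire);
}
```

The drain is cheap, but nothing in this model tells the server when to call drain_blocked(). That part still needs a wakeup from somewhere, which is what I am asking about.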