Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset

Peter Oskolkov <posk@xxxxxxxxxx> · Thu, 10 Jun 2021 13:06:01 -0700

On Thu, Jun 10, 2021 at 11:02 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

Thanks a lot for the detailed reply!

I'll try again the data-in-userspace-only route (= everything in TLS):
if you are right and everything can be done without needing to lock
anything - great!

The last time I tried it I could not do it properly/safely, though,
because I could not fix races without rcu and/or spin locking stuff,
which was impossible with data in the userspace. I don't remember the
specifics now, though...

Thanks,
Peter

>
> On Wed, Jun 09, 2021 at 01:18:59PM -0700, Peter Oskolkov wrote:
> > On Wed, Jun 9, 2021 at 5:55 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > Finally, a high-level review - thanks a lot, Peter! My comments below,
> > and two high-level "important questions" at the end of my reply (with
> > some less important questions here and there).
> >
> > [...]
> >
> > > You present an API without explaining, *at*all*, how it's supposed to be
> > > used and I can't seem to figure it out from the implementation either :/
> >
> > I tried to explain it in the doc patch that I followed up with:
> > https://lore.kernel.org/patchwork/cover/1433967/#1632328
>
> Urgh, you write RST :-( That sorta helps, but I'm still unclear on a
> number of things, more below.
>
> > Or do you mean it more narrowly, i.e. I do not explain syscalls in
> > detail? This assessment I agree with - my approach was/is to finalize
> > the userpace API (libumcg) first, and make the userspace vs kernel
> > decisions later.
>
> Yeah, I couldn't figure out how to use the syscalls and thus how to
> interpret their implementation. A little more in the way of comments
> would've been helpful.
>
> > For example, you wonder why there is no looping in umcg_wait
> > (do_wait). This is because the looping happens in the userspace in
> > libumcg. My overall approach was to make the syscalls as simple as
> > possible and push extra logic to the userspace.
>
> So a simple comment on the syscall that says:
>
>   Userspace is expected to do:
>
>         do {
>                 sys_umcg_wait();
>         } while (smp_load_acquire(&umcg_task->state) != RUNNING);
>
> would've made all the difference. It provides context.
>
> > It seems that this approach is not resonating with kernel
> > developers/maintainers - you are the third person asking why there is
> > no looping in sys_umcg_wait, despite the fact that I explicitly
> > mentioned pushing it out to the userspace.
>
> We've been trained, through years of 'funny' bugs, to go 'BUG BUG BUG'
> when schedule() is not in a loop. And pushing the loop to userspace has
> me all on edge for being 'weird'.
>
> > Let me try to make my case once more here.
> >
> > umcg_wait/umcg_wake: the RUNNABLE/RUNNING state changes, checks, and
> > looping happen in the userspace (libumcg - see umcg_wait/umcg_wake in
> > patch 5 here: https://lore.kernel.org/patchwork/patch/1433971/), while
> > the syscalls simply sleep/wake. I find doing it in the userspace is
> > much simpler and easier than in the kernel, as state reads and writes
> > are just atomic memory accesses; in the kernel it becomes much more
> > difficult - rcu locked sections, tasks locked, etc.
>
> Small difficulties as far as things go I think. The worst part is having
> to do arch asm for the userspace cmpxchg. Luckily we can crib/share with
> futex there.
>
> > On the other hand I agree that having syscalls more logically
> > complete, in the sense that they do not require much hand-holding and
> > retries from the userspace, is probably better from the API design
> > perspective. My worry here is that state validation and retries in the
> > userspace are unavoidable, and so going the usual way we will end up
> > with retry loops both in the kernel and in the userspace.
>
> Can you expand on where you'd see the need for userspace to retry?
>
> The canonical case in my mind is where a task, that's been BLOCKED in
> kernelspace transitions to UNBLOCK/RUNNABLE in return-to-user and waits
> for RUNNING.
>
> Once it gets RUNNING, userspace can assume it can just go. It will never
> have to re-check, because there's no way RUNNING can go away again. The
> only way for RUNNING to become anything else, is setting it yourself
> and/or doing a syscall.
>
> Also, by having the BLOCKED thing block properly in return-to-user, you
> don't have to wrap *all* the userspace syscall invocations. If you let
> it return early, you get to wrap syscalls, which is both fragile and
> bad for performance.
>
> > So I pose this IMPORTANT QUESTION #1 to you that I hope to get a clear
> > answer to: it is strongly preferable to have syscalls be "logically
> > complete" in the sense that they retry things internally, and in
> > generally try to cover all possible corner cases; or, alternatively,
> > is it OK to make syscalls lightweight but "logically incomplete", and
> > have the accompanied userspace wrappers do all of the heavy lifting
> > re: state changes/validation, retries, etc.?
>
> Intuitively I'd go with complete. I'd have never even considered the
> incomplete option. But let me try and get my head around the incomplete
> cases.
>
> Oooh, I found the BLOCKED stuff, you hid it inside the grouping patch,
> that makes no sense :-( Reason I'm looking is that I don't see how you
> get around the blocked and runnable lists. You have to tell userspace
> about them.
>
> FWIW: I think you placed umcg_on_block() wrong, it needs to be before
> the terrible PI thing. Also, like said, please avoid yet another branch
> here by using PF_UMCG_WORKER.
>
> > I see two additional benefits of thin/lightweight syscalls:
> > - reading userspace state is needed much less often (e.g. my umcg_wait
> > and umcg_wake syscalls do not access userspace data at all - also see
> > my "second important question" below)
>
> It is also broken I think, best I can make of it is somsething like
> this:
>
>     WAIT                                                WAKE
>
>     if (smp_load_acquire(&state) == RUNNING)
>         return;
>
>                                                         state = RUNNING;
>
>
>     do {
>       sys_umcg_wait()
>       {
>         in_wait = true;
>                                                         sys_umcg_wake()
>                                                         {
>                                                           if (in_wait)
>                                                             wake_up_process()
>                                                         }
>         set_current_state(INTERRUPTIBLE);
>         schedule();
>         in_wait = false;
>       }
>     } while (smp_load_acquire(&state) != RUNNING);
>
>
> missed wakeup, 'forever' stuck. You have to check your blocking
> condition between setting state and scheduling. And if you do that, you
> have a 'fat' syscall again.
>
> > - looping in the kernel, combined with reading/writing to userspace
> > memory, can easily lead to spinning in the kernel (e.g. trying to
> > atomically change a variable and looping until succeeding)
>
> I don't imagine spinning in kernel or userspace matters.
>
> > > Also, do we really need unregister over just letting a task
> > > exit? Is there a sane use-case where task goes in and out of service?
> >
> > I do not know of a specific use case here. On the other hand, I do not
> > know of a specific use case to unregister RSEQ, but the capability is
> > there. Maybe the assumption is that the userspace memory passed to the
> > kernel in register() may be freed before the task exits, and so there
> > should be a way to tell the kernel to no longer use it?
>
> Fair enough I suppose.
>
> > >
> > > > +450  common  umcg_wait               sys_umcg_wait
> > > > +451  common  umcg_wake               sys_umcg_wake
> > >
> > > Right, except I'm confused by the proposed implementation. I thought the
> > > whole point was to let UMCG tasks block in kernel, at which point we'd
> > > change their state to BLOCKED and have userspace select another task to
> > > run. Such BLOCKED tasks would then also be captured before they return
> > > to userspace, i.e. the whole admission scheduler thing.
> > >
> > > I don't see any of that in these patches. So what are they actually
> > > implementing? I can't find enough clues to tell :-(
> >
> > As I mentioned above, state changes are done in libumcg in userspace
> > here: https://lore.kernel.org/patchwork/cover/1433967/#1632328
> >
> > If you insist this logic should live in the kernel, I'll do it (grudgingly).
>
> So you have some of it, I just didn't find it because it's hidding in
> that grouping thing.
>
> > > > +452  common  umcg_swap               sys_umcg_swap
> > >
> > > You're presenting it like a pure optimization, but IIRC this is what
> > > enables us to frob the scheduler state to ensure the whole thing is seen
> > > (to the rest of the system) as the M server tasks, instead of the
> > > constellation of N+M worker and server tasks.
> >
> > Yes, you recall it correctly.
> >
> > > Also, you're not doing any of the frobbing required.
> >
> > This is because I consider the frobbing a (very) nice to have rather
> > than a required feature, and so I am hoping to argue about how to
> > properly do it in later patchsets. This whole thing (UMCG) will be
> > extremely useful even without runtime accounting hacking and whatnot,
> > and so I hope to have everything else settled and tested and merged
> > before we spend another several weeks/months trying to make the
> > frobbing perfect.
>
> Sure, not saying you need the frobbing from the get-go, but it's a much
> stronger argument for having the API in the first place. So mentioning
> this property (along with a TODO) is a stronger justification.
>
> This goes to *why again. It's fairly easy to see what from the code, but
> code rarely explains why.
>
> That said; if we do: @next_pid, we might be able to do away with this. A
> !RUNING transition will attempt to wake-and-switch to @next_tid. This is
> BLOCKED from syscall or explicit using umcg_wait().
>
> > > > +455  common  umcg_poll_worker        sys_umcg_poll_worker
> > >
> > > Shouldn't this be called idle or something, instead of poll, the whole
> > > point of having this syscall is to that you can indeed go idle.
> >
> > That's another way of looking at it. Yes, this means the server idles
> > until a worker becomes available. How would you call it? umcg_idle()?
>
> I'm trying to digest the thing; it's doing *far* more than just idling,
> but yes, sys_umcg_idle() or something.
>
> > > This I'm confused about again.. there is no fundamental difference
> > > between a worker or server, they're all the same.
> >
> > I don't see it this way. A server runs (on CPU) by itself and blocks
> > when there is a worker attached; a worker runs (on CPU) only when it
> > has a (blocked) server attached to it and, when the worker blocks, its
> > server detaches and runs another worker. So workers and servers are
> > the opposite of each other.
>
> So I was viewing the server more like the idle thread, its 'work' is
> idle, which is always available.
>
> > > > + */
> > > > +#define UMCG_TASK_NONE                       0
> > > > +/* UMCG server states. */
> > > > +#define UMCG_TASK_POLLING            1
> > > > +#define UMCG_TASK_SERVING            2
> > > > +#define UMCG_TASK_PROCESSING         3
> > >
> > > I get POLLING, although per the above, this probably wants to be IDLE.
> >
> > Ack.
> >
> > >
> > > What are the other two again? That is, along with the diagram, each
> > > state wants a description.
> >
> > SERVING: the server is blocked, its attached worker is running
> > PROCESSING: the server is running (= processing a block or wake
> > event), has no running worker attached
> >
> > Both of these states are different from POLLING/IDLE and from each other.
>
> But if we view the server as the worker with work 'idle', then serving
> becomes RUNNABLE and PROCESSING becomes RUNNING, right?
>
> And sys_run_worker(next); becomes:
>
>         self->state = RUNNABLE;
>         self->next_tid = next;
>         sys_umcg_wait();
>
> The question is if we need an explicit IDLE state along with calling
> sys_umcg_idle(). I can't seem to make up my mind on that.
>
> > > > +/* UMCG worker states. */
> > > > +#define UMCG_TASK_RUNNABLE           4
> > > > +#define UMCG_TASK_RUNNING            5
> > > > +#define UMCG_TASK_BLOCKED            6
> > > > +#define UMCG_TASK_UNBLOCKED          7
> > >
> > > Weird order, also I can't remember why we need the UNBLOCKED, isn't that
> > > the same as the RUNNABLE, or did we want to distinguish the state were
> > > we're no longer BLOCKED but the user scheduler hasn't yet put us on it's
> > > ready queue (IOW, we're on the runnable_ptr list, see below).
> >
> > Yes, UNBLOCKED it a transitory state meaning the worker's blocking
> > operation has completed, but the wake event hasn't been delivered to
> > the userspace yet (and so the worker it not yet RUNNABLE)
>
> So if I understand the proposal correctly the only possible option is
> something like:
>
>         for (;;) {
>                 next = user_sched_pick();
>                 if (next) {
>                         sys_umcg_run(next);
>                         continue;
>                 }
>
>                 sys_umcg_poll(&next);
>                 if (next) {
>                         next->state = RUNNABLE;
>                         user_sched_enqueue(next);
>                 }
>         }
>
> This seems incapable of implementing generic scheduling policies and has
> a hard-coded FIFO policy.
>
> The poll() thing cannot differentiate between: 'find new task' and 'go
> idle'. So you cannot keep running it until all new tasks are found.
>
> But you basically get to do a syscall to discover every new task, while
> the other proposal gets you a user visible list of new tasks, no
> syscalls needed at all.
>
> It's also not quite clear to me what you do about RUNNING->BLOCKED, how
> does the userspace scheduler know to dequeue a task?
>
>
> My proposal gets you something like:
>
>         for (;;) {
>                 self->state = RUNNABLE;
>                 self->next_tid = 0; // next == self == server -> idle
>
>                 p = xchg(self->blocked_ptr, NULL);
>                 while (p) {
>                         n = new->blocked_ptr;
>                         user_sched_dequeue(p);
>                         p = n;
>                 }
>
>                 // Worker can have unblocked again before we got here,
>                 // hence we need to process blocked before runnable.
>                 // Worker cannot have blocked again, since we didn't
>                 // know it was runnable, hence it cannot have ran again.
>
>                 p = xchg(self->runnable_ptr, NULL);
>                 while (p) {
>                         n = new->runnable_ptr;
>                         user_sched_enqeue(p);
>                         p = n;
>                 }
>
>                 n = user_sched_pick();
>                 if (n)
>                         self->next_tid = n->tid;
>
>                 // new self->*_ptr state will have changed self->state
>                 // to RUNNING and we'll not switch to ->next.
>
>                 sys_umcg_wait();
>
>                 // self->state == RUNNING
>         }
>
> This allows you to implement arbitrary policies and instantly works with
> preemption once we implement that. Preemption would put the running
> worker in RUNNABLE, mark the server RUNNING and switch.
>
> Hmm, looking at it written out like that, we don't need sys_umcg_wake(),
> sys_umcg_swap() at all.
>
> Anyway, and this is how I got here, UNBLOCKED is not required because we
> cannot run it before we've observed it RUNNABLE. Yes the state exists
> where it's no longer BLOCKED, and it's not yet on the runqueue, but when
> we don't know it's RUNNABLE we'll not pick it, so its moot.
>
> > > So last time I really looked at this it looked something like this:
> > >
> > > struct umcg_task {
> > >         u32     umcg_status;            /* r/w */
> > >         u32     umcg_server_tid;        /* r   */
> > >         u32     umcg_next_tid;          /* r   */
> > >         u32     __hole__;
> > >         u64     umcg_blocked_ptr;       /*   w */
> > >         u64     umcg_runnable_ptr;      /*   w */
> > > };
> > >
> > > (where r/w is from the kernel's pov)
> > > (also see uapi/linux/rseq.h's ptr magic)
> >
> > I tried doing it this way, i.e. to only have only userspace struct
> > added (without kernel-only data), and I found it really cumbersome and
> > inconvenient and much slower than the proposed implementation.
>
> > For example, when a worker blocks, it seems working with "struct
> > task_struct *peer" to get to the worker's server is easy and
> > straightforward; reading server_tid from userspace, then looking up
> > the task and only then doing what is needed (change state and wakeup)
> > is ... unnecessary?
>
> Is find_task_by_vpid() really that slow? The advantage of having it in
> userspace is that you can very easily change 'affinities' of the
> workers. You can simply set ->server_tid and it goes elsewhere.
>
> > Also validating things becomes really important
> > but difficult (what if the user put something weird in
> > umcg_server_tid? or the ptr fields?).
>
> If find_task_by_vpid() returns NULL, we return -ESRCH. If the user
> cmpxchg returns -EFAULT we pass along the message. If userspace put a
> valid but crap pointer in it, userspace gets to keep the pieces.
>
> > In my proposed implementation only the state is user-writable, and it
> > does not really affect most of the kernel-side work.
> >
> > Why do you think everything should be in the userspace memory?
>
> Because then we avoid all the kernel state and userspace gets to have
> all the state without endless syscalls.
>
> Note that with the proposal, per the above, we're at:
>
> enum {
>         UMCG_STATE_RUNNING,
>         UMCG_STATE_RUNABLE,
>         UMCG_STATE_BLOCKED,
> };
>
> struct umcg_task {
>         u32     umcg_status;            /* r/w */
>         u32     umcg_server_tid;        /* r   */
>         u32     umcg_next_tid;          /* r   */
>         u32     umcg_tid;               /* r   */
>         u64     umcg_blocked_ptr;       /*   w */
>         u64     umcg_runnable_ptr;      /*   w */
> };
>
> /*
>  * Register current's UMCG state.
>  */
> sys_umcg_register(struct umcg_task *self, unsigned int flags);
>
> /*
>  * Just 'cause.
>  */
> sys_umcg_unregister(struct umcg_task *self)
>
> /*
>  * UMCG context switch.
>  */
> sys_umcg_wait(u64 time, unsigned int flags)
> {
>         unsigned int state = RUNNABLE;
>         unsigned int tid;
>
>         if (self->state == RUNNING)
>                 return;
>
>         tid = self->next_tid;
>         if (!tid)
>                 tid = self->server_tid;
>
>         if (tid == self->server_tid && tid == self->tid)
>                 return umcg_idle(time, flags);
>
>         next = find_process_by_pid(tid);
>         if (!next) {
>                 return -ESRCH;
>
>         ret = user_try_cmpxchg(next->umcg->state, &state, RUNNING);
>         if (!ret)
>                 ret = -EBUSY;
>         if (ret < 0)
>                 return ret;
>
>         return umcg_switch_to(next);
> }
>
> With this (and the BLOCKING bits outlined last time) we can implement
> full N:1 userspace scheduling (UP).
>
> ( Note that so far we assume all UMCG workers share the same address
>   space, otherwise the user_try_cmpxchg() doesn't work. )
>
> And I _think_ you can do the whole SMP thing in userspace as well, just
> have the servers share queue state and reassign ->server_tid where
> needed. No additional syscalls required.
>
> > All of the code above assumes userspace-only data. I did not look into
> > every detail of your suggestions because I want to make sure we first
> > agree on this: do we keep every bit of information in the userspace
> > (other than "struct umcg_task __user *" pointer in task_struct) or do
> > we have some kernel-only details as well?
>
> Most of the kernel state you seem to have implemented seems to limit
> flexibility / implement specific policy. All because apparently
> find_task_by_vpid() is considered expensive?
>
> You've enangled the whole BLOCKING stuff with the SMP stuff. And by
> putting that state in the kernel you've limited flexibility.
>
> Also, if you don't have kernel state it can't go out of sync and cause
> problems.
>
> > So IMPORTANT QUESTION #2: why would we want to keep __everything__ in
> > the userspace memory? I understand that CRIU would like this, but
> > given that the implementation would at a minimum have to
> >
> > 1. read a umcg_server_ptr (points to the server's umcg_task)
> > 2. get the server tid out of it (presumably by reading a field from
> > the server's umcg_task; what if the tid is wrong?)
> > 3. do a tid lookup
>
> So if we leave SMP as an exercise in scheduling queue management, And
> implement the above, then you need:
>
>  - copy_from_user()/get_user() for the first 4 words
>  - find_task_by_vpid()
>
> that gets you a task pointer, then we get to update a blocked_ptr.
>
> If anything goes wrong, simply return an error and let userspace sort it
> out.
>
> > to get a task_struct pointer, it will be slower; I am also not sure it
> > call be done safely at all: with kernel-side data and I can do rcu
> > locking, task locking, etc. to ensure that the value I got does not
> > change while I'm working with it; with userspace data, a lot of races
> > will have to be specially coded for that can be easily handled by
> > kernel-side rcu locks or spin locks... Maybe this is just my ignorance
> > showing, and indeed things can be done simply and easily with
> > userspace-only data, but I am not sure how.
> >
> > A common example:
> >
> > - worker W1 with server S1 calls umcg_wait()
> > - worker W2 with server S2 calls umcg_swap(W1)
> >
> > If due to preemption and other concurrency weirdness the two syscalls
> > above race with each other, each trying to change the server assigned
> > to W1. I can easily handle the race by doing kernel-side locking;
> > without kernel-side locking (cannot do rcu locks and/or spin locks
> > while accessing userspace data) I am not sure how to handle the race.
> > Maybe it is possible with careful atomic writes to states and looping
> > to handle this specific race (what if the userspace antagonistically
> > writes to the same location? will it force the syscall to spin
> > indefinitely?); but with proper locking many potential races can be
> > handled; with atomic ops and looping it is more difficult... Will we
> > have to add a lock to struct umcg_task? And acquire it from the kernel
> > side? And worry about spinning forever?
>
> What would you want locking for? I really don't see a problem here.
>
> Both blocked_ptr and runnable_ptr are cmpxchg single-linked-lists. Yes
> they can spin a little, but that's not a new problem, futex has all
> that.
>
> And ->state only needs single cmpxchg ops, no loops, either we got the
> wakeup, or we didn't. The rest is done with memory ordering:
>
>
>         server                          worker
>
>         self->state = RUNNABLE;         self->state = BLOCKED;
>
>         head = xchg(list, NULL)         add_to_list(self, &server->blocked_ptr);
>
>                                         if (try_cmpxchg_user(&server->umcg->state, RUNNABLE, RUNNING) > 0)
>         sys_umcg_wait()                         wake_up_process(server);
>
>
> Either server sees the add, or we see it's RUNNABLE and wake it up
> (or both).
>
> If anything on the BLOCKED side goes wrong (bad pointers, whatever),
> have it segfault.
>
> <edit> Ooh, I forgot you can't just go wake the server when it's running
> something else... so that does indeed need more states/complication, the
> ordering argument stands though. We'll need something like
> self->current_tid or somesuch </edit>
>
>
> > > > +SYSCALL_DEFINE2(umcg_wait, u32, flags,
> > > > +             const struct __kernel_timespec __user *, timeout)
> > >
> > > I despise timespec, tglx?
> >
> > What are the alternatives? I just picked what the futex code uses.
>
> u64 nanoseconds. Not sure tglx really wants to do that though, but
> still, timespec is a terrible thing.
>
> > In summary, two IMPORTANT QUESTIONS:
> >
> > 1. thin vs fat syscalls: can we push some code/logic to the userspace
> > (state changes, looping/retries), or do we insist on syscalls handling
> > everything?
>
> Well, the way I see it it's a trade of what is handled where. I get a
> smaller API (although I'm sure I've forgotten something trivial again
> that wrecks everything <edit> I did :/ </edit>) and userspace gets to
> deal with all of SMP and scheduling policies.
>
> You hard-coded a global-fifo and had a enormous number of syscalls and
> needed to wrap every syscall invocation in order to fix up the return.
>
> > Please have in mind that even if we choose the second
> > approach (fat syscalls), the userspace will most likely still have to
> > do everything it does under the first option just to handle
> > signals/interrupts (i.e. unscheduled wakeups);
>
> IIRC sigreturn goes back into the kernel and we can resume blocking
> there.
>
> > 2. kernel-side data vs userspace-only: can we avoid having kernel-side
> > data? More specifically, what alternatives to rcu_read_lock and/or
> > task_lock are available when working with userspace data?
>
> What would you want locked and why?
>
>
> Anyway, this email is far too long again (basically took me all day :/),
> hope it helps a bit. Thomas is stuck fixing XSAVE disasters, but I'll
> ask him to chime in once that's done.
>
>