On Thu, Jun 10, 2021 at 11:02 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: Thanks a lot for the detailed reply! I'll try again the data-in-userspace-only route (= everything in TLS): if you are right and everything can be done without needing to lock anything - great! The last time I tried it I could not do it properly/safely, though, because I could not fix races without rcu and/or spin locking stuff, which was impossible with data in the userspace. I don't remember the specifics now, though... Thanks, Peter > > On Wed, Jun 09, 2021 at 01:18:59PM -0700, Peter Oskolkov wrote: > > On Wed, Jun 9, 2021 at 5:55 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > > > Finally, a high-level review - thanks a lot, Peter! My comments below, > > and two high-level "important questions" at the end of my reply (with > > some less important questions here and there). > > > > [...] > > > > > You present an API without explaining, *at*all*, how it's supposed to be > > > used and I can't seem to figure it out from the implementation either :/ > > > > I tried to explain it in the doc patch that I followed up with: > > https://lore.kernel.org/patchwork/cover/1433967/#1632328 > > Urgh, you write RST :-( That sorta helps, but I'm still unclear on a > number of things, more below. > > > Or do you mean it more narrowly, i.e. I do not explain syscalls in > > detail? This assessment I agree with - my approach was/is to finalize > > the userpace API (libumcg) first, and make the userspace vs kernel > > decisions later. > > Yeah, I couldn't figure out how to use the syscalls and thus how to > interpret their implementation. A little more in the way of comments > would've been helpful. > > > For example, you wonder why there is no looping in umcg_wait > > (do_wait). This is because the looping happens in the userspace in > > libumcg. My overall approach was to make the syscalls as simple as > > possible and push extra logic to the userspace. > > So a simple comment on the syscall that says: > > Userspace is expected to do: > > do { > sys_umcg_wait(); > } while (smp_load_acquire(&umcg_task->state) != RUNNING); > > would've made all the difference. It provides context. > > > It seems that this approach is not resonating with kernel > > developers/maintainers - you are the third person asking why there is > > no looping in sys_umcg_wait, despite the fact that I explicitly > > mentioned pushing it out to the userspace. > > We've been trained, through years of 'funny' bugs, to go 'BUG BUG BUG' > when schedule() is not in a loop. And pushing the loop to userspace has > me all on edge for being 'weird'. > > > Let me try to make my case once more here. > > > > umcg_wait/umcg_wake: the RUNNABLE/RUNNING state changes, checks, and > > looping happen in the userspace (libumcg - see umcg_wait/umcg_wake in > > patch 5 here: https://lore.kernel.org/patchwork/patch/1433971/), while > > the syscalls simply sleep/wake. I find doing it in the userspace is > > much simpler and easier than in the kernel, as state reads and writes > > are just atomic memory accesses; in the kernel it becomes much more > > difficult - rcu locked sections, tasks locked, etc. > > Small difficulties as far as things go I think. The worst part is having > to do arch asm for the userspace cmpxchg. Luckily we can crib/share with > futex there. > > > On the other hand I agree that having syscalls more logically > > complete, in the sense that they do not require much hand-holding and > > retries from the userspace, is probably better from the API design > > perspective. My worry here is that state validation and retries in the > > userspace are unavoidable, and so going the usual way we will end up > > with retry loops both in the kernel and in the userspace. > > Can you expand on where you'd see the need for userspace to retry? > > The canonical case in my mind is where a task, that's been BLOCKED in > kernelspace transitions to UNBLOCK/RUNNABLE in return-to-user and waits > for RUNNING. > > Once it gets RUNNING, userspace can assume it can just go. It will never > have to re-check, because there's no way RUNNING can go away again. The > only way for RUNNING to become anything else, is setting it yourself > and/or doing a syscall. > > Also, by having the BLOCKED thing block properly in return-to-user, you > don't have to wrap *all* the userspace syscall invocations. If you let > it return early, you get to wrap syscalls, which is both fragile and > bad for performance. > > > So I pose this IMPORTANT QUESTION #1 to you that I hope to get a clear > > answer to: it is strongly preferable to have syscalls be "logically > > complete" in the sense that they retry things internally, and in > > generally try to cover all possible corner cases; or, alternatively, > > is it OK to make syscalls lightweight but "logically incomplete", and > > have the accompanied userspace wrappers do all of the heavy lifting > > re: state changes/validation, retries, etc.? > > Intuitively I'd go with complete. I'd have never even considered the > incomplete option. But let me try and get my head around the incomplete > cases. > > Oooh, I found the BLOCKED stuff, you hid it inside the grouping patch, > that makes no sense :-( Reason I'm looking is that I don't see how you > get around the blocked and runnable lists. You have to tell userspace > about them. > > FWIW: I think you placed umcg_on_block() wrong, it needs to be before > the terrible PI thing. Also, like said, please avoid yet another branch > here by using PF_UMCG_WORKER. > > > I see two additional benefits of thin/lightweight syscalls: > > - reading userspace state is needed much less often (e.g. my umcg_wait > > and umcg_wake syscalls do not access userspace data at all - also see > > my "second important question" below) > > It is also broken I think, best I can make of it is somsething like > this: > > WAIT WAKE > > if (smp_load_acquire(&state) == RUNNING) > return; > > state = RUNNING; > > > do { > sys_umcg_wait() > { > in_wait = true; > sys_umcg_wake() > { > if (in_wait) > wake_up_process() > } > set_current_state(INTERRUPTIBLE); > schedule(); > in_wait = false; > } > } while (smp_load_acquire(&state) != RUNNING); > > > missed wakeup, 'forever' stuck. You have to check your blocking > condition between setting state and scheduling. And if you do that, you > have a 'fat' syscall again. > > > - looping in the kernel, combined with reading/writing to userspace > > memory, can easily lead to spinning in the kernel (e.g. trying to > > atomically change a variable and looping until succeeding) > > I don't imagine spinning in kernel or userspace matters. > > > > Also, do we really need unregister over just letting a task > > > exit? Is there a sane use-case where task goes in and out of service? > > > > I do not know of a specific use case here. On the other hand, I do not > > know of a specific use case to unregister RSEQ, but the capability is > > there. Maybe the assumption is that the userspace memory passed to the > > kernel in register() may be freed before the task exits, and so there > > should be a way to tell the kernel to no longer use it? > > Fair enough I suppose. > > > > > > > > +450 common umcg_wait sys_umcg_wait > > > > +451 common umcg_wake sys_umcg_wake > > > > > > Right, except I'm confused by the proposed implementation. I thought the > > > whole point was to let UMCG tasks block in kernel, at which point we'd > > > change their state to BLOCKED and have userspace select another task to > > > run. Such BLOCKED tasks would then also be captured before they return > > > to userspace, i.e. the whole admission scheduler thing. > > > > > > I don't see any of that in these patches. So what are they actually > > > implementing? I can't find enough clues to tell :-( > > > > As I mentioned above, state changes are done in libumcg in userspace > > here: https://lore.kernel.org/patchwork/cover/1433967/#1632328 > > > > If you insist this logic should live in the kernel, I'll do it (grudgingly). > > So you have some of it, I just didn't find it because it's hidding in > that grouping thing. > > > > > +452 common umcg_swap sys_umcg_swap > > > > > > You're presenting it like a pure optimization, but IIRC this is what > > > enables us to frob the scheduler state to ensure the whole thing is seen > > > (to the rest of the system) as the M server tasks, instead of the > > > constellation of N+M worker and server tasks. > > > > Yes, you recall it correctly. > > > > > Also, you're not doing any of the frobbing required. > > > > This is because I consider the frobbing a (very) nice to have rather > > than a required feature, and so I am hoping to argue about how to > > properly do it in later patchsets. This whole thing (UMCG) will be > > extremely useful even without runtime accounting hacking and whatnot, > > and so I hope to have everything else settled and tested and merged > > before we spend another several weeks/months trying to make the > > frobbing perfect. > > Sure, not saying you need the frobbing from the get-go, but it's a much > stronger argument for having the API in the first place. So mentioning > this property (along with a TODO) is a stronger justification. > > This goes to *why again. It's fairly easy to see what from the code, but > code rarely explains why. > > That said; if we do: @next_pid, we might be able to do away with this. A > !RUNING transition will attempt to wake-and-switch to @next_tid. This is > BLOCKED from syscall or explicit using umcg_wait(). > > > > > +455 common umcg_poll_worker sys_umcg_poll_worker > > > > > > Shouldn't this be called idle or something, instead of poll, the whole > > > point of having this syscall is to that you can indeed go idle. > > > > That's another way of looking at it. Yes, this means the server idles > > until a worker becomes available. How would you call it? umcg_idle()? > > I'm trying to digest the thing; it's doing *far* more than just idling, > but yes, sys_umcg_idle() or something. > > > > This I'm confused about again.. there is no fundamental difference > > > between a worker or server, they're all the same. > > > > I don't see it this way. A server runs (on CPU) by itself and blocks > > when there is a worker attached; a worker runs (on CPU) only when it > > has a (blocked) server attached to it and, when the worker blocks, its > > server detaches and runs another worker. So workers and servers are > > the opposite of each other. > > So I was viewing the server more like the idle thread, its 'work' is > idle, which is always available. > > > > > + */ > > > > +#define UMCG_TASK_NONE 0 > > > > +/* UMCG server states. */ > > > > +#define UMCG_TASK_POLLING 1 > > > > +#define UMCG_TASK_SERVING 2 > > > > +#define UMCG_TASK_PROCESSING 3 > > > > > > I get POLLING, although per the above, this probably wants to be IDLE. > > > > Ack. > > > > > > > > What are the other two again? That is, along with the diagram, each > > > state wants a description. > > > > SERVING: the server is blocked, its attached worker is running > > PROCESSING: the server is running (= processing a block or wake > > event), has no running worker attached > > > > Both of these states are different from POLLING/IDLE and from each other. > > But if we view the server as the worker with work 'idle', then serving > becomes RUNNABLE and PROCESSING becomes RUNNING, right? > > And sys_run_worker(next); becomes: > > self->state = RUNNABLE; > self->next_tid = next; > sys_umcg_wait(); > > The question is if we need an explicit IDLE state along with calling > sys_umcg_idle(). I can't seem to make up my mind on that. > > > > > +/* UMCG worker states. */ > > > > +#define UMCG_TASK_RUNNABLE 4 > > > > +#define UMCG_TASK_RUNNING 5 > > > > +#define UMCG_TASK_BLOCKED 6 > > > > +#define UMCG_TASK_UNBLOCKED 7 > > > > > > Weird order, also I can't remember why we need the UNBLOCKED, isn't that > > > the same as the RUNNABLE, or did we want to distinguish the state were > > > we're no longer BLOCKED but the user scheduler hasn't yet put us on it's > > > ready queue (IOW, we're on the runnable_ptr list, see below). > > > > Yes, UNBLOCKED it a transitory state meaning the worker's blocking > > operation has completed, but the wake event hasn't been delivered to > > the userspace yet (and so the worker it not yet RUNNABLE) > > So if I understand the proposal correctly the only possible option is > something like: > > for (;;) { > next = user_sched_pick(); > if (next) { > sys_umcg_run(next); > continue; > } > > sys_umcg_poll(&next); > if (next) { > next->state = RUNNABLE; > user_sched_enqueue(next); > } > } > > This seems incapable of implementing generic scheduling policies and has > a hard-coded FIFO policy. > > The poll() thing cannot differentiate between: 'find new task' and 'go > idle'. So you cannot keep running it until all new tasks are found. > > But you basically get to do a syscall to discover every new task, while > the other proposal gets you a user visible list of new tasks, no > syscalls needed at all. > > It's also not quite clear to me what you do about RUNNING->BLOCKED, how > does the userspace scheduler know to dequeue a task? > > > My proposal gets you something like: > > for (;;) { > self->state = RUNNABLE; > self->next_tid = 0; // next == self == server -> idle > > p = xchg(self->blocked_ptr, NULL); > while (p) { > n = new->blocked_ptr; > user_sched_dequeue(p); > p = n; > } > > // Worker can have unblocked again before we got here, > // hence we need to process blocked before runnable. > // Worker cannot have blocked again, since we didn't > // know it was runnable, hence it cannot have ran again. > > p = xchg(self->runnable_ptr, NULL); > while (p) { > n = new->runnable_ptr; > user_sched_enqeue(p); > p = n; > } > > n = user_sched_pick(); > if (n) > self->next_tid = n->tid; > > // new self->*_ptr state will have changed self->state > // to RUNNING and we'll not switch to ->next. > > sys_umcg_wait(); > > // self->state == RUNNING > } > > This allows you to implement arbitrary policies and instantly works with > preemption once we implement that. Preemption would put the running > worker in RUNNABLE, mark the server RUNNING and switch. > > Hmm, looking at it written out like that, we don't need sys_umcg_wake(), > sys_umcg_swap() at all. > > Anyway, and this is how I got here, UNBLOCKED is not required because we > cannot run it before we've observed it RUNNABLE. Yes the state exists > where it's no longer BLOCKED, and it's not yet on the runqueue, but when > we don't know it's RUNNABLE we'll not pick it, so its moot. > > > > So last time I really looked at this it looked something like this: > > > > > > struct umcg_task { > > > u32 umcg_status; /* r/w */ > > > u32 umcg_server_tid; /* r */ > > > u32 umcg_next_tid; /* r */ > > > u32 __hole__; > > > u64 umcg_blocked_ptr; /* w */ > > > u64 umcg_runnable_ptr; /* w */ > > > }; > > > > > > (where r/w is from the kernel's pov) > > > (also see uapi/linux/rseq.h's ptr magic) > > > > I tried doing it this way, i.e. to only have only userspace struct > > added (without kernel-only data), and I found it really cumbersome and > > inconvenient and much slower than the proposed implementation. > > > For example, when a worker blocks, it seems working with "struct > > task_struct *peer" to get to the worker's server is easy and > > straightforward; reading server_tid from userspace, then looking up > > the task and only then doing what is needed (change state and wakeup) > > is ... unnecessary? > > Is find_task_by_vpid() really that slow? The advantage of having it in > userspace is that you can very easily change 'affinities' of the > workers. You can simply set ->server_tid and it goes elsewhere. > > > Also validating things becomes really important > > but difficult (what if the user put something weird in > > umcg_server_tid? or the ptr fields?). > > If find_task_by_vpid() returns NULL, we return -ESRCH. If the user > cmpxchg returns -EFAULT we pass along the message. If userspace put a > valid but crap pointer in it, userspace gets to keep the pieces. > > > In my proposed implementation only the state is user-writable, and it > > does not really affect most of the kernel-side work. > > > > Why do you think everything should be in the userspace memory? > > Because then we avoid all the kernel state and userspace gets to have > all the state without endless syscalls. > > Note that with the proposal, per the above, we're at: > > enum { > UMCG_STATE_RUNNING, > UMCG_STATE_RUNABLE, > UMCG_STATE_BLOCKED, > }; > > struct umcg_task { > u32 umcg_status; /* r/w */ > u32 umcg_server_tid; /* r */ > u32 umcg_next_tid; /* r */ > u32 umcg_tid; /* r */ > u64 umcg_blocked_ptr; /* w */ > u64 umcg_runnable_ptr; /* w */ > }; > > /* > * Register current's UMCG state. > */ > sys_umcg_register(struct umcg_task *self, unsigned int flags); > > /* > * Just 'cause. > */ > sys_umcg_unregister(struct umcg_task *self) > > /* > * UMCG context switch. > */ > sys_umcg_wait(u64 time, unsigned int flags) > { > unsigned int state = RUNNABLE; > unsigned int tid; > > if (self->state == RUNNING) > return; > > tid = self->next_tid; > if (!tid) > tid = self->server_tid; > > if (tid == self->server_tid && tid == self->tid) > return umcg_idle(time, flags); > > next = find_process_by_pid(tid); > if (!next) { > return -ESRCH; > > ret = user_try_cmpxchg(next->umcg->state, &state, RUNNING); > if (!ret) > ret = -EBUSY; > if (ret < 0) > return ret; > > return umcg_switch_to(next); > } > > With this (and the BLOCKING bits outlined last time) we can implement > full N:1 userspace scheduling (UP). > > ( Note that so far we assume all UMCG workers share the same address > space, otherwise the user_try_cmpxchg() doesn't work. ) > > And I _think_ you can do the whole SMP thing in userspace as well, just > have the servers share queue state and reassign ->server_tid where > needed. No additional syscalls required. > > > All of the code above assumes userspace-only data. I did not look into > > every detail of your suggestions because I want to make sure we first > > agree on this: do we keep every bit of information in the userspace > > (other than "struct umcg_task __user *" pointer in task_struct) or do > > we have some kernel-only details as well? > > Most of the kernel state you seem to have implemented seems to limit > flexibility / implement specific policy. All because apparently > find_task_by_vpid() is considered expensive? > > You've enangled the whole BLOCKING stuff with the SMP stuff. And by > putting that state in the kernel you've limited flexibility. > > Also, if you don't have kernel state it can't go out of sync and cause > problems. > > > So IMPORTANT QUESTION #2: why would we want to keep __everything__ in > > the userspace memory? I understand that CRIU would like this, but > > given that the implementation would at a minimum have to > > > > 1. read a umcg_server_ptr (points to the server's umcg_task) > > 2. get the server tid out of it (presumably by reading a field from > > the server's umcg_task; what if the tid is wrong?) > > 3. do a tid lookup > > So if we leave SMP as an exercise in scheduling queue management, And > implement the above, then you need: > > - copy_from_user()/get_user() for the first 4 words > - find_task_by_vpid() > > that gets you a task pointer, then we get to update a blocked_ptr. > > If anything goes wrong, simply return an error and let userspace sort it > out. > > > to get a task_struct pointer, it will be slower; I am also not sure it > > call be done safely at all: with kernel-side data and I can do rcu > > locking, task locking, etc. to ensure that the value I got does not > > change while I'm working with it; with userspace data, a lot of races > > will have to be specially coded for that can be easily handled by > > kernel-side rcu locks or spin locks... Maybe this is just my ignorance > > showing, and indeed things can be done simply and easily with > > userspace-only data, but I am not sure how. > > > > A common example: > > > > - worker W1 with server S1 calls umcg_wait() > > - worker W2 with server S2 calls umcg_swap(W1) > > > > If due to preemption and other concurrency weirdness the two syscalls > > above race with each other, each trying to change the server assigned > > to W1. I can easily handle the race by doing kernel-side locking; > > without kernel-side locking (cannot do rcu locks and/or spin locks > > while accessing userspace data) I am not sure how to handle the race. > > Maybe it is possible with careful atomic writes to states and looping > > to handle this specific race (what if the userspace antagonistically > > writes to the same location? will it force the syscall to spin > > indefinitely?); but with proper locking many potential races can be > > handled; with atomic ops and looping it is more difficult... Will we > > have to add a lock to struct umcg_task? And acquire it from the kernel > > side? And worry about spinning forever? > > What would you want locking for? I really don't see a problem here. > > Both blocked_ptr and runnable_ptr are cmpxchg single-linked-lists. Yes > they can spin a little, but that's not a new problem, futex has all > that. > > And ->state only needs single cmpxchg ops, no loops, either we got the > wakeup, or we didn't. The rest is done with memory ordering: > > > server worker > > self->state = RUNNABLE; self->state = BLOCKED; > > head = xchg(list, NULL) add_to_list(self, &server->blocked_ptr); > > if (try_cmpxchg_user(&server->umcg->state, RUNNABLE, RUNNING) > 0) > sys_umcg_wait() wake_up_process(server); > > > Either server sees the add, or we see it's RUNNABLE and wake it up > (or both). > > If anything on the BLOCKED side goes wrong (bad pointers, whatever), > have it segfault. > > <edit> Ooh, I forgot you can't just go wake the server when it's running > something else... so that does indeed need more states/complication, the > ordering argument stands though. We'll need something like > self->current_tid or somesuch </edit> > > > > > > +SYSCALL_DEFINE2(umcg_wait, u32, flags, > > > > + const struct __kernel_timespec __user *, timeout) > > > > > > I despise timespec, tglx? > > > > What are the alternatives? I just picked what the futex code uses. > > u64 nanoseconds. Not sure tglx really wants to do that though, but > still, timespec is a terrible thing. > > > In summary, two IMPORTANT QUESTIONS: > > > > 1. thin vs fat syscalls: can we push some code/logic to the userspace > > (state changes, looping/retries), or do we insist on syscalls handling > > everything? > > Well, the way I see it it's a trade of what is handled where. I get a > smaller API (although I'm sure I've forgotten something trivial again > that wrecks everything <edit> I did :/ </edit>) and userspace gets to > deal with all of SMP and scheduling policies. > > You hard-coded a global-fifo and had a enormous number of syscalls and > needed to wrap every syscall invocation in order to fix up the return. > > > Please have in mind that even if we choose the second > > approach (fat syscalls), the userspace will most likely still have to > > do everything it does under the first option just to handle > > signals/interrupts (i.e. unscheduled wakeups); > > IIRC sigreturn goes back into the kernel and we can resume blocking > there. > > > 2. kernel-side data vs userspace-only: can we avoid having kernel-side > > data? More specifically, what alternatives to rcu_read_lock and/or > > task_lock are available when working with userspace data? > > What would you want locked and why? > > > Anyway, this email is far too long again (basically took me all day :/), > hope it helps a bit. Thomas is stuck fixing XSAVE disasters, but I'll > ask him to chime in once that's done. > >