On Sun, Nov 28, 2021 at 04:29:11PM -0800, Peter Oskolkov wrote:

> wait_wake_only is not needed if you have both next_tid and server_tid,
> as your patch has. In my version of the patch, next_tid is the same as
> server_tid, so the flag is needed to indicate to the kernel that
> next_tid is the wakee, not the server.

Ah, okay.

> re: (idle_)server_tid_ptr: it seems that you assume that blocked
> workers keep their servers, while in my patch they "lose them" once
> they block, and so there should be a global idle server pointer to
> wake the server in my scheme (if there is an idle one). The main
> difference is that in my approach a server has only a single, running,
> worker assigned to it, while in your approach it can have a number of
> blocked/idle workers to take care of as well.

Correct; I've been thinking in analogues of the way we schedule CPUs.
Each CPU has a ready/run queue along with the current task.

Fundamentally, the RUNNABLE tasks need to go somewhere when all servers
are busy, so at that point the previous server is as good a place as
any.

Now, I sympathise with a blocked task not having a relation; I often
argue the same, since we have wakeup balancing etc. But I've not really
thought about how best to do wakeup-balancing here; also see below.

> The main difference between our approaches, as I see it: in my
> approach if a worker is running, its server is sleeping, period. If we
> have N servers, and N running workers, there are no servers to wake
> when a previously blocked worker finishes its blocking op. In your
> approach, it seems that N servers have each a bunch of workers
> pointing at them, and a single worker running. If a previously blocked
> worker wakes up, it wakes the server it was assigned to previously,

Right; it does that. The woken server can then check the ::state of its
current task and possibly set TF_PREEMPT, or just go back to sleep.

> and so now we have more than N physical tasks/threads running: N
> workers and the woken server. This is not ideal: if the process is
> affined to only N CPUs, that means a worker will be preempted to let
> the woken server run, which is somewhat against the goal of letting
> the workers run more or less uninterrupted. This is not deal breaking,
> but maybe something to keep in mind.

I suppose it's easy enough to make this behaviour configurable though:
simply enqueue and don't wake... Hmm, but how would this worker know
whether the server was 'busy' or not? The whole 'current' thing is a
user-space construct. I suppose that's what your pointer was for: it
holds an actual idle server, if there is one. Let me ponder that a bit.

However, do note that this whole scheme fundamentally has some of that:
from the moment the syscall unblocks until sys_exit is 'unmanaged'
runtime for all tasks; they can consume however much time the syscall
needs there.

Also, a timeout on sys_umcg_wait() gets you the exact same situation
(or worse: multiple running workers).

> Another big concern I have is that you removed UMCG_TF_LOCKED. I

Ooh yes, I forgot to mention that. I couldn't figure out what it was
supposed to do.

> definitely needed it to guard workers during "sched work" in the
> userspace in my approach. I'm not sure if the flag is absolutely
> needed with your approach, but most likely it is - the kernel-side
> scheduler does lock tasks and runqueues and disables interrupts and
> migrations and other things so that the scheduling logic is not
> hijacked by concurrent stuff. Why do you assume that the userspace
> scheduling code does not need similar protections?
I've not yet come across a case where this is needed. Migration, for
instance, is possible while RUNNABLE: simply write ::server_tid before
::state (rough sketch below). Userspace just needs to keep track of
which server actually owns the task, but it can do that outside of this
state.

But like I said, I've not yet done the userspace part (and I lost most
of today trying to install a new machine), so perhaps I'll run into it
soon enough.
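To make that a bit more concrete, below is the sort of userspace
hand-off I have in mind, tying the ::server_tid/::state ordering
together with the enqueue-vs-wake question above. It is only a sketch
under assumptions: the struct layout, the state values, the
idle_server_tid slot and the umcg_wake() helper are made up here for
illustration and are not the ABI of either patch.

/*
 * Userspace sketch only; layout, state values and helper names are
 * illustrative assumptions, not what either patch defines.
 */
#include <stdatomic.h>
#include <stdint.h>

enum {
        UMCG_TASK_RUNNING  = 1,
        UMCG_TASK_RUNNABLE = 2,
        UMCG_TASK_BLOCKED  = 3,
};

struct umcg_task {
        _Atomic uint32_t state;         /* ::state      */
        _Atomic uint32_t server_tid;    /* ::server_tid */
        _Atomic uint32_t next_tid;      /* ::next_tid   */
};

/* "There is an idle server" slot; 0 means none. Made up for this sketch. */
static _Atomic uint32_t idle_server_tid;

/*
 * Placeholder for whatever wake primitive the final ABI ends up with,
 * e.g. sys_umcg_wait() with ::next_tid set to @tid.
 */
extern void umcg_wake(uint32_t tid);

/*
 * Hand a worker that just became runnable to a server.
 *
 * No lock: write ::server_tid first, then ::state with release, so
 * whoever observes RUNNABLE (with acquire) also sees the new owner.
 */
static void worker_make_runnable(struct umcg_task *w, uint32_t server_tid)
{
        atomic_store_explicit(&w->server_tid, server_tid,
                              memory_order_relaxed);
        atomic_store_explicit(&w->state, UMCG_TASK_RUNNABLE,
                              memory_order_release);

        /*
         * Enqueue vs wake: only kick a server if one has advertised
         * itself as idle; otherwise the worker stays parked on its
         * server's runnable list until that server next looks.
         */
        uint32_t idle = atomic_exchange_explicit(&idle_server_tid, 0,
                                                 memory_order_acq_rel);
        if (idle)
                umcg_wake(idle);
}

The point being that ownership is established purely by the order of
the two stores; the only shared mutable spot is the idle-server slot,
and a single exchange on that seems to be all the 'locking' the wakeup
path would need. Whether that really holds up, or whether something
like UMCG_TF_LOCKED comes back in, I'll only know once I've actually
written the userspace side.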