Sorry, I haven't been feeling too well and as such procastinated on this because thinking is required :/ Trying to pick up the bits. On Mon, Nov 29, 2021 at 03:38:38PM -0800, Peter Oskolkov wrote: > On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > [...] > > > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I > > > > > > > > OOh yes, I forgot to mention that. I couldn't figure out what it was > > > > supposed to do. > [...] > > > > So then A does: > > > > A::next_tid = C.tid; > > sys_umcg_wait(); > > > > Which will: > > > > pin(A); > > pin(S0); > > > > cmpxchg(A::state, RUNNING, RUNNABLE); > > Hmm.... That's another difference between your patch and mine: my > approach was "the side that initiates the change updates the state". > So in my code the userspace changes the current task's state RUNNING > => RUNNABLE and the next task's state, I couldn't make that work for wakeups; when a thread blocks in a random syscall there is no userspace to wake the next thread. And since it seems required in this case, it's easier and more consistent to always do it. > or the server's state, RUNNABLE > => RUNNING before calling sys_umcg_wait(). Yes, this is indeed required; I've found the same when trying to build the userspace server loop. And yes, I'm starting to see where you're coming from. > I'm still not sure we can live without UMCG_TF_LOCKED. What if worker > A transfers its server to worker B that A intends to context switch S0 running A Therefore: S0::state == RUNNABLE A::server_tid = S0.tid A::state == RUNNING you want A to switch to B, therefore: B::state == RUNNABLE if B is not yet on S0 then: B::server_tid = S0.tid; finally: 0: A::next_tid = B.tid; 1: A::state = RUNNABLE: 2: sys_umcg_wait(); 3: > into, and then worker A pagefaults or gets interrupted before calling > sys_umcg_wait()? So the problem is tripping umcg_notify_resume() on the labels 1 and 2, right? tripping it on 0 and 3 is trivially correct. If we trip it on 1 and !(A::state & TG_PREEMPT), then nothing, since ::state == RUNNING we'll just continue onwards and all is well. That is, nothing has happened yet. However, if we trip it on 2: we're screwed. Because at that point ::state is scribbled. > The server will be woken up and will see that it is > assigned to worker B; now what? If worker A is "locked" before the > whole thing starts, the pagefault/interrupt will not trigger > block/wake detection, worker A will keep RUNNING for all intended > purposes, and eventually will call sys_umcg_wait() as it had > intended... No, the failure case is different; umcg_notify_resume() will simply block A until someone sets A::state == RUNNING and kicks it, which will be no-one. Now, the above situation is actually simple to fix, but it gets more interesting when we're using sys_umcg_wait() to build wait primitives. Because in that case we get stuff like: for (;;) { self->state = RUNNABLE; smp_mb(); if (cond) break; sys_umcg_wait(); } self->state = RUNNING; And we really need to not block and also not do sys_umcg_wait() early. So yes, I agree that we need a special case here that ensures umcg_notify_resume() doesn't block. Let me ponder naming and comments. Either a TF_COND_WAIT or a whole new state. I can't decide yet. Now, obviously if you do a random syscall anywhere around here, you get to keep the pieces :-) I've also added ::next_tid to the whole umcg_pin_pages() thing, and made it so that ::next_tid gets cleared when it's been used. That way things like: self->next_tid = pick_from_runqueue(); sys_that_is_expected_to_sleep(); if (self->next_tid) { return_to_runqueue(self->next_tid); self->next_tid = 0; } Are much simpler to manage. Either it did sleep and ::next_tid is consumed, or it didn't sleep and it needs to be returned to the runqueue.