On Thu, Jul 8, 2021 at 2:12 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Thu, Jul 8, 2021 at 9:46 PM Peter Oskolkov <posk@xxxxxxx> wrote:
> > Add helper functions to work atomically with userspace 32/64 bit values -
> > there are some .*futex.* named helpers, but they are not exactly
> > what is needed for UMCG; I haven't found what else I could use, so I
> > rolled these.
> >
> > At the moment only X86_64 is supported.
> >
> > Note: the helpers should probably go into arch/ somewhere; I have
> > them in kernel/sched/umcg.h temporarily for convenience. Please
> > let me know where I should put them and how to name them.
>
> Instead of open-coding spinlocks in userspace memory like this (which
> some of the reviewers will probably dislike because it will have
> issues around priority inversion and such), I wonder whether you could
> use an actual futex as your underlying locking primitive?
>
> The most straightforward way to do that would probably be to make the
> head structure in userspace look roughly like this?
>
> struct umcg_head {
>   u64 head_ptr;
>   u32 lock;
> };
>
> and then from kernel code, you could build a fastpath that directly
> calls cmpxchg_futex_value_locked() and build a fallback based on
> do_futex(), or something like that.
>
> There is precedent for using futex from inside the kernel to
> communicate with userspace: See mm_release(), which calls do_futex()
> with FUTEX_WAKE for the clear_child_tid feature.

Hi Jann,

Thanks for the note!

The approach you suggest will require locking every operation, I
believe, while in the scheme I have, pushes/inserts are lock-free if
there are no concurrent pops/deletes.
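To make the lock-free push claim concrete, here is a minimal sketch of a
stack with a compare-and-swap push. It is not the code from the patch: it
uses C11 atomics in an ordinary process instead of kernel accessors to
userspace memory, and the names idle_node, idle_stack, idle_stack_push and
idle_stack_pop_all are hypothetical. As a simplification, detaching the
whole stack is done here with a plain atomic exchange rather than under a
lock, and single-element pop (the operation that needs locking) is left
out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types, for illustration only (not from the UMCG patch). */
struct idle_node {
	struct idle_node *next;
	int id;
};

struct idle_stack {
	_Atomic(struct idle_node *) head;
};

/*
 * Lock-free push: link the new node to the head we observed, then try to
 * install it with a compare-and-swap. On failure, 'old' is reloaded with
 * the current head and the loop retries.
 */
static void idle_stack_push(struct idle_stack *s, struct idle_node *n)
{
	struct idle_node *old;

	assert(n != NULL);
	old = atomic_load_explicit(&s->head, memory_order_relaxed);
	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&s->head, &old, n,
							memory_order_release,
							memory_order_relaxed));
}

/*
 * Detach the entire stack in one step (simplification: atomic exchange,
 * no lock). Returns the old head; the caller walks the list via ->next.
 */
static struct idle_node *idle_stack_pop_all(struct idle_stack *s)
{
	return atomic_exchange_explicit(&s->head, (struct idle_node *)NULL,
					memory_order_acquire);
}
```

Concurrent pushers only ever retry the compare-and-swap; none of them
waits on a lock held by another thread.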
The kernel mostly does pushes (waking workers, and there can be a lot of
workers), while pops are rare: the pops are of idle servers, there is no
reason for the number of servers to substantially exceed the number of
CPUs, and any contention here will be very short-lived. The userspace, for
its part, pops the entire stack of idle workers in one op, so it holds the
lock only briefly as well. So I think my approach scales better.

And priority inversion should not matter here, because this is for
userspace scheduling, so the userspace scheduler should worry about
it, not the kernel.

Am I missing something?

Thanks,
Peter