On Mon, Jul 12, 2021 at 2:44 PM Thierry Delisle <tdelisle@xxxxxxxxxxxx> wrote:
>
> > sys_umcg_wait without next_tid puts the task in UMCG_IDLE state; wake
> > wakes it. These are standard sched operations. If they are emulated
> > via futexes, fast context switching will require something like
> > FUTEX_SWAP that was NACKed last year.
>
> I understand these wait and wake semantics and the need for the fast
> context switch (swap). As I see it, you need 3 operations:
>
> - SWAP: context-switch directly to a different thread, no scheduler involved
> - WAIT: block the current thread, go back to the server thread
> - WAKE: unblock the target thread, add it to the scheduler, e.g. through
>   idle_workers_ptr
>
> There are no existing syscalls to handle SWAP, so I agree sys_umcg_wait is
> needed for this to work.
>
> However, there already exists sys_futex to handle WAIT and WAKE. When a
> worker calls either sys_futex WAIT or sys_umcg_wait with next_tid == NULL,
> in both cases the worker will block, SWAP to the server and wait for
> FUTEX_WAKE or UMCG_WAIT_WAKE_ONLY, respectively. It's not obvious to me
> that there would be a performance difference, and the semantics seem the
> same to me.
>
> So what I am asking is: is UMCG_WAIT_WAKE_ONLY needed?

Because the approach you described was tried last year and was NACKed:
https://lore.kernel.org/lkml/20200722234538.166697-1-posk@xxxxxxx/

In short, the futex maintainers do not want to touch the existing futex
code at all, other than for bugfixes. No new futex functionality, period.
See e.g. the futex2 efforts:
https://lore.kernel.org/lkml/20210603195924.361327-1-andrealmeid@xxxxxxxxxxxxx/

> Is the idea to support workers directly context-switching among each other,
> without involving server threads and without going through idle_servers_ptr?

Yes.

> If so, can you explain some of the intended state transitions in this case.

Cooperative scheduling: workers can yield to each other (i.e. swap). This
allows building very fast concurrency primitives (mutexes, producer-consumer
queues, etc.).

For example, in a producer-consumer queue, a producer prepares an item and
then context-switches to a consumer on-CPU, synchronously. This can be done
much faster than waking the consumer asynchronously, and it helps a lot to
reduce latency (throughput? not so much).
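In code, roughly (a hand-wavy sketch, not the actual uapi: the
sys_umcg_wait() signature, the syscall number and the abbreviated struct
layout below are assumptions for illustration; see the patchset for the
real definitions):

    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef __NR_umcg_wait
    #define __NR_umcg_wait 450                  /* placeholder number */
    #endif

    struct umcg_task {                          /* abbreviated sketch */
            uint64_t state;                     /* r/w */
            uint32_t next_tid;                  /* r   */
    };

    /* Set up when the thread registers via the umcg ctl syscall. */
    static __thread struct umcg_task *self;

    static long sys_umcg_wait(uint32_t flags, uint64_t abs_timeout)
    {
            return syscall(__NR_umcg_wait, flags, abs_timeout);
    }

    /* SWAP: context-switch directly to @next_tid; no server involved. */
    static long umcg_swap(uint32_t next_tid)
    {
            __atomic_store_n(&self->next_tid, next_tid, __ATOMIC_RELEASE);
            return sys_umcg_wait(0, 0);
    }

    struct item;                                /* whatever the queue holds */
    static struct item *slot;                   /* single-slot queue, for brevity */

    /* Producer: publish the item, then run the consumer on this CPU. */
    static void produce(struct item *item, uint32_t consumer_tid)
    {
            __atomic_store_n(&slot, item, __ATOMIC_RELEASE);
            umcg_swap(consumer_tid);            /* consumer picks up @slot now */
    }

The consumer runs immediately on the same CPU, without a trip through the
runqueues or a remote wakeup IPI - that's where the latency win comes from.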
> > > However, I do not understand how the userspace is expected to use it.
> > > I also do not understand if these link fields form a stack or a queue
> > > and where is the head.
> >
> > When a server has nothing to do (no work to run), it is put into IDLE
> > state and added to the list. The kernel wakes an IDLE server if a
> > blocked worker unblocks.
>
> From the code in umcg_wq_worker_running (Step 3), I am guessing users are
> expected to provide a global head somewhere in memory, and
> umcg_task.idle_servers_ptr points to the head of the list for all workers.
> Servers are then added in user space using atomic_stack_push_user. Is this
> correct? I did not find any documentation on the list head.

This is going to change somewhat. I'll post a new patchset soon-ish
(later this week?).

> I like the idea that each worker thread points to a given list, as it
> allows the possibility of separate containers with their own independent
> servers, workers and scheduling. However, it seems that the list itself
> could be implemented using existing kernel APIs, for example a futex or
> an eventfd. Like so:
>
> struct umcg_task {
>         [...]
>
>         /**
>          * @idle_futex_ptr: pointer to a futex used by idle server threads.
>          *
>          * When waking a worker, the kernel decrements the pointed-to
>          * futex value if it is non-zero, and wakes a server if the
>          * decrement occurred.
>          *
>          * Server threads that have no work to do should increment the
>          * futex value and FUTEX_WAIT.
>          */
>         uint64_t idle_futex_ptr;        /* r/w */
>
>         [...]
> } __attribute__((packed, aligned(8 * sizeof(__u64))));
>
> I believe the futex approach, like the list, has the advantage that when
> there are no idle servers, checking the list requires no locking. I don't
> know if that can be achieved with eventfd.

As mentioned above, a futex-based solution was rejected by the maintainers.
Believe me, I tried. Not only tried: we have a working userspace scheduling
stack on top of FUTEX_SWAP deployed internally at Google, with some actual
usage (mostly testing/debugging workloads).

I suggest we stop discussing futexes in this context - I do not see the
maintainers changing their position. And the approach used in this patchset
seems to work (and I actually like how the code is shaping up).
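FWIW, to make the difference from futexes concrete: in umcg all three of
the operations you listed go through the same syscall. Continuing the
sketch above (again, the flag value and the exact semantics of
next_tid == 0 are illustrative, not the final uapi):

    #define UMCG_WAIT_WAKE_ONLY 1               /* illustrative value */

    /* WAIT: block the current worker; its server is woken and can pick
     * up other work. */
    static long umcg_wait_idle(void)
    {
            __atomic_store_n(&self->next_tid, 0, __ATOMIC_RELEASE);
            return sys_umcg_wait(0, 0);
    }

    /* WAKE: mark @tid runnable and return immediately; the woken worker
     * is handed to a server, e.g. via idle_servers_ptr. */
    static long umcg_wake(uint32_t tid)
    {
            __atomic_store_n(&self->next_tid, tid, __ATOMIC_RELEASE);
            return sys_umcg_wait(UMCG_WAIT_WAKE_ONLY, 0);
    }

WAKE here does not block the caller - that is what UMCG_WAIT_WAKE_ONLY
buys us without adding anything to the futex code.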