On Tue, May 23, 2023 at 12:30 PM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> On 2023-05-23 12:32, Noah Goldstein wrote:
> > On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers
> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >>
> >> On 2023-05-19 16:51, Noah Goldstein wrote:
> >>> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
> >>> <libc-alpha@xxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> Expose the "on-cpu" state for each thread through struct rseq to allow
> >>>> adaptive mutexes to decide more accurately between busy-waiting and
> >>>> calling sys_futex() to release the CPU, based on the on-cpu state of the
> >>>> mutex owner.
> >>>>
> >>>> It is only provided as an optimization hint, because there is no
> >>>> guarantee that the page containing this field is in the page cache, and
> >>>> therefore the scheduler may very well fail to clear the on-cpu state on
> >>>> preemption. This is expected to be rare though, and is resolved as soon
> >>>> as the task returns to user-space.
> >>>>
> >>>> The goal is to improve use-cases where the duration of the critical
> >>>> sections for a given lock follows a multi-modal distribution, preventing
> >>>> statistical guesses from doing a good job at choosing between busy-wait
> >>>> and futex wait behavior.
> >>>>
> >>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> >>>> Cc: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> >>>> Cc: Jonathan Corbet <corbet@xxxxxxx>
> >>>> Cc: Steven Rostedt (Google) <rostedt@xxxxxxxxxxx>
> >>>> Cc: Carlos O'Donell <carlos@xxxxxxxxxx>
> >>>> Cc: Florian Weimer <fweimer@xxxxxxxxxx>
> >>>> Cc: libc-alpha@xxxxxxxxxxxxxx
> >>>> ---
> >>>>  include/linux/sched.h     | 12 ++++++++++++
> >>>>  include/uapi/linux/rseq.h | 17 +++++++++++++++++
> >>>>  kernel/rseq.c             | 14 ++++++++++++++
> >>>>  3 files changed, 43 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >>>> index eed5d65b8d1f..c7e9248134c1 100644
> >>>> --- a/include/linux/sched.h
> >>>> +++ b/include/linux/sched.h
> >>>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
> >>>>                 rseq_handle_notify_resume(ksig, regs);
> >>>>  }
> >>>>
> >>>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
> >>>> +
> >>>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> >>>> +{
> >>>> +       if (t->rseq)
> >>>> +               __rseq_set_sched_state(t, state);
> >>>> +}
> >>>> +
> >>>>  /* rseq_preempt() requires preemption to be disabled. */
> >>>>  static inline void rseq_preempt(struct task_struct *t)
> >>>>  {
> >>>>         __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
> >>>>         rseq_set_notify_resume(t);
> >>>> +       rseq_set_sched_state(t, 0);
> >>>
> >>> Should rseq_migrate also be made to update the cpu_id of the new core?
> >>> I imagine the usage of this will be something along the lines of:
> >>>
> >>>   if (!on_cpu(mutex->owner_rseq_struct) &&
> >>>       cpu(mutex->owner_rseq_struct) == this_threads_cpu)
> >>>       // goto futex
> >>>
> >>> So I would think updating on migrate would be useful as well.
> >>
> >> I don't think we want to act differently based on the cpu on which the
> >> owner is queued.
> >>
> >> If the mutex owner is not on-cpu, and queued on the same cpu as the
> >> current thread, we indeed want to call sys_futex WAIT.
> >>
> >> If the mutex owner is not on-cpu, but queued on a different cpu than the
> >> current thread, we *still* want to call sys_futex WAIT, because
> >> busy-waiting for a thread which is queued but not currently running is
> >> wasteful.
> >>
> > I think this is less clear. In some cases, sure, but not always. Going
> > to the futex has more latency than a userland wait and, if the system
> > is not busy (other than the one process), most likely less latency than
> > a yield. Also, going to the futex requires a syscall on unlock.
> >
> > For example, if the critical section is expected to be very short, it
> > is easy to imagine the lock being better implemented with:
> >
> >   while (is_locked)
> >       if (owner->on_cpu || owner->cpu != my_cpu)
> >           exponential backoff
> >       else
> >           yield
> >
> > It's not that "just go to futex" never makes sense, but I don't think
> > it's fair to say that's *always* the case.
> >
> > Looking at the kernel code, keeping the CPU field updated during
> > migration doesn't seem particularly costly, so it seems like a "why
> > not?" kind of question.
>
> We already have the owner rseq_abi cpu_id field populated on every
> return-to-userspace. I wonder if it's really relevant that migration
> populates an updated value in this field immediately? It's another case
> where this would be provided as a hint updated only if the struct rseq
> is in the page cache, because AFAIU the scheduler migration path cannot
> take a page fault.
>

Ah, that's a good point. And the probability that the page is in the
cache likely goes down a fair bit the longer the task sits idle or gets
bounced around.

> Also, if a thread bounces around many runqueues before being scheduled
> again, we would be adding those useless stores to the rseq_abi structure
> at each migration between runqueues.
>
> Given this would add some complexity to the scheduler migration code, I
> would want to see metrics/benchmarks showing that it indeed improves
> real-world use-cases before adding this to the rseq ABI.
>
> It's not only a question of added lines of code as of today, but also a
> question of added userspace ABI guarantees which can prevent future
> scheduler optimizations. I'm *very* careful about keeping those to a
> strict minimum, which I hope Peter Zijlstra appreciates.

Well, this entire thing is more a hint than a guarantee. Even on_cpu is
only updated if the page happens to be in the page cache, so I don't see
how you could ever *have* to do anything. But fair enough; I thought I'd
throw the idea out there, and enough valid concerns seem to make it not
such a good idea.

> Thanks,
>
> Mathieu
>
> >
> >> Or am I missing something?
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> >> https://www.efficios.com
> >>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
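
To make the policy under discussion concrete, below is a minimal
userspace sketch of an adaptive lock built on the proposed hint. It is
only an illustration under assumed names: RSEQ_SCHED_STATE_ON_CPU, the
owner_sched_state pointer, and how a waiter reaches the lock owner's
struct rseq are not the ABI from the patch. Because the scheduler may
fail to clear the on-cpu bit (page not in the page cache), the hint can
only change where we wait, never correctness.

    #include <stdatomic.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    /* Assumed flag layout for the proposed struct rseq sched_state field. */
    #define RSEQ_SCHED_STATE_ON_CPU (1U << 0)

    struct adaptive_mutex {
            _Atomic unsigned int locked;            /* 0 = free, 1 = held */
            const unsigned int *owner_sched_state;  /* hypothetical: points at
                                                       the owner's sched_state */
    };

    static void adaptive_lock(struct adaptive_mutex *m)
    {
            unsigned int free = 0;

            while (!atomic_compare_exchange_weak(&m->locked, &free, 1)) {
                    free = 0;
                    /*
                     * Spin only while the owner is hinted to be on-cpu;
                     * otherwise block in the kernel.  A stale hint merely
                     * makes us spin or sleep suboptimally.
                     */
                    if (m->owner_sched_state &&
                        (__atomic_load_n(m->owner_sched_state,
                                         __ATOMIC_RELAXED) &
                         RSEQ_SCHED_STATE_ON_CPU))
                            __builtin_ia32_pause();  /* x86 cpu_relax() */
                    else
                            syscall(SYS_futex, &m->locked, FUTEX_WAIT, 1,
                                    NULL, NULL, 0);
            }
    }

    static void adaptive_unlock(struct adaptive_mutex *m)
    {
            atomic_store(&m->locked, 0);
            /* Noah's point above: a futex slow path costs a syscall here. */
            syscall(SYS_futex, &m->locked, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

The unlock path shows the trade-off raised in the thread: once waiters
can block in FUTEX_WAIT, every unlock pays for a FUTEX_WAKE syscall
(unless a waiter count is tracked to elide it).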