On 9/28/23 15:20, Mathieu Desnoyers wrote:
On 9/28/23 07:22, David Laight wrote:
From: Peter Zijlstra
Sent: 28 September 2023 11:39
On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
Expose the "on-cpu" state for each thread through struct rseq to allow
adaptive mutexes to decide more accurately between busy-waiting and
calling sys_futex() to release the CPU, based on the on-cpu state of the
mutex owner.
Are you trying to avoid spinning when the owning process is sleeping?
Yes, this is my main intent.
Or trying to avoid the system call when it will find that the futex
is no longer held?
The latter is really horribly detrimental.
That's a good question. What should we do in those three situations
when trying to grab the lock:
1) Lock has no owner:
We probably want to simply grab the lock with an atomic instruction.
But then if other threads are queued on sys_futex and did not manage
to grab the lock yet, this would be detrimental to fairness.
2) Lock owner is running:
The lock owner is certainly running on another cpu (I'm using the term
"cpu" here as logical cpu).
I guess we could either decide to bypass sys_futex entirely and try to
grab the lock with an atomic, or we go through sys_futex nevertheless
to allow futex to guarantee some fairness across threads.
About the fairness part:
Even if you enqueue everyone, the futex syscall doesn't provide any
guarantee about the order of the wake. The current implementation tries
to be fair, but I don't think it works for every case. I wouldn't be
much concerned about being fair here, given that it's an inherent
problem of the futex anyway.
From the man pages:
"No guarantee is provided about which waiters are awoken"
3) Lock owner is sleeping:
The lock owner may be either tied to the same cpu as the requester, or
a different cpu. Here calling FUTEX_WAIT and friends is pretty much
required.
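
The three cases above can be sketched as a small decision function. This
is only an illustration under stated assumptions: the names
choose_action(), enum lock_action and NO_OWNER are hypothetical, and
owner_on_cpu stands in for whatever on-cpu bit the proposed rseq field
would expose. Spinning in case (2) is one of the two options discussed,
not a settled choice.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical actions a lock slow path could take. */
enum lock_action {
    LOCK_TRY_ATOMIC,  /* case 1: no owner, try to grab with an atomic */
    LOCK_SPIN,        /* case 2: owner is on a cpu, busy-wait briefly */
    LOCK_FUTEX_WAIT   /* case 3: owner is sleeping, block in the kernel */
};

/* Assumption: owner tid 0 means "unlocked". */
#define NO_OWNER 0u

/*
 * owner_on_cpu is assumed to be read from the owner's on-cpu hint.
 * It is only a hint, so a stale value must never affect correctness,
 * only the choice between spinning and sleeping.
 */
static enum lock_action choose_action(uint32_t owner_tid, bool owner_on_cpu)
{
    if (owner_tid == NO_OWNER)
        return LOCK_TRY_ATOMIC;
    if (owner_on_cpu)
        return LOCK_SPIN;
    return LOCK_FUTEX_WAIT;
}
```

A caller would re-evaluate this on every iteration of its slow path,
since the owner can be preempted or release the lock at any time.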
Can you elaborate on why skipping sys_futex in scenario (2) would be
so bad ? I wonder if we could get away with skipping futex entirely in
this scenario and still guarantee fairness by implementing MCS locking
or ticket locks in userspace. Basically, if userspace queues itself on
the lock through either MCS locking or ticket locks, it could
guarantee fairness on its own.
Of course things are more complicated with PI-futex, is that what you
have in mind ?
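
To make the ticket-lock idea concrete, here is a minimal sketch using C11
atomics. It is a plain spinning ticket lock, not code from the patch; the
struct and function names are made up for illustration. A real adaptive
version would presumably stop spinning and call FUTEX_WAIT when the owner
is not on a cpu, which this sketch leaves out.

```c
#include <stdatomic.h>

/* Hypothetical userspace ticket lock: waiters are served in FIFO order. */
struct ticket_lock {
    atomic_uint next;     /* next ticket to hand out */
    atomic_uint serving;  /* ticket currently allowed to hold the lock */
};

static void ticket_lock_init(struct ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->serving, 0);
}

/* Take a ticket, then wait until it is our turn. Returns the ticket. */
static unsigned ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned me = atomic_fetch_add_explicit(&l->next, 1,
                                            memory_order_relaxed);

    while (atomic_load_explicit(&l->serving, memory_order_acquire) != me)
        ;  /* busy-wait; an adaptive lock would sleep here if needed */
    return me;
}

/* Only the current holder calls this, so a plain load + store suffices. */
static void ticket_lock_release(struct ticket_lock *l)
{
    unsigned cur = atomic_load_explicit(&l->serving, memory_order_relaxed);

    atomic_store_explicit(&l->serving, cur + 1, memory_order_release);
}
```

Fairness comes from the ticket order alone: a thread arriving late cannot
overtake one that already holds a lower ticket, regardless of what the
futex wake order would have been.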
It is only provided as an optimization hint, because there is no
guarantee that the page containing this field is in the page cache, and
therefore the scheduler may very well fail to clear the on-cpu state on
preemption. This is expected to be rare though, and is resolved as soon
as the task returns to user-space.
The goal is to improve use-cases where the duration of the critical
sections for a given lock follows a multi-modal distribution, preventing
statistical guesses from doing a good job at choosing between busy-wait
and futex wait behavior.
As always, are syscalls really *that* expensive? Why can't we busy wait
in the kernel instead?
I mean, sure, meltdown sucked, but most people should now be running
chips that are not affected by that particular horror show, no?
IIRC 'page table separation', which is what makes system calls expensive,
is only a compile-time option, so it is likely to be enabled on any
'distro' kernel.
But a lot of other mitigations (e.g. RSB stuffing) are also pretty
detrimental.
OTOH if you have a 'hot' userspace mutex you are going to lose whatever.
All that needs to happen is for an Ethernet interrupt to decide to discard
completed transmits and refill the rx ring, and then for the softint code
to free a load of stuff deferred by RCU while you've grabbed the mutex,
and no matter how short the user-space code path, the mutex won't be
released for absolutely ages.
I had to change a load of code to use arrays and atomic increments
to avoid delays acquiring the mutex.
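
The code in question is not shown in this thread, so the following is
only a guess at the general shape of "arrays and atomic increments": each
producer claims a distinct array slot with a single atomic fetch-add
instead of taking a mutex. All names (event_log, event_log_push, SLOTS)
are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SLOTS 1024  /* assumed fixed capacity for the sketch */

struct event_log {
    atomic_size_t next;  /* index of the next free slot */
    int slots[SLOTS];
};

/*
 * Claim a slot without locking. The fetch-add gives every caller a
 * unique index, so no two threads write the same slot.
 * Returns the slot index, or -1 when the array is full.
 */
static int event_log_push(struct event_log *log, int value)
{
    size_t idx = atomic_fetch_add_explicit(&log->next, 1,
                                           memory_order_relaxed);

    if (idx >= SLOTS)
        return -1;
    log->slots[idx] = value;
    return (int)idx;
}
```

A producer can never be stalled behind a preempted lock holder here,
because there is no lock to hold. Readers still need their own way to
know when a claimed slot has actually been written, which this sketch
does not address.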
That's good input, thanks! I mostly defer to André Almeida on the
use-case motivation. I mostly provided this POC patch to show that it
_can_ be done with sys_rseq(2).
Thanks!
Mathieu
David