Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq

Dmitry Vyukov <dvyukov@xxxxxxxxxx> · Tue, 26 Sep 2023 22:52:15 +0200

>> I don't see why we can't stick this directly into struct rseq because
>> it's all public anyway.
>
> The motivation for moving this to a different cache line is to handle 
> the prior comment from Boqun, who is concerned that busy-waiting 
> repeatedly loading a field from struct rseq will cause false-sharing and 
> make other stores to that cache line slower, especially stores to 
> rseq_cs to begin rseq critical sections, thus slightly increasing the 
> overhead of rseq critical sections taken while mutexes are held.
>
> If we want to embed this field into struct rseq with its own cache line, 
> then we need to add a lot of padding, which is inconvenient.
>
> That being said, perhaps this is premature optimization, what do you think ?

Hi Mathieu, Florian,

This is exciting!

I thought the motivation for moving rseq_sched_state out of struct rseq
is lifetime management problem. I assume when a thread locks a mutex,
it stores pointer to rseq_sched_state in the mutex state for other
threads to poll. So the waiting thread would do something along the following
lines:

rseq_sched_state* state = __atomic_load_n(mutex->sched_state, __ATOMIC_RELAXED);
if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU))
	futex_wait();

Now if the state is struct rseq, which is stored in TLS,
then the owning thread can unlock the mutex, exit and unmap TLS in between.
Consequently, load of state->state will cause a paging fault.

And we do want rseq in TLS to save 1 indirection.

If rseq_sched_state is separated from struct rseq, then it can be allocated
in type stable memory that is never unmapped.

What am I missing here?

However, if we can store this state in struct rseq, then an alternative
interface would for the kernel to do:

rseq->cpu_id = -1;

to denote that the thread is not running on any CPU.
I think it kinda makes sense, rseq->cpu_id is the thread's current CPU,
and -1 naturally means "not running at all". And we already store -1
right after init, so it shouldn't be a surprising value.