On Tue, 26 Sept 2023 at 16:49, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> > >> I don't see why we can't stick this directly into struct rseq because
> > >> it's all public anyway.
> > >
> > > The motivation for moving this to a different cache line is to handle
> > > the prior comment from Boqun, who is concerned that busy-waiting
> > > repeatedly loading a field from struct rseq will cause false sharing
> > > and make other stores to that cache line slower, especially stores to
> > > rseq_cs to begin rseq critical sections, thus slightly increasing the
> > > overhead of rseq critical sections taken while mutexes are held.
> > >
> > > If we want to embed this field into struct rseq with its own cache
> > > line, then we need to add a lot of padding, which is inconvenient.
> > >
> > > That being said, perhaps this is premature optimization, what do you
> > > think?
> >
> > Hi Mathieu, Florian,
> >
> > This is exciting!
> >
> > I thought the motivation for moving rseq_sched_state out of struct rseq
> > was the lifetime management problem. I assume that when a thread locks
> > a mutex, it stores a pointer to its rseq_sched_state in the mutex state
> > for other threads to poll. So the waiting thread would do something
> > along the following lines:
> >
> > rseq_sched_state *state = __atomic_load_n(&mutex->sched_state, __ATOMIC_RELAXED);
> > if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU))
> >         futex_wait();
> >
> > Now if the state is struct rseq, which is stored in TLS, then the
> > owning thread can unlock the mutex, exit, and unmap the TLS in between.
> > Consequently, the load of state->state will cause a page fault.
> >
> > And we do want rseq in TLS to save one indirection.
> >
> > If rseq_sched_state is separated from struct rseq, then it can be
> > allocated in type-stable memory that is never unmapped.
> >
> > What am I missing here?
> >
> > However, if we can store this state in struct rseq, then an alternative
> > interface would be for the kernel to do:
> >
> > rseq->cpu_id = -1;
> >
> > to denote that the thread is not running on any CPU.
> > I think it kinda makes sense: rseq->cpu_id is the thread's current CPU,
> > and -1 naturally means "not running at all". And we already store -1
> > right after init, so it shouldn't be a surprising value.
>
> As you may know, we experimented with "virtual CPUs" in tcmalloc. The
> extension allows the kernel to assign dense virtual CPU numbers to
> running threads instead of the real, sparse CPU numbers:
>
> https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/linux_syscall_support.h#L30-L41
>
> Recently I added another change that [ab]uses rseq in an interesting
> way. We want to get notifications about thread rescheduling. A bit
> simplified version of it is as follows: we don't use rseq.cpu_id_start
> for its original purpose, so instead we store something else there with
> the high bit set. Real CPU numbers don't have the high bit set (at
> least while you have fewer than 2B CPUs :)). This allows us to
> distinguish the value we stored in rseq.cpu_id_start from a real CPU id
> stored by the kernel. Inside an rseq critical section we check whether
> rseq.cpu_id_start has the high bit set, and if not, then we know that
> we were just rescheduled, so we can do some additional work and update
> rseq.cpu_id_start to have the high bit set.
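For illustration, here is a minimal user-space sketch of that check. The
names are made up for this mail (this is not the actual tcmalloc code), it
assumes the librseq-style __rseq_abi TLS symbol for the thread's
registered rseq area, and it elides the rseq critical section machinery
that makes the check and the update race-free:

#include <stdint.h>
#include <linux/rseq.h>

extern __thread struct rseq __rseq_abi; /* assumed: registered rseq area */

/* Never set in a real CPU number (until we get 2B CPUs). */
#define MARKER_BIT (UINT32_C(1) << 31)

static inline int was_rescheduled(void)
{
        /* High bit clear => the kernel overwrote our marker with a real
         * CPU number, i.e. we were rescheduled since we last armed it. */
        uint32_t v = __atomic_load_n(&__rseq_abi.cpu_id_start,
                                     __ATOMIC_RELAXED);
        return (v & MARKER_BIT) == 0;
}

static inline void arm_marker(uint32_t payload)
{
        /* Re-arm: store our own 31-bit payload with the high bit set. */
        __atomic_store_n(&__rseq_abi.cpu_id_start, payload | MARKER_BIT,
                         __ATOMIC_RELAXED);
}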
>
> In reality it's a bit more involved, since the field is actually 8
> bytes and only partially overlaps with rseq.cpu_id_start (it's an
> 8-byte pointer whose high 4 bytes overlap rseq.cpu_id_start):
>
> https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165
>
> I am wondering if we could extend the currently proposed interface in
> a way that would be more flexible and would satisfy all of these use
> cases (spinlocks, and the possibility of using virtual CPUs and
> rescheduling notifications). In the end they all need a very similar
> thing: the kernel writing some value at some user address when a
> thread is descheduled.
>
> The minimal support we need for tcmalloc is an 8-byte user address +
> the kernel writing 0 at that address when a thread is descheduled.
>
> The most flexible option to support multiple users
> (malloc/spinlocks/something else) would be as follows:
>
> User space passes an array of structs with an address + size (1/2/4/8
> bytes) + value. The kernel iterates over the array when the thread is
> descheduled and writes the specified value at the provided
> address/size. Something along the following lines (pseudo-code):
>
> struct rseq {
>         ...
>         struct rseq_desched_notif_t *desched_notifs;
>         int desched_notif_count;
> };
>
> struct rseq_desched_notif_t {
>         void *addr;
>         uint64_t value;
>         int size;
> };
>
> static inline void rseq_preempt(struct task_struct *t)
> {
>         ...
>         for (int i = 0; i < t->rseq->desched_notif_count; i++) {
>                 switch (t->rseq->desched_notifs[i].size) {
>                 case 1: put_user1(t->rseq->desched_notifs[i].addr,
>                                   t->rseq->desched_notifs[i].value);
>                         break;
>                 case 2: put_user2(t->rseq->desched_notifs[i].addr,
>                                   t->rseq->desched_notifs[i].value);
>                         break;
>                 case 4: put_user4(t->rseq->desched_notifs[i].addr,
>                                   t->rseq->desched_notifs[i].value);
>                         break;
>                 case 8: put_user8(t->rseq->desched_notifs[i].addr,
>                                   t->rseq->desched_notifs[i].value);
>                         break;
>                 }
>         }
> }

One thing I forgot to mention: ideally the kernel also writes a
timestamp of the descheduling somewhere. We are using this logic to
assign per-CPU malloc caches to threads, and it's useful to know which
caches were used very recently (still hot in cache) and which ones were
not used for a long time.
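One possible shape for that (purely hypothetical, just extending the
pseudo-code above) is a per-entry flag that asks the kernel to store a
clock reading instead of the fixed value:

/* Hypothetical flag: write the descheduling time instead of 'value'. */
#define RSEQ_DESCHED_NOTIF_TIMESTAMP 0x1

struct rseq_desched_notif_t {
        void *addr;
        uint64_t value;
        int size;
        int flags;      /* 0 or RSEQ_DESCHED_NOTIF_TIMESTAMP */
};

/* In rseq_preempt(), before the size-dispatched write: */
uint64_t val = t->rseq->desched_notifs[i].value;
if (t->rseq->desched_notifs[i].flags & RSEQ_DESCHED_NOTIF_TIMESTAMP)
        val = ktime_get_ns(); /* monotonic nanoseconds */

A malloc implementation could then register an 8-byte timestamp entry
per thread and compare the stored values across caches to find the ones
that have been idle the longest.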
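Going back to the mutex discussion at the top of the thread, the same
interface would also cover that case. In this purely hypothetical sketch
against the proposed fields, the lock owner registers a 1-byte "on_cpu"
flag that the kernel clears on descheduling, and waiters poll that flag
to decide between spinning and futex_wait():

static __thread uint8_t on_cpu; /* 1 while running; kernel writes 0 */

static void register_on_cpu_flag(struct rseq *rs)
{
        /* Hypothetical registration of a single notification entry. */
        static __thread struct rseq_desched_notif_t notif;

        notif.addr = &on_cpu;
        notif.value = 0; /* written by the kernel when we are descheduled */
        notif.size = 1;
        rs->desched_notifs = &notif;
        rs->desched_notif_count = 1;
        on_cpu = 1; /* we must set this back to 1 whenever we run again */
}

/* Waiter side: spin only while the lock owner is on a CPU. */
static int owner_is_running(const uint8_t *owner_on_cpu)
{
        return __atomic_load_n(owner_on_cpu, __ATOMIC_RELAXED) != 0;
}

Note that the same lifetime caveat applies as for struct rseq itself:
on_cpu here lives in the owner's TLS, so a waiter can fault on it after
the owner exits, and it would really need to live in type-stable memory
as discussed above.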