On Tue, 26 Sept 2023 at 16:49, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> > >> I don't see why we can't stick this directly into struct rseq because
> > >> it's all public anyway.
> > >
> > > The motivation for moving this to a different cache line is to handle
> > > the prior comment from Boqun, who is concerned that busy-waiting
> > > repeatedly loading a field from struct rseq will cause false sharing
> > > and make other stores to that cache line slower, especially stores to
> > > rseq_cs to begin rseq critical sections, thus slightly increasing the
> > > overhead of rseq critical sections taken while mutexes are held.
> > >
> > > If we want to embed this field into struct rseq with its own cache
> > > line, then we need to add a lot of padding, which is inconvenient.
> > >
> > > That being said, perhaps this is premature optimization, what do you
> > > think?
> >
> > Hi Mathieu, Florian,
> >
> > This is exciting!
> >
> > I thought the motivation for moving rseq_sched_state out of struct rseq
> > was the lifetime management problem. I assume that when a thread locks
> > a mutex, it stores a pointer to its rseq_sched_state in the mutex state
> > for other threads to poll. So the waiting thread would do something
> > along the following lines:
> >
> > rseq_sched_state *state = __atomic_load_n(&mutex->sched_state, __ATOMIC_RELAXED);
> > if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU))
> >         futex_wait();
> >
> > Now if the state is struct rseq, which is stored in TLS, then the
> > owning thread can unlock the mutex, exit, and unmap the TLS in between.
> > Consequently, the load of state->state will cause a page fault.
> >
> > And we do want rseq in TLS to save one indirection.
> >
> > If rseq_sched_state is separated from struct rseq, then it can be
> > allocated in type-stable memory that is never unmapped.
> >
> > What am I missing here?
> >
> > However, if we can store this state in struct rseq, then an alternative
> > interface would be for the kernel to do:
> >
> > rseq->cpu_id = -1;
> >
> > to denote that the thread is not running on any CPU.
> > I think it kinda makes sense: rseq->cpu_id is the thread's current CPU,
> > and -1 naturally means "not running at all". And we already store -1
> > right after init, so it shouldn't be a surprising value.
>
> As you may know, we experimented with "virtual CPUs" in tcmalloc. The
> extension allows the kernel to assign dense virtual CPU numbers to
> running threads instead of the real, sparse CPU numbers:
>
> https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/linux_syscall_support.h#L30-L41
>
> Recently I added another change that [ab]uses rseq in an interesting
> way. We want to get notifications about thread rescheduling. A bit
> simplified version of it is as follows: we don't use rseq.cpu_id_start
> for its original purpose, so instead we store something else there with
> the high bit set. Real CPU numbers don't have the high bit set (at
> least while you have fewer than 2B CPUs :)). This allows us to
> distinguish the value we stored in rseq.cpu_id_start from a real CPU id
> stored by the kernel. Inside an rseq critical section we check whether
> rseq.cpu_id_start has the high bit set, and if not, then we know that
> we were just rescheduled, so we can do some additional work and update
> rseq.cpu_id_start to have the high bit set.
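For illustration, here is a minimal user-space sketch of that check. The
names are made up for this mail (this is not the actual tcmalloc code), it
assumes the librseq-style __rseq_abi TLS symbol for the thread's
registered rseq area, and it elides the rseq critical section machinery
that makes the check and the update race-free:

#include <stdint.h>
#include <linux/rseq.h>

extern __thread struct rseq __rseq_abi; /* assumed: registered rseq area */

/* Never set in a real CPU number (until we get 2B CPUs). */
#define MARKER_BIT (UINT32_C(1) << 31)

static inline int was_rescheduled(void)
{
        /* High bit clear => the kernel overwrote our marker with a real
         * CPU number, i.e. we were rescheduled since we last armed it. */
        uint32_t v = __atomic_load_n(&__rseq_abi.cpu_id_start,
                                     __ATOMIC_RELAXED);
        return (v & MARKER_BIT) == 0;
}

static inline void arm_marker(uint32_t payload)
{
        /* Re-arm: store our own 31-bit payload with the high bit set. */
        __atomic_store_n(&__rseq_abi.cpu_id_start, payload | MARKER_BIT,
                         __ATOMIC_RELAXED);
}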
>
> In reality it's a bit more involved, since the field is actually 8
> bytes and only partially overlaps with rseq.cpu_id_start (it's an
> 8-byte pointer whose high 4 bytes overlap rseq.cpu_id_start):
>
> https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165
>
> I am wondering if we could extend the currently proposed interface in
> a way that would be more flexible and would satisfy all of these use
> cases (spinlocks, and the possibility of using virtual CPUs and
> rescheduling notifications). In the end they all need a very similar
> thing: the kernel writing some value at some user address when a
> thread is descheduled.
>
> The minimal support we need for tcmalloc is an 8-byte user address +
> the kernel writing 0 at that address when a thread is descheduled.
>
> The most flexible option to support multiple users
> (malloc/spinlocks/something else) would be as follows:
>
> User space passes an array of structs with an address + size (1/2/4/8
> bytes) + value. The kernel iterates over the array when the thread is
> descheduled and writes the specified value at the provided
> address/size. Something along the following lines (pseudo-code):
>
> struct rseq {
>         ...
>         struct rseq_desched_notif_t *desched_notifs;
>         int desched_notif_count;
> };
>
> struct rseq_desched_notif_t {
>         void *addr;
>         uint64_t value;
>         int size;
> };
>
> static inline void rseq_preempt(struct task_struct *t)
> {
>         ...
>         for (int i = 0; i < t->rseq->desched_notif_count; i++) {
>                 switch (t->rseq->desched_notifs[i].size) {
>                 case 1: put_user1(t->rseq->desched_notifs[i].addr,
>                                   t->rseq->desched_notifs[i].value);
>                         break;
>                 case 2: put_user2(t->rseq->desched_notifs[i].addr,
>                                   t->rseq->desched_notifs[i].value);
>                         break;
>                 case 4: put_user4(t->rseq->desched_notifs[i].addr,
>                                   t->rseq->desched_notifs[i].value);
>                         break;
>                 case 8: put_user8(t->rseq->desched_notifs[i].addr,
>                                   t->rseq->desched_notifs[i].value);
>                         break;
>                 }
>         }
> }

One thing I forgot to mention: ideally the kernel also writes a
timestamp of the descheduling somewhere. We are using this logic to
assign per-CPU malloc caches to threads, and it's useful to know which
caches were used very recently (still hot in cache) and which ones were
not used for a long time.
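One possible shape for that (purely hypothetical, just extending the
pseudo-code above) is a per-entry flag that asks the kernel to store a
clock reading instead of the fixed value:

/* Hypothetical flag: write the descheduling time instead of 'value'. */
#define RSEQ_DESCHED_NOTIF_TIMESTAMP 0x1

struct rseq_desched_notif_t {
        void *addr;
        uint64_t value;
        int size;
        int flags;      /* 0 or RSEQ_DESCHED_NOTIF_TIMESTAMP */
};

/* In rseq_preempt(), before the size-dispatched write: */
uint64_t val = t->rseq->desched_notifs[i].value;
if (t->rseq->desched_notifs[i].flags & RSEQ_DESCHED_NOTIF_TIMESTAMP)
        val = ktime_get_ns(); /* monotonic nanoseconds */

A malloc implementation could then register an 8-byte timestamp entry
per thread and compare the stored values across caches to find the ones
that have been idle the longest.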
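Going back to the mutex discussion at the top of the thread, the same
interface would also cover that case. In this purely hypothetical sketch
against the proposed fields, the lock owner registers a 1-byte "on_cpu"
flag that the kernel clears on descheduling, and waiters poll that flag
to decide between spinning and futex_wait():

static __thread uint8_t on_cpu; /* 1 while running; kernel writes 0 */

static void register_on_cpu_flag(struct rseq *rs)
{
        /* Hypothetical registration of a single notification entry. */
        static __thread struct rseq_desched_notif_t notif;

        notif.addr = &on_cpu;
        notif.value = 0; /* written by the kernel when we are descheduled */
        notif.size = 1;
        rs->desched_notifs = &notif;
        rs->desched_notif_count = 1;
        on_cpu = 1; /* we must set this back to 1 whenever we run again */
}

/* Waiter side: spin only while the lock owner is on a CPU. */
static int owner_is_running(const uint8_t *owner_on_cpu)
{
        return __atomic_load_n(owner_on_cpu, __ATOMIC_RELAXED) != 0;
}

Note that the same lifetime caveat applies as for struct rseq itself:
on_cpu here lives in the owner's TLS, so a waiter can fault on it after
the owner exits, and it would really need to live in type-stable memory
as discussed above.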