----- On Feb 1, 2022, at 2:49 PM, Peter Oskolkov posk@xxxxxxx wrote:

> On Tue, Feb 1, 2022 at 11:26 AM Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>
>> This feature allows the scheduler to expose a current virtual cpu id
>> to user-space. This virtual cpu id is within the possible cpus range,
>> and is temporarily (and uniquely) assigned while threads are actively
>> running within a thread group. If a thread group has fewer threads than
>> cores, or is limited to run on few cores concurrently through sched
>> affinity or cgroup cpusets, the virtual cpu ids will be values close
>> to 0, thus allowing efficient use of user-space memory for per-cpu
>> data structures.
>
> Why per thread group and not per mm? The main use case is for
> per-(v)cpu memory allocation logic, so it seems having this feature
> per mm is more appropriate?

Good point, yes, per-mm would be more appropriate. So I guess that from
a userspace perspective, the rseq field could become:

    __u32 vm_vcpu;  /* Current vcpu within memory space. */

[...]

>> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>> index b6ecb9fc4cd2..c87e7ad5a1ea 100644
>> --- a/include/linux/sched/signal.h
>> +++ b/include/linux/sched/signal.h
>> @@ -244,6 +244,12 @@ struct signal_struct {
>>                                           * and may have inconsistent
>>                                           * permissions.
>>                                           */
>> +#ifdef CONFIG_SCHED_THREAD_GROUP_VCPU
>> +    /*
>> +     * Mask of allocated vcpu ids within the thread group.
>> +     */
>> +    cpumask_t       vcpu_mask;
>
> We use a pointer for the mask (in struct mm). Adds complexity around
> alloc/free, though. Just FYI.

It does make sense if this is opt-in.

[...]

>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2e4ae00e52d1..2690e80977b1 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4795,6 +4795,8 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>>      sched_info_switch(rq, prev, next);
>>      perf_event_task_sched_out(prev, next);
>>      rseq_preempt(prev);
>> +    tg_vcpu_put(prev);
>> +    tg_vcpu_get(next);
>
> Doing this for all tasks on all context switches will most likely be
> too expensive. We do it only for tasks that explicitly asked for this
> feature during their rseq registration, and still the tight loop in
> our equivalent of tg_vcpu_get() is occasionally noticeable (lots of
> short wakeups can lead to the loop thrashing around).
>
> Again, our approach is more complicated as a result.

I suspect that the overhead of tg_vcpu_get() is quite small for
processes which work on only a few cores, but becomes noticeable when
processes have many threads and are massively parallel (not affined to
only a few cores).

When the feature is disabled, we can always fall back on the value
returned by raw_smp_processor_id() and use that as the "vm-vcpu-id"
value. Whether the vm-vcpu-id or the processor id is used needs to be a
consensus across all threads from all processes using a mm at a given
time.
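For reference, the reservation Peter refers to is essentially a
find-first-zero-bit retry loop over the thread group's vcpu_mask. A rough
sketch of what tg_vcpu_get()/tg_vcpu_put() can look like, assuming the
vcpu_mask field from the signal_struct hunk above (the tg_vcpu task field
name is illustrative, not necessarily what the patch ends up using):

    #include <linux/cpumask.h>
    #include <linux/sched.h>
    #include <linux/sched/signal.h>

    static inline void tg_vcpu_get(struct task_struct *t)
    {
            struct cpumask *cpumask = &t->signal->vcpu_mask;
            unsigned int vcpu;

            if (t->flags & PF_KTHREAD)
                    return;
            /* Atomically reserve the lowest free vcpu id in the group. */
            do {
                    vcpu = cpumask_first_zero(cpumask);
                    WARN_ON_ONCE(vcpu >= nr_cpu_ids);
            } while (cpumask_test_and_set_cpu(vcpu, cpumask));
            t->tg_vcpu = vcpu;      /* illustrative task_struct field */
    }

    static inline void tg_vcpu_put(struct task_struct *t)
    {
            if (t->flags & PF_KTHREAD)
                    return;
            /* Release the id so a waking thread can reuse it. */
            cpumask_clear_cpu(t->tg_vcpu, &t->signal->vcpu_mask);
    }

The retry in tg_vcpu_get() is the "tight loop" Peter mentions: with lots of
short wakeups, many cpus race for the same low ids, so the loop can thrash
on massively parallel workloads.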
There appears to be a tradeoff here, and I wonder how this should be
presented to users. A few possible options:

- the vm-vcpu feature is opt-in (default off) or opt-out (default on),

- whether vm-vcpu is enabled for a process could be selected at runtime
  by the process, either at process initialization (single thread,
  single mm user) and/or while the process is multi-threaded (requires
  more synchronization),

- if we find a way to move automatically between vm-vcpu-id and
  processor id as the information source for all threads tied to a mm
  when we reach a threshold number of parallel threads, then I suspect
  we could have the best of both worlds. But it's not clear to me how
  to achieve this.

Thoughts?

Thanks,

Mathieu

>
>>      fire_sched_out_preempt_notifiers(prev, next);
>>      kmap_local_sched_out();
>>      prepare_task(next);
>> --
>> 2.17.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com