Hello,

I would like to quickly check whether creating a per_cpu rt_bandwidth struct would make sense.

The def_rt_bandwidth struct at:
https://github.com/torvalds/linux/blob/ca1fdab7fd27eb069df1384b2850dcd0c2bebe8d/kernel/sched/rt.c#L13
is used to limit the bandwidth of rt tasks. To do so, a timer with a period of sysctl_sched_rt_period runs whenever an rt task is running.

When enqueueing a task on an rt_rq, def_rt_bandwidth's lock must be taken to check whether the bandwidth timer is already running (cf. rt_period_active):
https://github.com/torvalds/linux/blob/ca1fdab7fd27eb069df1384b2850dcd0c2bebe8d/kernel/sched/rt.c#L102

The call graph is:
  inc_rt_tasks()
  -> inc_rt_group()
  -> start_rt_bandwidth()
  -> do_start_rt_bandwidth()
(a simplified view of this path is appended at the end of this mail)

def_rt_bandwidth is shared among all CPUs, so contention on its lock grows with the number of CPUs running rt tasks. An example of a long lock acquisition, captured while running the following cyclictest command:

  cyclictest -l20000 -m -t32 -a -i100 -d0 -p1 -q

  273.278506 |               |  enqueue_task_rt() {
  273.278506 |               |    dequeue_rt_stack() {
  273.278506 |   0.160 us    |      dequeue_top_rt_rq();
  273.278507 |   0.480 us    |    }
  273.278507 |   0.280 us    |    cpupri_set();
  273.278507 |   0.120 us    |    update_rt_migration();
  273.278511 |               |    _raw_spin_lock() {
  273.278516 | + 17.120 us   |      queued_spin_lock_slowpath();
  273.278534 | + 22.840 us   |    }
  273.278534 |   0.120 us    |    enqueue_top_rt_rq();
  273.278534 | + 28.360 us   |  }

This can also be seen when running the above cyclictest command with an increasing number of threads (threads are spawned on different CPUs). On an Ampere Altra with 160 CPUs, there is a strong correlation between the average latency and the number of threads spawned:

  #threads : latency (us)
      1-32 : ~5
        40 : ~80
       100 : ~220

Making def_rt_bandwidth a per_cpu (or per-rt_rq) variable would make it possible to have per_cpu bandwidth timers, and would thus reduce the contention when checking whether a bandwidth timer is already running. A raw implementation making def_rt_bandwidth a per_cpu variable shows the following improvement (a rough sketch of this direction is also appended below):

  #threads : latency (us), before -> after
        50 : ~11 -> ~7
       100 : ~25 -> ~12
       200 : ~50 -> ~18

The above was tested on a ThunderX2. Tests were done with preemption enabled, with the default bandwidth values (runtime=0.95s, period=1s). Note that setting sched_rt_runtime_us=-1 avoids accessing the def_rt_bandwidth timer altogether and thus also greatly improves the latency.

As the CONFIG_RT_GROUP_SCHED option already makes it possible to have per-task_group bandwidth timers, I assume making rt_bandwidth a per_cpu struct is also sensible. Please let me know if this is not the case.

Regards,
Pierre
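
---

For reference, the lock showing up in the trace above is taken on the enqueue path in start_rt_bandwidth(). Simplified (comments trimmed) the path in the linked rt.c looks roughly like this:

	static void do_start_rt_bandwidth(struct rt_bandwidth *rt_b)
	{
		if (!rt_b->rt_period_active) {
			rt_b->rt_period_active = 1;
			hrtimer_forward_now(&rt_b->rt_period_timer, ns_to_ktime(0));
			hrtimer_start_expires(&rt_b->rt_period_timer,
					      HRTIMER_MODE_ABS_PINNED_HARD);
		}
	}

	static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
	{
		if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
			return;

		/* Lock shared by all CPUs with the default (!RT_GROUP_SCHED) config. */
		raw_spin_lock(&rt_b->rt_runtime_lock);
		do_start_rt_bandwidth(rt_b);
		raw_spin_unlock(&rt_b->rt_runtime_lock);
	}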
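The per_cpu direction I have in mind looks roughly like the following. This is only a sketch to illustrate the idea inside kernel/sched/rt.c, not the raw implementation that produced the numbers above; it glosses over initializing the per_cpu instances (e.g. a for_each_possible_cpu() loop in sched_init()) and over the sysctl/period-timer paths that would have to walk the per_cpu copies:

	/* One bandwidth state (and hence one period timer) per CPU. */
	DEFINE_PER_CPU(struct rt_bandwidth, def_rt_bandwidth);

	#ifndef CONFIG_RT_GROUP_SCHED
	static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
	{
		/* Each rt_rq now only touches its own CPU's bandwidth state. */
		return &per_cpu(def_rt_bandwidth, cpu_of(rq_of_rt_rq(rt_rq)));
	}
	#endif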