On 11/02/2025 10:21, Juri Lelli wrote:
> On 11/02/25 09:36, Dietmar Eggemann wrote:
>> On 10/02/2025 18:09, Juri Lelli wrote:
>>> Hi Christian,
>>>
>>> Thanks for taking a look as well.
>>>
>>> On 07/02/25 15:55, Christian Loehle wrote:
>>>> On 2/7/25 14:04, Jon Hunter wrote:
>>>>>
>>>>> On 07/02/2025 13:38, Dietmar Eggemann wrote:
>>>>>> On 07/02/2025 11:38, Jon Hunter wrote:
>>>>>>>
>>>>>>> On 06/02/2025 09:29, Juri Lelli wrote:
>>>>>>>> On 05/02/25 16:56, Jon Hunter wrote:
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>>> Thanks! That did make it easier :-)
>>>>>>>>>
>>>>>>>>> Here is what I see ...
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Still different from what I can repro over here, so, unfortunately, I
>>>>>>>> had to add additional debug printks. Pushed to the same branch/repo.
>>>>>>>>
>>>>>>>> Could I ask for another run with it? Please also share the complete
>>>>>>>> dmesg from boot, as I would need to check debug output when CPUs are
>>>>>>>> first onlined.
>>>>>>
>>>>>> So you have a system with 2 big and 4 LITTLE CPUs (Denver0 Denver1 A57_0
>>>>>> A57_1 A57_2 A57_3) in one MC sched domain and (Denver1 and A57_0) are
>>>>>> isol CPUs?
>>>>>
>>>>> I believe that 1-2 are the denvers (even though they are listed as 0-1
>>>>> in device-tree).
>>>>
>>>> Interesting, I have yet to reproduce this with equal capacities in
>>>> isolcpus. Maybe I didn't try hard enough yet.
>>>>
>>>>>> This should be easy to set up for me on my Juno-r0 [A53 A57 A57 A53 A53 A53]
>>>>>
>>>>> Yes I think it is similar to this.
>>>>>
>>>>> Thanks!
>>>>> Jon
>>>>
>>>> I could reproduce that on a different LLLLbb with isolcpus=3,4 (Lb) and
>>>> the offlining order:
>>>>
>>>>   echo 0 > /sys/devices/system/cpu/cpu5/online
>>>>   echo 0 > /sys/devices/system/cpu/cpu1/online
>>>>   echo 0 > /sys/devices/system/cpu/cpu3/online
>>>>   echo 0 > /sys/devices/system/cpu/cpu2/online
>>>>   echo 0 > /sys/devices/system/cpu/cpu4/online
>>>>
>>>> while the following offlining order succeeds:
>>>>
>>>>   echo 0 > /sys/devices/system/cpu/cpu5/online
>>>>   echo 0 > /sys/devices/system/cpu/cpu4/online
>>>>   echo 0 > /sys/devices/system/cpu/cpu1/online
>>>>   echo 0 > /sys/devices/system/cpu/cpu2/online
>>>>   echo 0 > /sys/devices/system/cpu/cpu3/online
>>>>
>>>> (Both offline an isolated CPU last, both have CPU0 online.)
>>
>> Could reproduce on Juno-r0:
>>
>>   0 1 2 3 4 5
>>   L b b L L L
>>         ^^^
>>   isol = [3-4] so both L
>>
>>   echo 0 > /sys/devices/system/cpu/cpu1/online
>>   echo 0 > /sys/devices/system/cpu/cpu4/online
>>   echo 0 > /sys/devices/system/cpu/cpu5/online
>>   echo 0 > /sys/devices/system/cpu/cpu2/online - isol
>>   echo 0 > /sys/devices/system/cpu/cpu3/online - isol
>>
>>>> The issue only triggers with sugov DL threads (I guess that's obvious, but
>>>> just to mention it).
>>
>> IMHO, it doesn't have to be a sugov DL task. Any DL task will do.
>
> OK, but in this case we actually want to fail. If we have allocated
> bandwidth for an actual DL task (not a dl server or a 'fake' sugov), we
> don't want to inadvertently leave it w/o bandwidth by turning CPUs off.

Obviously ... ;-)

Same platform w/ isol = [2-3] and a slow-switching CPUfreq driver to force
having 'sugov' tasks:

  # ps2 | grep DLN
     95    95 S 140 0 - DLN sugov:0
     96    96 S 140 0 - DLN sugov:1

  # taskset -p 95; taskset -p 96
  pid 95's current affinity mask: 39
  pid 96's current affinity mask: 6

offline order: CPU1 -> 4 -> 5 -> 3 -> 2 ...
  pid 95's current affinity mask: 1
  pid 96's current affinity mask: 4

  root@juno:~# echo 0 > /sys/devices/system/cpu/cpu2/online
  [ 227.673757] dl_bw_cpus() cpu=6 rd->span=1-5 cpu_active_mask=0,2 cpus=1
  [ 227.680329] dl_bw_cpus() cpu=6 rd->span=1-5 cpu_active_mask=0,2 cpus=1
  [ 227.686882] dl_bw_manage: cpu=2 cap=0 fair_server_bw=52428 total_bw=157285 dl_bw_cpus=1 type=DEF span=1-5
  [ 227.686900] dl_bw_cpus() cpu=6 rd->span=1-5 cpu_active_mask=0,2 cpus=1
  [ 227.703066] dl_bw_manage() cpu=2 cap=0 overflow=1 return=-16
  -bash: echo: write error: Device or resource busy

So it seems 'sugov:1' is getting in the way here:

  pid 95's current affinity mask: 1
  pid 96's current affinity mask: 5

Looks like it's not a 'bL' issue but rather one with '>=2 CPU frequency
policies' and slow-switching CPUfreq drivers.
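FWIW, the failure mode in the debug output can be modelled with a few lines of
Python. This is only a rough sketch of the capacity-aware admission check
(mirroring the shape of the kernel's __dl_overflow()/cap_scale(), not the
actual code, and assuming the default 95% RT limit); the numbers are taken
from the dl_bw_manage log lines above:

```python
BW_UNIT = 1 << 20              # bandwidth values use BW_SHIFT = 20
MAX_BW = 950000 * BW_UNIT // 1000000  # default sched_rt_runtime/period = 95%

def cap_scale(bw: int, cap: int) -> int:
    """Scale a bandwidth by a capacity (SCHED_CAPACITY_SHIFT = 10)."""
    return (bw * cap) >> 10

def dl_overflow(total_bw: int, cap: int, max_bw: int = MAX_BW) -> bool:
    """True if reserved DL bandwidth no longer fits the remaining capacity."""
    return cap_scale(max_bw, cap) < total_bw

# From the failing hotplug attempt: after offlining cpu2 the remaining
# capacity usable by the root domain's DL tasks is cap=0, while sugov:1
# plus a fair server still have total_bw=157285 reserved.
print(dl_overflow(total_bw=157285, cap=0))     # True  -> -EBUSY (-16)
print(dl_overflow(total_bw=157285, cap=1024))  # False -> offlining would pass
```

With cap=0 any non-zero total_bw overflows, which matches dl_bw_manage()
returning -16 above: the last CPU of the second frequency policy cannot go
offline while its sugov thread still holds bandwidth there.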