On 12/02/25 23:01, Jon Hunter wrote:
>
> On 11/02/2025 10:42, Juri Lelli wrote:
> > On 11/02/25 10:15, Christian Loehle wrote:
> > > On 2/10/25 17:09, Juri Lelli wrote:
> > > > Hi Christian,
> > > >
> > > > Thanks for taking a look as well.
> > > >
> > > > On 07/02/25 15:55, Christian Loehle wrote:
> > > > > On 2/7/25 14:04, Jon Hunter wrote:
> > > > > >
> > > > > >
> > > > > > On 07/02/2025 13:38, Dietmar Eggemann wrote:
> > > > > > > On 07/02/2025 11:38, Jon Hunter wrote:
> > > > > > > >
> > > > > > > > On 06/02/2025 09:29, Juri Lelli wrote:
> > > > > > > > > On 05/02/25 16:56, Jon Hunter wrote:
> > > > > > > > >
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > > Thanks! That did make it easier :-)
> > > > > > > > > >
> > > > > > > > > > Here is what I see ...
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > Still different from what I can repro over here, so, unfortunately, I
> > > > > > > > > had to add additional debug printks. Pushed to the same branch/repo.
> > > > > > > > >
> > > > > > > > > Could I ask for another run with it? Please also share the complete
> > > > > > > > > dmesg from boot, as I would need to check debug output when CPUs are
> > > > > > > > > first onlined.
> > > > > > >
> > > > > > > So you have a system with 2 big and 4 LITTLE CPUs (Denver0 Denver1 A57_0
> > > > > > > A57_1 A57_2 A57_3) in one MC sched domain and (Denver1 and A57_0) are
> > > > > > > isol CPUs?
> > > > > >
> > > > > > I believe that 1-2 are the denvers (even though they are listed as 0-1
> > > > > > in device-tree).
> > > > >
> > > > > Interesting, I have yet to reproduce this with equal capacities in isolcpus.
> > > > > Maybe I didn't try hard enough yet.
> > > > >
> > > > > > > This should be easy to set up for me on my Juno-r0 [A53 A57 A57 A53 A53 A53]
> > > > > >
> > > > > > Yes I think it is similar to this.
> > > > > >
> > > > > > Thanks!
> > > > > > Jon
> > > > >
> > > > > I could reproduce that on a different LLLLbb with isolcpus=3,4 (Lb) and
> > > > > the offlining order:
> > > > > echo 0 > /sys/devices/system/cpu/cpu5/online
> > > > > echo 0 > /sys/devices/system/cpu/cpu1/online
> > > > > echo 0 > /sys/devices/system/cpu/cpu3/online
> > > > > echo 0 > /sys/devices/system/cpu/cpu2/online
> > > > > echo 0 > /sys/devices/system/cpu/cpu4/online
> > > > >
> > > > > while the following offlining order succeeds:
> > > > > echo 0 > /sys/devices/system/cpu/cpu5/online
> > > > > echo 0 > /sys/devices/system/cpu/cpu4/online
> > > > > echo 0 > /sys/devices/system/cpu/cpu1/online
> > > > > echo 0 > /sys/devices/system/cpu/cpu2/online
> > > > > echo 0 > /sys/devices/system/cpu/cpu3/online
> > > > > (Both offline an isolcpu last, both have CPU0 online)
> > > > >
> > > > > The issue only triggers with sugov DL threads (I guess that's obvious, but
> > > > > just to mention it).
> > > >
> > > > It wasn't obvious to me at first :). So thanks for confirming.
> > > >
> > > > > I'll investigate some more later but wanted to share for now.
> > > >
> > > > So, the problem actually is that I am not yet sure what we should do with
> > > > sugovs' bandwidth wrt root domain accounting. W/o isolation it's all
> > > > good, as it gets accounted for correctly on the dynamic domains sugov
> > > > tasks can run on. But with isolation and sugov affected_cpus that cross
> > > > isolation domains (e.g., one BIG one little), we can get into trouble
> > > > not knowing if sugov contribution should fall on the DEF or DYN domain.
> > > >
> > > > Hummm, need to think more about it.
> > >
> > > That is indeed tricky.
> > > I would've found it super appealing to always just have sugov DL tasks activate
> > > on this_cpu and not have to worry about all this, but then you have contention
> > > amongst CPUs of a cluster, and there are energy improvements from always
> > > having little cores handle all sugov DL tasks, even for the big CPUs.
> > > That's why I introduced
> > > commit 93940fbdc468 ("cpufreq/schedutil: Only bind threads if needed"),
> > > but that really doesn't make this any easier.
> >
> > What if we actually ignore them consistently? We already do that for
> > admission control, so maybe we can do that when rebuilding domains as
> > well (until we find maybe a better way to deal with them).
> >
> > Does the following make any difference?
> >
> > ---
> >  kernel/sched/deadline.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index b254d878789d..8f7420e0c9d6 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -2995,7 +2995,7 @@ void dl_add_task_root_domain(struct task_struct *p)
> >  	struct dl_bw *dl_b;
> >  
> >  	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> > -	if (!dl_task(p)) {
> > +	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
> >  		raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> >  		return;
> >  	}
>
> I have tested this on top of v6.14-rc2, but this is still not resolving the
> issue for me :-(

Thanks for testing.

Was the testing done with the full stack of changes I proposed so far? I
believe we still have to fix the accounting of dl_servers for the def root
domain (there is a patch that should do that). I updated the branch with
the full set.

In case it still fails, could you please collect dmesg and tracing output
as I suggested and share?

Best,
Juri
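
P.S. In case it helps while staring at the hunk above: dl_entity_is_special()
should only match the sugov kthreads, i.e. DL entities carrying
SCHED_FLAG_SUGOV, which is also the flag sched_dl_overflow() checks to skip
admission control for them. Sketching the helper from memory (please double
check against kernel/sched/sched.h in your tree):

/* "Special" DL entities are the schedutil (sugov) governor kthreads. */
static inline bool dl_entity_is_special(const struct sched_dl_entity *dl_se)
{
#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
	/* SCHED_FLAG_SUGOV marks the per-policy sugov kthreads */
	return unlikely(dl_se->flags & SCHED_FLAG_SUGOV);
#else
	return false;
#endif
}

So the intent of the hunk is simply to make the root domain rebuild path
ignore the same tasks that admission control already ignores.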