On 27/08/24 11:35, Paul E. McKenney wrote:
> On Tue, Aug 27, 2024 at 10:33:13AM -0700, Paul E. McKenney wrote:
>> On Tue, Aug 27, 2024 at 05:41:52PM +0200, Valentin Schneider wrote:
>> > I've taken tip/sched/core and shuffled hunks around; I didn't re-order any
>> > commit. I've also taken out the dequeue from switched_from_fair() and put
>> > it at the very top of the branch which should hopefully help bisection.
>> >
>> > The final delta between that branch and tip/sched/core is empty, so it
>> > really is just shuffling inbetween commits.
>> >
>> > Please find the branch at:
>> >
>> > https://gitlab.com/vschneid/linux.git -b mainline/sched/eevdf-complete-builderr
>> >
>> > I'll go stare at the BUG itself now.
>>
>> Thank you!
>>
>> I have fired up tests on the "BROKEN?" commit. If that fails, I will
>> try its predecessor, and if that fails, I will bisect from e28b5f8bda01
>> ("sched/fair: Assert {set_next,put_prev}_entity() are properly balanced"),
>> which has stood up to heavy hammering in earlier testing.
>
> And of 50 runs of TREE03 on the "BROKEN?" commit resulted in 32 failures.
> Of these, 29 were the dequeue_rt_stack() failure. Two more were RCU
> CPU stall warnings, and the last one was an oddball "kernel BUG at
> kernel/sched/rt.c:1714" followed by an equally oddball "Oops: invalid
> opcode: 0000 [#1] PREEMPT SMP PTI".
>
> Just to be specific, this is commit:
>
> df8fe34bfa36 ("BROKEN? sched/fair: Dequeue sched_delayed tasks when switching from fair")
>
> This commit's predecessor is this commit:
>
> 2f888533d073 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
>
> This predecessor commit passes 50 runs of TREE03 with no failures.
>
> So that addition of that dequeue_task() call to the switched_from_fair()
> function is looking quite suspicious to me. ;-)
>
> Thanx, Paul

Thanks for the testing!

The WARN_ON_ONCE(!rt_se->on_list); hit in __dequeue_rt_entity() feels like
a put_prev/set_next kind of issue...

So far I'd assumed a ->sched_delayed task can't be current during
switched_from_fair(). I got confused because it's Mond^CCC Tuesday, but I
think that still holds: we can't get a balance_dl() or balance_rt() to drop
the RQ lock because prev would be fair, and we can't get a
newidle_balance() with a ->sched_delayed task because we'd have
sched_fair_runnable() := true.

I'll pick this back up tomorrow; this is a task that requires either
caffeine or booze and it's too late for either.