On Tue, Aug 27, 2024 at 10:30:24PM +0200, Valentin Schneider wrote:
> On 27/08/24 11:35, Paul E. McKenney wrote:
> > On Tue, Aug 27, 2024 at 10:33:13AM -0700, Paul E. McKenney wrote:
> >> On Tue, Aug 27, 2024 at 05:41:52PM +0200, Valentin Schneider wrote:
> >> > I've taken tip/sched/core and shuffled hunks around; I didn't re-order any
> >> > commit. I've also taken out the dequeue from switched_from_fair() and put
> >> > it at the very top of the branch which should hopefully help bisection.
> >> >
> >> > The final delta between that branch and tip/sched/core is empty, so it
> >> > really is just shuffling inbetween commits.
> >> >
> >> > Please find the branch at:
> >> >
> >> > https://gitlab.com/vschneid/linux.git -b mainline/sched/eevdf-complete-builderr
> >> >
> >> > I'll go stare at the BUG itself now.
> >>
> >> Thank you!
> >>
> >> I have fired up tests on the "BROKEN?" commit. If that fails, I will
> >> try its predecessor, and if that fails, I will bisect from e28b5f8bda01
> >> ("sched/fair: Assert {set_next,put_prev}_entity() are properly balanced"),
> >> which has stood up to heavy hammering in earlier testing.
> >
> > And 50 runs of TREE03 on the "BROKEN?" commit resulted in 32 failures.
> > Of these, 29 were the dequeue_rt_stack() failure. Two more were RCU
> > CPU stall warnings, and the last one was an oddball "kernel BUG at
> > kernel/sched/rt.c:1714" followed by an equally oddball "Oops: invalid
> > opcode: 0000 [#1] PREEMPT SMP PTI".
> >
> > Just to be specific, this is commit:
> >
> > df8fe34bfa36 ("BROKEN? sched/fair: Dequeue sched_delayed tasks when switching from fair")
> >
> > This commit's predecessor is this commit:
> >
> > 2f888533d073 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
> >
> > This predecessor commit passes 50 runs of TREE03 with no failures.
> >
> > So that addition of that dequeue_task() call to the switched_from_fair()
> > function is looking quite suspicious to me. ;-)
> >
> > Thanx, Paul
>
> Thanks for the testing!
>
> The WARN_ON_ONCE(!rt_se->on_list); hit in __dequeue_rt_entity() feels like
> a put_prev/set_next kind of issue...
>
> So far I'd assumed a ->sched_delayed task can't be current during
> switched_from_fair(), I got confused because it's Mond^CCC Tuesday, but I
> think that still holds: we can't get a balance_dl() or balance_rt() to drop
> the RQ lock because prev would be fair, and we can't get a
> newidle_balance() with a ->sched_delayed task because we'd have
> sched_fair_runnable() := true.
>
> I'll pick this back up tomorrow, this is a task that requires either
> caffeine or booze and it's too late for either.

Thank you for chasing this, and get some sleep! This one is of course
annoying, but it is not (yet) an emergency. I look forward to seeing
what you come up with.

Also, I would of course be happy to apply debug patches.

Thanx, Paul