On Thu, Oct 10, 2024 at 1:24 PM Tomas Glozar <tglozar@xxxxxxxxxx> wrote:
>
> Unfortunately, the following rcu stalls appear to have resulted in
> abnormal termination of the VM, which led to the ftrace buffer not
> being dumped into the console. Currently re-running the same test with
> the addition of "ftrace_dump_on_oops panic_on_warn=1" and hoping for
> the best.
>

Adding ftrace_dump_on_oops and panic_on_warn fixed the tracing issue.
The last 20 lines on CPU 1, where the WARN was triggered, are:

[21031.703970] rcu_tort-6861 1dN.4. 21027715638us : sched_wakeup: comm=ksoftirqd/1 pid=26 prio=97 target_cpu=001
[21031.705308] rcu_tort-6861 1d..2. 21027715639us : sched_switch: prev_comm=rcu_torture_boo prev_pid=6861 prev_prio=98 prev_state=R+ ==> next_comm=ksoftirqd/1 next_pid=26 next_prio=97
[21031.707388] ksoftirq-26 1d.s3. 21027715645us : sched_wakeup: comm=rcu_preempt pid=16 prio=97 target_cpu=000
[21031.710718] ksoftirq-26 1d..2. 21027715649us : sched_switch: prev_comm=ksoftirqd/1 prev_pid=26 prev_prio=97 prev_state=S ==> next_comm=rcu_torture_boo next_pid=6861 next_prio=98
[21031.719409] rcu_tort-6861 1d..2. 21027719648us : sched_switch: prev_comm=rcu_torture_boo prev_pid=6861 prev_prio=98 prev_state=R+ ==> next_comm=rcu_preempt next_pid=16 next_prio=97
[21031.723490] rcu_pree-16 1d..2. 21027719657us : sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=97 prev_state=I ==> next_comm=rcu_torture_boo next_pid=6861 next_prio=98
[21031.725527] rcu_tort-6861 1dN.3. 21027724637us : sched_wakeup: comm=ksoftirqd/1 pid=26 prio=97 target_cpu=001
[21031.726817] rcu_tort-6861 1d..2. 21027724638us : sched_switch: prev_comm=rcu_torture_boo prev_pid=6861 prev_prio=98 prev_state=R+ ==> next_comm=ksoftirqd/1 next_pid=26 next_prio=97
[21031.728867] ksoftirq-26 1d.s3. 21027724644us : sched_wakeup: comm=rcu_preempt pid=16 prio=97 target_cpu=008
[21031.732215] ksoftirq-26 1d..2. 21027724648us : sched_switch: prev_comm=ksoftirqd/1 prev_pid=26 prev_prio=97 prev_state=S ==> next_comm=rcu_torture_boo next_pid=6861 next_prio=98
[21031.751734] rcu_tort-6861 1d..2. 21027729646us : sched_switch: prev_comm=rcu_torture_boo prev_pid=6861 prev_prio=98 prev_state=R+ ==> next_comm=rcu_preempt next_pid=16 next_prio=97
[21031.755815] rcu_pree-16 1d..2. 21027729656us : sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=97 prev_state=I ==> next_comm=rcu_torture_boo next_pid=6861 next_prio=98
[21031.757850] rcu_tort-6861 1dN.4. 21027734637us : sched_wakeup: comm=ksoftirqd/1 pid=26 prio=97 target_cpu=001
[21031.759147] rcu_tort-6861 1d..2. 21027734638us : sched_switch: prev_comm=rcu_torture_boo prev_pid=6861 prev_prio=98 prev_state=R+ ==> next_comm=ksoftirqd/1 next_pid=26 next_prio=97
[21031.761193] ksoftirq-26 1d.s3. 21027734643us : sched_wakeup: comm=rcu_preempt pid=16 prio=97 target_cpu=009
[21031.764531] ksoftirq-26 1d..2. 21027734648us : sched_switch: prev_comm=ksoftirqd/1 prev_pid=26 prev_prio=97 prev_state=S ==> next_comm=rcu_torture_boo next_pid=6861 next_prio=98
[21031.871062] rcu_tort-6861 1d..2. 21027768837us : sched_switch: prev_comm=rcu_torture_boo prev_pid=6861 prev_prio=98 prev_state=I ==> next_comm=kworker/1:1 next_pid=5774 next_prio=120
[21031.882721] kworker/-5774 1d..2. 21027768847us : sched_switch: prev_comm=kworker/1:1 prev_pid=5774 prev_prio=120 prev_state=I ==> next_comm=kworker/1:2 next_pid=6636 next_prio=120
[21031.889173] kworker/-6636 1dN.3. 21027768851us : dl_server_stop <-dequeue_entities
[21031.902411] kworker/-6636 1dN.3. 21027768863us : dl_server_start <-enqueue_task_fair

and then the trace goes silent for that CPU. That corresponds to the
warning after which the panic is triggered:

[21027.819253] WARNING: CPU: 1 PID: 6636 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x516/0x5d0
[21027.820450] Modules linked in:
[21027.820856] CPU: 1 UID: 0 PID: 6636 Comm: kworker/1:2 Not tainted 6.11.0-g2004cef11ea0-dirty #1
[21027.821993] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9 04/01/2014
[21027.822883] Workqueue: 0x0 (rcu_gp)
[21027.823363] RIP: 0010:enqueue_dl_entity+0x516/0x5d0
[21027.824002] Code: ff 48 89 ef e8 8b cf ff ff 0f b6 4d 54 e9 0e fc ff ff 85 db 0f 84 d0 fe ff ff 5b 44 89 e6 48 89 ef 5d 41 5c e9 db d6 ff ff 90 <0f> 0b 90 e9 fa fa ff ff 48 8b bb f8 09 00 00 48 39 fe 0f 89 de fb
[21027.827134] RSP: 0000:ffff8cbd85713bf0 EFLAGS: 00010086
[21027.827817] RAX: 0000000000000001 RBX: ffff8919dc86cc28 RCX: 0000000000000002
[21027.828749] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8919dc86cc28
[21027.829678] RBP: ffff8919dc86cc28 R08: 0000000000000001 R09: 00000000000001f4
[21027.830608] R10: 0000000000000000 R11: ffff8919c1efe610 R12: 0000000000000001
[21027.831537] R13: 0000000000225510 R14: ffff8919dc86c380 R15: ffff8919dc86c3c0
[21027.832468] FS: 0000000000000000(0000) GS:ffff8919dc840000(0000) knlGS:0000000000000000
[21027.833521] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21027.834277] CR2: 0000000000000000 CR3: 000000001e62e000 CR4: 00000000000006f0
[21027.835204] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[21027.836127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[21027.837049] Call Trace:
[21027.837388] <TASK>
[21027.837671] ? __warn+0x88/0x130
[21027.838107] ? enqueue_dl_entity+0x516/0x5d0
[21027.838665] ? report_bug+0x18e/0x1a0
[21027.839181] ? handle_bug+0x54/0x90
[21027.839640] ? exc_invalid_op+0x18/0x70
[21027.840145] ? asm_exc_invalid_op+0x1a/0x20
[21027.840696] ? enqueue_dl_entity+0x516/0x5d0
[21027.841259] dl_server_start+0x36/0xf0
[21027.841751] enqueue_task_fair+0x220/0x6b0
[21027.842291] activate_task+0x26/0x60
[21027.842761] attach_task+0x35/0x50
[21027.843213] sched_balance_rq+0x663/0xe00
[21027.843739] sched_balance_newidle.constprop.0+0x1a5/0x360
[21027.844455] pick_next_task_fair+0x2f/0x340
[21027.845000] __schedule+0x203/0x900
[21027.845470] ? sync_rcu_do_polled_gp+0xba/0x110
[21027.846068] schedule+0x27/0xd0
[21027.846483] worker_thread+0x1a7/0x3b0
[21027.846974] ? __pfx_worker_thread+0x10/0x10
[21027.847539] kthread+0xd6/0x100
[21027.847955] ? __pfx_kthread+0x10/0x10
[21027.848455] ret_from_fork+0x34/0x50
[21027.848923] ? __pfx_kthread+0x10/0x10
[21027.849420] ret_from_fork_asm+0x1a/0x30
[21027.849934] </TASK>

Apparently, dl_server_start is not called twice, so it has to be
something else messing with dl_se->rb_node.

Tomas
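
P.S. For anyone following along, my reading of the check: deadline.c:1995
appears to be the WARN_ON_ONCE(on_dl_rq(dl_se)) at the top of
enqueue_dl_entity(), i.e. the server's rb_node is expected to be
RB_EMPTY_NODE at enqueue time. Below is a minimal user-space sketch of
that invariant; the toy_* names are made up for illustration and this is
not kernel code:

/*
 * Toy model of the invariant behind the warning: an entity may only be
 * enqueued while its node is unlinked (the kernel checks this via
 * !RB_EMPTY_NODE(&dl_se->rb_node)).
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_dl_entity {
	bool node_linked;	/* stands in for !RB_EMPTY_NODE(&dl_se->rb_node) */
};

static void toy_enqueue(struct toy_dl_entity *dl_se)	/* ~ start/enqueue path */
{
	if (dl_se->node_linked)		/* ~ WARN_ON_ONCE(on_dl_rq(dl_se)) */
		fprintf(stderr, "WARNING: enqueue while already enqueued\n");
	dl_se->node_linked = true;
}

static void toy_dequeue(struct toy_dl_entity *dl_se)	/* ~ stop/dequeue path */
{
	dl_se->node_linked = false;	/* ~ node cleared when erased from the tree */
}

int main(void)
{
	struct toy_dl_entity server = { .node_linked = false };

	/* The pairing seen in the trace above: stop, then start -- fine. */
	toy_enqueue(&server);
	toy_dequeue(&server);	/* dl_server_stop <-dequeue_entities */
	toy_enqueue(&server);	/* dl_server_start <-enqueue_task_fair */

	/*
	 * What would trip the warning: a second enqueue without a dequeue,
	 * or anything else leaving rb_node in a non-empty state.
	 */
	toy_enqueue(&server);

	return 0;
}

In other words, the single stop/start pair visible in the trace is fine
on its own; only an enqueue on an already-linked (or stale) rb_node
would fire the warning, which is why I suspect something else is
corrupting dl_se->rb_node.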