On Sat, 2023-04-08 at 07:46 -0400, Jeff Layton wrote:
> I've hit some repeated crashes in drm_dp_add_payload_part2. Here's one
> from this morning that occurred not long after booting the machine. I
> hadn't even logged in yet -- it was still at a gdm prompt:
>
> Apr 08 05:34:20 tleilax kernel: amdgpu 0000:30:00.0: [drm] Failed to create MST payload for port 0000000074d1d8eb: -5
> Apr 08 05:34:20 tleilax kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
> Apr 08 05:34:20 tleilax kernel: #PF: supervisor read access in kernel mode
> Apr 08 05:34:20 tleilax kernel: #PF: error_code(0x0000) - not-present page
> Apr 08 05:34:20 tleilax kernel: PGD 0 P4D 0
> Apr 08 05:34:20 tleilax kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
> Apr 08 05:34:20 tleilax kernel: CPU: 8 PID: 2278 Comm: gnome-shell Kdump: loaded Not tainted 6.2.9-200.fc37.x86_64 #1
> Apr 08 05:34:20 tleilax kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A33/X370 SLI PLUS (MS-7A33), BIOS 3.JR 11/29/2019
> Apr 08 05:34:20 tleilax kernel: RIP: 0010:drm_dp_add_payload_part2+0xca/0x100 [drm_display_helper]
> Apr 08 05:34:20 tleilax kernel: Code: 8b 7e 08 44 89 e9 4c 89 c2 48 c7 c6 60 d2 55 c0 e8 ab 69 54 c5 44 89 e8 5b 5d 41 5c 41 5d e9 2d 73 a2 c5 48 8b 80 60 05 00 00 <48> 8b 76 08 4c 8b 40 60 48 85 f6 74 04 48 8b 76 08 4>
> Apr 08 05:34:20 tleilax kernel: RSP: 0018:ffffa4238a2db590 EFLAGS: 00010246
> Apr 08 05:34:20 tleilax kernel: RAX: ffff961550cac000 RBX: ffff961550cac000 RCX: ffffffffc055ca98
> Apr 08 05:34:20 tleilax kernel: RDX: ffff9615a6326140 RSI: 0000000000000000 RDI: ffff9615578a4568
> Apr 08 05:34:20 tleilax kernel: RBP: 0000000000000001 R08: 00000000fffffffb R09: 0000000000000000
> Apr 08 05:34:20 tleilax kernel: R10: 0000000000000002 R11: 0000000000000100 R12: ffff9615578a4000
> Apr 08 05:34:20 tleilax kernel: R13: ffff96154a5b8de0 R14: ffffffffc0d9d980 R15: ffff9615589c1f90
> Apr 08 05:34:20 tleilax kernel: FS: 00007f1c8ad775c0(0000) GS:ffff96241f000000(0000) knlGS:0000000000000000
> Apr 08 05:34:20 tleilax kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 08 05:34:20 tleilax kernel: CR2: 0000000000000008 CR3: 000000012f908000 CR4: 00000000003506e0
> Apr 08 05:34:20 tleilax kernel: Call Trace:
> Apr 08 05:34:20 tleilax kernel: <TASK>
> Apr 08 05:34:20 tleilax kernel: dm_helpers_dp_mst_send_payload_allocation+0x83/0xb0 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: dc_link_allocate_mst_payload+0x16d/0x280 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: core_link_enable_stream+0x8ec/0xa10 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: ? optc1_set_drr+0x136/0x1e0 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: dce110_apply_ctx_to_hw+0x61b/0x670 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: dc_commit_state_no_check+0x39b/0xcd0 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: dc_commit_state+0x107/0x120 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: amdgpu_dm_atomic_commit_tail+0x5bf/0x2d20 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: ? cpufreq_this_cpu_can_update+0x12/0x60
> Apr 08 05:34:20 tleilax kernel: ? sugov_get_util+0x7e/0x90
> Apr 08 05:34:20 tleilax kernel: ? sugov_update_single_freq+0xb7/0x180
> Apr 08 05:34:20 tleilax kernel: ? _raw_spin_lock+0x13/0x40
> Apr 08 05:34:20 tleilax kernel: ? raw_spin_rq_lock_nested+0x1e/0x70
> Apr 08 05:34:20 tleilax kernel: ? psi_group_change+0x168/0x400
> Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock+0x15/0x30
> Apr 08 05:34:20 tleilax kernel: ? finish_task_switch.isra.0+0x9b/0x300
> Apr 08 05:34:20 tleilax kernel: ? __switch_to+0x106/0x410
> Apr 08 05:34:20 tleilax kernel: ? __schedule+0x3d4/0x13c0
> Apr 08 05:34:20 tleilax kernel: ? dma_resv_get_fences+0x11b/0x220
> Apr 08 05:34:20 tleilax kernel: ? get_nohz_timer_target+0x18/0x190
> Apr 08 05:34:20 tleilax kernel: ? lock_timer_base+0x61/0x80
> Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock_irqrestore+0x23/0x40
> Apr 08 05:34:20 tleilax kernel: ? __mod_timer+0x29e/0x3d0
> Apr 08 05:34:20 tleilax kernel: ? preempt_count_add+0x6a/0xa0
> Apr 08 05:34:20 tleilax kernel: ? _raw_spin_lock_irq+0x19/0x40
> Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock_irq+0x1b/0x40
> Apr 08 05:34:20 tleilax kernel: ? wait_for_completion_timeout+0x13a/0x170
> Apr 08 05:34:20 tleilax kernel: ? wait_for_completion_interruptible+0x135/0x1e0
> Apr 08 05:34:20 tleilax kernel: ? __pfx_dma_fence_default_wait_cb+0x10/0x10
> Apr 08 05:34:20 tleilax kernel: commit_tail+0x94/0x130
> Apr 08 05:34:20 tleilax kernel: drm_atomic_helper_commit+0x112/0x140
> Apr 08 05:34:20 tleilax kernel: drm_atomic_commit+0x96/0xc0
> Apr 08 05:34:20 tleilax kernel: ? __pfx___drm_printfn_info+0x10/0x10
> Apr 08 05:34:20 tleilax kernel: drm_mode_atomic_ioctl+0x959/0xb50
> Apr 08 05:34:20 tleilax kernel: ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
> Apr 08 05:34:20 tleilax kernel: drm_ioctl_kernel+0xc9/0x170
> Apr 08 05:34:20 tleilax kernel: drm_ioctl+0x22f/0x410
> Apr 08 05:34:20 tleilax kernel: ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
> Apr 08 05:34:20 tleilax kernel: amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
> Apr 08 05:34:20 tleilax kernel: __x64_sys_ioctl+0x90/0xd0
> Apr 08 05:34:20 tleilax kernel: do_syscall_64+0x5b/0x80
> Apr 08 05:34:20 tleilax kernel: ? __x64_sys_ioctl+0xa8/0xd0
> Apr 08 05:34:20 tleilax kernel: ? syscall_exit_to_user_mode+0x17/0x40
> Apr 08 05:34:20 tleilax kernel: ? do_syscall_64+0x67/0x80
> Apr 08 05:34:20 tleilax kernel: ? sched_clock_cpu+0xb/0xc0
> Apr 08 05:34:20 tleilax kernel: ? __irq_exit_rcu+0x3d/0x140
> Apr 08 05:34:20 tleilax kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
> Apr 08 05:34:20 tleilax kernel: RIP: 0033:0x7f1c8e723d6f
> Apr 08 05:34:20 tleilax kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 0>
> Apr 08 05:34:20 tleilax kernel: RSP: 002b:00007ffea61067d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> Apr 08 05:34:20 tleilax kernel: RAX: ffffffffffffffda RBX: 00005571af410fb0 RCX: 00007f1c8e723d6f
> Apr 08 05:34:20 tleilax kernel: RDX: 00007ffea6106870 RSI: 00000000c03864bc RDI: 000000000000000a
> Apr 08 05:34:20 tleilax kernel: RBP: 00007ffea6106870 R08: 0000000000000011 R09: 0000000000000011
> Apr 08 05:34:20 tleilax kernel: R10: 00005571ae320010 R11: 0000000000000246 R12: 00000000c03864bc
> Apr 08 05:34:20 tleilax kernel: R13: 000000000000000a R14: 00005571ae6ff140 R15: 00005571b0261950
> Apr 08 05:34:20 tleilax kernel: </TASK>
> Apr 08 05:34:20 tleilax kernel: Modules linked in: rfcomm snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp nf_conntrack_netbi>
> Apr 08 05:34:20 tleilax kernel: videobuf2_memops rapl mxm_wmi videobuf2_v4l2 wmi_bmof snd_pcm k10temp rfkill pcspkr videobuf2_common i2c_piix4 snd_timer joydev videodev snd mc parport_pc soundcore parport gpio_amdpt g>
> Apr 08 05:34:20 tleilax kernel: CR2: 0000000000000008
> Apr 08 05:34:20 tleilax kernel: ---[ end trace 0000000000000000 ]---
>
> $ ./scripts/faddr2line --list /usr/lib/debug/lib/modules/6.2.9-200.fc37.x86_64/kernel/drivers/gpu/drm/display/drm_display_helper.ko.debug drm_dp_add_payload_part2+0xca/0x100
> drm_dp_add_payload_part2+0xca/0x100:
>
> drm_dp_add_payload_part2 at /usr/src/debug/kernel-6.2.9/linux-6.2.9-200.fc37.x86_64/drivers/gpu/drm/display/drm_dp_mst_topology.c:3407
>  3402 {
>  3403         int ret = 0;
>  3404
>  3405         /* Skip failed payloads */
>  3406         if (payload->vc_start_slot == -1) {
> >3407<                drm_dbg_kms(state->dev, "Part 1 of payload creation for %s failed, skipping part 2\n",
>  3408                             payload->port->connector->name);
>  3409                 return -EIO;
>  3410         }
>  3411
>  3412         ret = drm_dp_create_payload_step2(mgr, payload);
>
> Since %rsi is NULL and the ->dev field is 8 bytes into the struct, I'm
> guessing that means that "state" was NULL here.
>
> I'm assuming that the real bug is in the caller (and I'm happy to help
> track that down), but would it make sense to allow this function to
> gracefully handle a NULL state pointer? IOW something like this?
>
>     drm_dbg_kms(state ? state->dev : NULL, "Part 1 of payload creation for %s failed, skipping part 2\n",
>
> I think that would at least prevent this problem from crashing the machine.
>

FWIW, I patched my kernel with the above, and it did seem to save the
box from crashing when this happened again:

[14357.953046] amdgpu 0000:30:00.0: [drm] Failed to create MST payload for port 000000006d3a3885: -5
[14358.025845] [drm] DM_MST: stopping TM on aconnector: 00000000ef1bcb79 [id: 86]
[14358.593779] [drm] DM_MST: starting TM on aconnector: 00000000ef1bcb79 [id: 86]

In this case, all of my windows got moved to the secondary monitor, but
the machine stayed up and running. This seems to occur mostly when the
display goes to sleep.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>