Re: BUG in drm_kms_helper_poll_enable() fixed by reverting "drm/ast: report connection status on Display Port."

On 09/11/2023 01:37, Kim Phillips wrote:
Hi, current linux kernel commit 90450a06162e
("Merge tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks")
and the attached config cause the following BUG when booting on
a reference AMD Zen4 development server:

[   59.995717] input: OpenBMC virtual_input as /devices/pci0000:00/0000:00:07.1/0000:02:00.4/usb3/3-2/3-2.6/3-2.6:1.0/0003:1D6B:0104.0002/input/input4
[   60.033135] ast 0000:c2:00.0: vgaarb: deactivate vga console
[   60.066230] ast 0000:c2:00.0: [drm] Using default configuration
[   60.070342] hid-generic 0003:1D6B:0104.0002: input,hidraw0: USB HID v1.01 Keyboard [OpenBMC virtual_input] on usb-0000:02:00.4-2.6/input0
[   60.072843] ast 0000:c2:00.0: [drm] AST 2600 detected
[   60.072851] ast 0000:c2:00.0: [drm] Using ASPEED DisplayPort transmitter
[   60.099891] ast 0000:c2:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
[   60.115780] [drm] Initialized ast 0.1.0 20120228 for 0000:c2:00.0 on minor 0
[   60.135643] fbcon: astdrmfb (fb0) is primary device
[   60.135649] fbcon: Deferring console take-over
[   60.146162] ast 0000:c2:00.0: [drm] fb0: astdrmfb frame buffer device
[   60.331802] input: OpenBMC virtual_input as /devices/pci0000:00/0000:00:07.1/0000:02:00.4/usb3/3-2/3-2.6/3-2.6:1.0/0003:1D6B:0104.0002/input/input5
[   60.405807] hid-generic 0003:1D6B:0104.0002: input,hidraw0: USB HID v1.01 Keyboard [OpenBMC virtual_input] on usb-0000:02:00.4-2.6/input0
[   60.423774] input: OpenBMC virtual_input as /devices/pci0000:00/0000:00:07.1/0000:02:00.4/usb3/3-2/3-2.6/3-2.6:1.1/0003:1D6B:0104.0004/input/input6
[   60.443170] hid-generic 0003:1D6B:0104.0004: input,hidraw1: USB HID v1.01 Mouse [OpenBMC virtual_input] on usb-0000:02:00.4-2.6/input1
[   60.460675] ast 0000:c2:00.0: vgaarb: deactivate vga console
[   60.479996] ast 0000:c2:00.0: [drm] Using default configuration
[   60.486603] ast 0000:c2:00.0: [drm] AST 2600 detected
[   60.492249] ast 0000:c2:00.0: [drm] Using ASPEED DisplayPort transmitter
[   60.499732] ast 0000:c2:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
[   60.508955] BUG: unable to handle page fault for address: ffff8881e98109f0
[   60.516623] #PF: supervisor write access in kernel mode
[   60.522449] #PF: error_code(0x0002) - not-present page
[   60.528168] PGD 8dbc01067 P4D 8dbc01067 PUD 104c984067 PMD 104c837067 PTE 800ffffe167ef060
[   60.537394] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
[   60.543805] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W          6.6.0+ #3
[   60.552251] Hardware name: AMD Corporation ONYX/ONYX, BIOS ROX100AB 09/14/2023
[   60.560309] Workqueue: events work_for_cpu_fn
[   60.565173] RIP: 0010:enqueue_timer (/home/amd/git/linux/./include/linux/list.h:1034 /home/amd/git/linux/kernel/time/timer.c:605)
[   60.570129] Code: 44 00 00 55 48 89 e5 41 55 49 89 cd 41 54 49 89 fc 53 48 89 f3 89 d6 48 8d 84 f7 b0 00 00 00 48 8b 08 48 89 0b 48 85 c9 74 04 <48> 89 59 08 48 89 18 48 89 43 08 49 8d 44 24 68 48 0f ab 30 8b 4b
All code
========
    0:   44 00 00                add    %r8b,(%rax)
    3:   55                      push   %rbp
    4:   48 89 e5                mov    %rsp,%rbp
    7:   41 55                   push   %r13
    9:   49 89 cd                mov    %rcx,%r13
    c:   41 54                   push   %r12
    e:   49 89 fc                mov    %rdi,%r12
   11:   53                      push   %rbx
   12:   48 89 f3                mov    %rsi,%rbx
   15:   89 d6                   mov    %edx,%esi
   17:   48 8d 84 f7 b0 00 00    lea    0xb0(%rdi,%rsi,8),%rax
   1e:   00
   1f:   48 8b 08                mov    (%rax),%rcx
   22:   48 89 0b                mov    %rcx,(%rbx)
   25:   48 85 c9                test   %rcx,%rcx
   28:   74 04                   je     0x2e
  2a:*  48 89 59 08             mov    %rbx,0x8(%rcx)           <-- trapping instruction
   2e:   48 89 18                mov    %rbx,(%rax)
   31:   48 89 43 08             mov    %rax,0x8(%rbx)
   35:   49 8d 44 24 68          lea    0x68(%r12),%rax
   3a:   48 0f ab 30             bts    %rsi,(%rax)
   3e:   8b                      .byte 0x8b
   3f:   4b                      rex.WXB

Code starting with the faulting instruction
===========================================
    0:   48 89 59 08             mov    %rbx,0x8(%rcx)
    4:   48 89 18                mov    %rbx,(%rax)
    7:   48 89 43 08             mov    %rax,0x8(%rbx)
    b:   49 8d 44 24 68          lea    0x68(%r12),%rax
   10:   48 0f ab 30             bts    %rsi,(%rax)
   14:   8b                      .byte 0x8b
   15:   4b                      rex.WXB
[   60.591081] RSP: 0018:ffffc900000dbbe0 EFLAGS: 00010086
[   60.596908] RAX: ffff888fd59e31b8 RBX: ffff8881ec87c9e8 RCX: ffff8881e98109e8
[   60.604866] RDX: 0000000000000099 RSI: 0000000000000099 RDI: ffff888fd59e2c40
[   60.612826] RBP: ffffc900000dbbf8 R08: 0000000000000001 R09: ffff888fd59e2c40
[   60.620787] R10: 000000000000550d R11: 0000000000000000 R12: ffff888fd59e2c40
[   60.628748] R13: 00000000ffff1640 R14: 00000000ffff163c R15: 0000000000000000
[   60.636706] FS:  0000000000000000(0000) GS:ffff888fd5800000(0000) knlGS:0000000000000000
[   60.645732] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   60.652141] CR2: ffff8881e98109f0 CR3: 00000008d5e3c003 CR4: 0000000000770ef0
[   60.660101] PKRU: 55555554
[   60.663114] Call Trace:
[   60.665838]  <TASK>
[   60.668174] ? show_regs (/home/amd/git/linux/arch/x86/kernel/dumpstack.c:479)
[   60.671971] ? __die (/home/amd/git/linux/arch/x86/kernel/dumpstack.c:421 /home/amd/git/linux/arch/x86/kernel/dumpstack.c:434)
[   60.675375] ? page_fault_oops (/home/amd/git/linux/arch/x86/mm/fault.c:707)
[   60.679942] ? search_bpf_extables (/home/amd/git/linux/kernel/bpf/core.c:765)
[   60.684800] ? enqueue_timer (/home/amd/git/linux/./include/linux/list.h:1034 /home/amd/git/linux/kernel/time/timer.c:605)
[   60.689077] ? srso_alias_return_thunk (/home/amd/git/linux/arch/x86/lib/retpoline.S:181)
[   60.694422] ? search_exception_tables (/home/amd/git/linux/kernel/extable.c:64)
[   60.699571] ? srso_alias_return_thunk (/home/amd/git/linux/arch/x86/lib/retpoline.S:181)
[   60.704917] ? kernelmode_fixup_or_oops (/home/amd/git/linux/arch/x86/mm/fault.c:762)
[   60.710256] ? __bad_area_nosemaphore (/home/amd/git/linux/arch/x86/mm/fault.c:860)
[   60.715505] ? bad_area_nosemaphore (/home/amd/git/linux/arch/x86/mm/fault.c:867)
[   60.720364] ? do_kern_addr_fault (/home/amd/git/linux/arch/x86/mm/fault.c:1227)
[   60.725030] ? exc_page_fault (/home/amd/git/linux/arch/x86/mm/fault.c:1503 /home/amd/git/linux/arch/x86/mm/fault.c:1561)
[   60.729503] ? asm_exc_page_fault (/home/amd/git/linux/./arch/x86/include/asm/idtentry.h:570)
[   60.734174] ? enqueue_timer (/home/amd/git/linux/./include/linux/list.h:1034 /home/amd/git/linux/kernel/time/timer.c:605)
[   60.738453] __mod_timer (/home/amd/git/linux/kernel/time/timer.c:635 /home/amd/git/linux/kernel/time/timer.c:1131)
[   60.742439] ? local_clock_noinstr (/home/amd/git/linux/kernel/sched/clock.c:301)
[   60.747202] add_timer (/home/amd/git/linux/kernel/time/timer.c:1245)
[   60.750798] __queue_delayed_work (/home/amd/git/linux/kernel/workqueue.c:1962)
[   60.755463] queue_delayed_work_on (/home/amd/git/linux/kernel/workqueue.c:1987)
[   60.760226] drm_kms_helper_poll_enable (/home/amd/git/linux/drivers/gpu/drm/drm_probe_helper.c:310) drm_kms_helper
[   60.767229] drm_kms_helper_poll_init (/home/amd/git/linux/drivers/gpu/drm/drm_probe_helper.c:914) drm_kms_helper
[   60.773936] ast_mode_config_init (/home/amd/git/linux/drivers/gpu/drm/ast/ast_mode.c:1931) ast
[   60.779382] ast_device_create (/home/amd/git/linux/drivers/gpu/drm/ast/ast_main.c:518) ast
[   60.784533] ast_pci_probe (/home/amd/git/linux/drivers/gpu/drm/ast/ast_drv.c:106) ast
[   60.789107] local_pci_probe (/home/amd/git/linux/drivers/pci/pci-driver.c:324)
[   60.793292] work_for_cpu_fn (/home/amd/git/linux/kernel/workqueue.c:5621)
[   60.797471] process_one_work (/home/amd/git/linux/kernel/workqueue.c:2630)
[   60.801941] ? process_one_work (/home/amd/git/linux/kernel/workqueue.c:2605)
[   60.806608] worker_thread (/home/amd/git/linux/kernel/workqueue.c:2697 /home/amd/git/linux/kernel/workqueue.c:2784)
[   60.810790] ? __pfx_worker_thread (/home/amd/git/linux/kernel/workqueue.c:2730)
[   60.815554] kthread (/home/amd/git/linux/kernel/kthread.c:388)
[   60.819151] ? __pfx_kthread (/home/amd/git/linux/kernel/kthread.c:341)
[   60.823331] ret_from_fork (/home/amd/git/linux/arch/x86/kernel/process.c:147)
[   60.827318] ? __pfx_kthread (/home/amd/git/linux/kernel/kthread.c:341)
[   60.831498] ret_from_fork_asm (/home/amd/git/linux/arch/x86/entry/entry_64.S:250)
[   60.835878]  </TASK>
[   60.838309] Modules linked in: crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 ast(+) i2c_algo_bit drm_shmem_helper hid_generic(+) drm_kms_helper uas dax_hmem nvme usbhid usb_storage drm hid ahci(+) libahci i2c_piix4 nvme_core wmi aesni_intel crypto_simd cryptd
[   60.867920] CR2: ffff8881e98109f0
[   60.871616] ---[ end trace 0000000000000000 ]---

drivers/gpu/drm/drm_probe_helper.c:310 is the
dev->mode_config.poll_running assignment here:

void drm_kms_helper_poll_enable(struct drm_device *dev)
{
     if (!dev->mode_config.poll_enabled || !drm_kms_helper_poll ||
         dev->mode_config.poll_running)
         return;

     if (drm_kms_helper_enable_hpd(dev) ||
         dev->mode_config.delayed_event)
         reschedule_output_poll_work(dev);

     dev->mode_config.poll_running = true;           <<<<< HERE
}
EXPORT_SYMBOL(drm_kms_helper_poll_enable);


Hi,

Thanks for the detailed bug report.

From the call stack, I think the crash is more likely here:
https://elixir.bootlin.com/linux/v6.6/source/kernel/time/timer.c#L605

But the timer looks correctly initialized in
https://elixir.bootlin.com/linux/v6.6/source/drivers/gpu/drm/drm_probe_helper.c#L908

So I'm not sure why it fails in this case.
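
For reference, list.h:1034 sits inside hlist_add_head(), which enqueue_timer() uses to put the timer into its wheel bucket. Below is a minimal userspace sketch of that helper, annotated with the instructions from the decoded oops. The struct definitions are stand-ins for the kernel's, the 64-bit layout is assumed, and pprev_store_address() is a hypothetical helper added only for illustration. Read this way, the trapping store goes to the pprev field of the entry already at the head of the bucket (RCX), not to the timer being added (RBX), and RCX + 8 matches CR2. That is only a reading of the register dump, not a confirmed root cause.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's hlist types. */
struct hlist_node {
	struct hlist_node *next;   /* offset 0x0 */
	struct hlist_node **pprev; /* offset 0x8 with 64-bit pointers */
};

struct hlist_head {
	struct hlist_node *first;
};

/* Same steps as the kernel's hlist_add_head(), minus WRITE_ONCE(). */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first;

	assert(n != NULL && h != NULL);

	first = h->first;                /* mov (%rax),%rcx            */
	n->next = first;                 /* mov %rcx,(%rbx)            */
	if (first)                       /* test %rcx,%rcx ; je        */
		first->pprev = &n->next; /* mov %rbx,0x8(%rcx)  <-- trap */
	h->first = n;                    /* mov %rbx,(%rax)            */
	n->pprev = &h->first;            /* mov %rax,0x8(%rbx)         */
}

/*
 * Hypothetical helper: the address written by "first->pprev = ..."
 * for a given value of the old bucket head pointer.
 */
static uint64_t pprev_store_address(uint64_t first)
{
	return first + offsetof(struct hlist_node, pprev);
}
```

With RCX = ffff8881e98109e8 from the dump, pprev_store_address() gives ffff8881e98109f0 on a 64-bit build, which is the faulting address reported in CR2.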


If I revert commit f81bb0ac7872893241319ea82504956676ef02fd
("drm/ast: report connection status on Display Port."), the splat
goes away:

[   60.603837] input: OpenBMC virtual_input as /devices/pci0000:00/0000:00:07.1/0000:02:00.4/usb3/3-2/3-2.6/3-2.6:1.0/0003:1D6B:0104.0002/input/input4
[   60.651733] ast 0000:c2:00.0: vgaarb: deactivate vga console
[   60.659978]  4k 16711104 large 0 gb 0 x 1303[ffff888000097000-ffff8880a7ffe000] miss 383488
[   60.669321] ok.
[   60.670497] ast 0000:c2:00.0: [drm] Using default configuration
[   60.677894] ast 0000:c2:00.0: [drm] AST 2600 detected
[   60.683545] ast 0000:c2:00.0: [drm] Using ASPEED DisplayPort transmitter
[   60.685381] hid-generic 0003:1D6B:0104.0002: input,hidraw0: USB HID v1.01 Keyboard [OpenBMC virtual_input] on usb-0000:02:00.4-2.6/input0
[   60.691032] ast 0000:c2:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
[   60.697172] [drm] Initialized ast 0.1.0 20120228 for 0000:c2:00.0 on minor 0
[   60.729565] fbcon: astdrmfb (fb0) is primary device
[   60.729570] fbcon: Deferring console take-over
[   60.741322] ast 0000:c2:00.0: [drm] fb0: astdrmfb frame buffer device
[   60.928226] ast 0000:c2:00.0: vgaarb: deactivate vga console
[   60.940376] input: OpenBMC virtual_input as /devices/pci0000:00/0000:00:07.1/0000:02:00.4/usb3/3-2/3-2.6/3-2.6:1.0/0003:1D6B:0104.0002/input/input5
[   60.965436] ast 0000:c2:00.0: [drm] Using default configuration
[   60.972051] ast 0000:c2:00.0: [drm] AST 2600 detected
[   60.977698] ast 0000:c2:00.0: [drm] Using ASPEED DisplayPort transmitter
[   60.985181] ast 0000:c2:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
[   61.000056] [drm] Initialized ast 0.1.0 20120228 for 0000:c2:00.0 on minor 0
[   61.013486] fbcon: Deferring console take-over
[   61.016918] hid-generic 0003:1D6B:0104.0002: input,hidraw0: USB HID v1.01 Keyboard [OpenBMC virtual_input] on usb-0000:02:00.4-2.6/input0
[   61.018454] ast 0000:c2:00.0: [drm] fb0: astdrmfb frame buffer device
[   61.040853] input: OpenBMC virtual_input as /devices/pci0000:00/0000:00:07.1/0000:02:00.4/usb3/3-2/3-2.6/3-2.6:1.1/0003:1D6B:0104.0004/input/input6
[   61.059112] hid-generic 0003:1D6B:0104.0004: input,hidraw1: USB HID v1.01 Mouse [OpenBMC virtual_input] on usb-0000:02:00.4-2.6/input1
[   61.358397] input: OpenBMC virtual_input as /devices/pci0000:00/0000:00:07.1/0000:02:00.4/usb3/3-2/3-2.6/3-2.6:1.1/0003:1D6B:0104.0004/input/input7
[   61.376885] hid-generic 0003:1D6B:0104.0004: input,hidraw1: USB HID v1.01 Mouse [OpenBMC virtual_input] on usb-0000:02:00.4-2.6/input1

This has happened before when drm_kms_helper_poll_init() was added
to an ast connector_init(), see:

In that previous case the crash was in the detect() callback.
This time it crashes when setting the timer, but the two still look very similar. Thanks for pointing this out.

commit 595cb5e0b832a3e100cbbdefef797b0c27bf725a
Author: Kim Phillips <kim.phillips@xxxxxxx>
Date:   Thu Oct 21 10:30:06 2021 -0500

     Revert "drm/ast: Add detect function support"

I'm willing to test any proposed changes, especially if it means
not reverting this commit too, because another revert will likely
only lead to yet another BUG instance if/when another poll_init() gets
added in the future.  Should the FIXME described in
reschedule_output_poll_work() be addressed?


That FIXME only changes the timer interval from 10s to 1s, so it shouldn't explain this crash.
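
To make that concrete, here is a small userspace sketch of how I read the delay selection in reschedule_output_poll_work(). It is a paraphrase, not a copy of the kernel source: SKETCH_HZ is an assumed tick rate (the real HZ is a build-time option), SKETCH_OUTPUT_POLL_PERIOD stands in for DRM_OUTPUT_POLL_PERIOD, and output_poll_delay() is a hypothetical name used only here.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed tick rate for this sketch only. */
#define SKETCH_HZ 1000UL
/* Stand-in for DRM_OUTPUT_POLL_PERIOD, i.e. a 10 second period. */
#define SKETCH_OUTPUT_POLL_PERIOD (10 * SKETCH_HZ)

/*
 * Hypothetical helper: the delay, in jiffies, that the poll work
 * would be scheduled with. Only the length of the delay depends on
 * delayed_event; the delayed_work and its timer are the same object
 * in both cases.
 */
static unsigned long output_poll_delay(bool delayed_event)
{
	unsigned long delay = SKETCH_OUTPUT_POLL_PERIOD;

	if (delayed_event)
		delay = SKETCH_HZ; /* the short 1s delay the FIXME is about */

	return delay;
}

/* Small usage check of the two cases. */
static void output_poll_delay_check(void)
{
	assert(output_poll_delay(false) == 10 * SKETCH_HZ);
	assert(output_poll_delay(true) == SKETCH_HZ);
}
```

Either way the same timer is armed through the same path, which is why changing the interval alone should not make the enqueue fault appear or disappear.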

Can you test with the attached patch? I want to see whether the detect callback is called, and also make sure the delayed_work struct is properly initialized.


Thanks,

Kim

Best regards,

--

Jocelyn
From 0a9f2d81af1087ab3648206dfce160639385a6b6 Mon Sep 17 00:00:00 2001
From: Jocelyn Falempe <jfalempe@xxxxxxxxxx>
Date: Thu, 9 Nov 2023 14:39:22 +0100
Subject: [PATCH] drm/probe-helper: Add debug for AST poll bug

Signed-off-by: Jocelyn Falempe <jfalempe@xxxxxxxxxx>
---
 drivers/gpu/drm/drm_probe_helper.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/drm_probe_helper.c b/drivers/gpu/drm/drm_probe_helper.c
index 3f479483d7d8..9aa95791891a 100644
--- a/drivers/gpu/drm/drm_probe_helper.c
+++ b/drivers/gpu/drm/drm_probe_helper.c
@@ -753,6 +753,8 @@ static void output_poll_execute(struct work_struct *work)
 	bool repoll = false, changed;
 	u64 old_epoch_counter;
 
+	pr_info("ASTDEBUG poll execute delayed_work %p, drm_dev %p\n", delayed_work, dev);
+
 	if (!dev->mode_config.poll_enabled)
 		return;
 
@@ -910,6 +912,8 @@ void drm_kms_helper_poll_init(struct drm_device *dev)
 	INIT_DELAYED_WORK(&dev->mode_config.output_poll_work, output_poll_execute);
 	dev->mode_config.poll_enabled = true;
 
+	pr_info("ASTDEBUG poll init delayed work %p timer %p drm_dev %p timer flag 0x%0x\n", &dev->mode_config.output_poll_work, &dev->mode_config.output_poll_work.timer, dev, dev->mode_config.output_poll_work.timer.flags);
+
 	drm_kms_helper_poll_enable(dev);
 }
 EXPORT_SYMBOL(drm_kms_helper_poll_init);
-- 
2.41.0

