Re: [Regression] CPU stalls and eventually causes a complete system freeze with 6.0.3 due to "video/aperture: Disable and unregister sysfb devices via aperture helpers"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Andreas

Am 24.10.22 um 18:19 schrieb Andreas Thalhammer:
Am 24.10.22 um 13:31 schrieb Thomas Zimmermann:
Hi

Am 24.10.22 um 13:27 schrieb Greg KH:
On Mon, Oct 24, 2022 at 12:41:43PM +0200, Thorsten Leemhuis wrote:
Hi! Thx for the reply.

On 24.10.22 12:26, Thomas Zimmermann wrote:
Am 23.10.22 um 10:04 schrieb Thorsten Leemhuis:

I noticed a regression report in bugzilla.kernel.org. As many (most?)
kernel developer don't keep an eye on it, I decided to forward it by
mail. Quoting from
https://bugzilla.kernel.org/show_bug.cgi?id=216616 ; :

   Andreas 2022-10-22 14:25:32 UTC

Created attachment 303074 [details]
dmesg

I've looked at the kernel log and found that simpledrm has been loaded
*after* amdgpu, which should never happen. The problematic patch has
been taken from a long list of refactoring work on this code. No wonder
that it doesn't work as expected.

Please cherry-pick commit 9d69ef183815 ("fbdev/core: Remove
remove_conflicting_pci_framebuffers()") into the 6.0 stable branch and
report on the results. It should fix the problem.

Greg, is that enough for you to pick this up? Or do you want Andreas to
test first if it really fixes the reported problem?

This should be good enough.  If this does NOT fix the issue, please let
me know.

Thanks a lot. I think I can provided a dedicated fix if the proposed
commit doesn't work.

Best regards
Thomas


thanks,

greg k-h


Thanks... In short: the additional patch did NOT fix the problem.

Yeah, it's also part of a larger changeset. But I wouldn't want to backport all those changes either.

Attached is a simple patch for linux-stable that adds the necessary fix. If this still doesn't work, we should probably revert the problematic patch.

Please test the patch and let me know if it works.

Best regards
Thomas


I don't use git and I don't know how to /cherry-pick commit/
9d69ef183815, but I found the patch here:
https://patchwork.freedesktop.org/patch/494609/

I hope that's the right one. I reintegrated
v2-07-11-video-aperture-Disable-and-unregister-sysfb-devices-via-aperture-helpers.patch
and also applied
v2-04-11-fbdev-core-Remove-remove_conflicting_pci_framebuffers.patch,
did a "make mrproper" and thereafter compiled a clean new 6.0.3 kernel
(same .config).

Now the system doesn't even boot to a console. The first boot got me to
a rcu_shed stall on CPUs/tasks, same as above, but this time with:
Workqueue: btrfs-cache btrfs_work_helper

I booted a second time with the same kernel, and it got stuck after
mounting the root btrfs filesystem (what looked like a total freeze, but
when it didn't show a rcu_stall message after ~2 min I got impatient and
wanted to see if I had just busted my root filesystem...)

I booted 6.0.2 and everything is fine. (I'm very glad! I definitely
should update my backup right away!)

I will try 6.1-rc1 next, bear with...


--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev
From ba55e238e64817a2369a267153a5b980683465a1 Mon Sep 17 00:00:00 2001
From: Thomas Zimmermann <tzimmermann@xxxxxxx>
Date: Tue, 25 Oct 2022 09:38:44 +0200
Subject: [PATCH] video/aperture: Call sysfb_disable() before removing PCI
 devices

Call sysfb_disable() from aperture_remove_conflicting_pci_devices()
before removing PCI devices. Without, simpledrm can still bind to
simple-framebuffer devices after the hardware driver has taken over
the hardware. Both drivers interfere with each other and results are
undefined.

Reported modesetting errors are shown below.

---- snap ----
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 7 jiffies s: 165 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x00000008
Call Trace:
 <TASK>
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>
...
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 13-.... } 30 jiffies s: 169 root: 0x2000/.
rcu: blocking rcu_node structures (internal RCU debug):
Task dump for CPU 13:
task:X               state:R  running task     stack:    0 pid: 4242 ppid:  4228 flags:0x0000400e
Call Trace:
 <TASK>
 ? memcpy_toio+0x76/0xc0
 ? memcpy_toio+0x1b/0xc0
 ? drm_fb_memcpy_toio+0x76/0xb0
 ? drm_fb_blit_toio+0x75/0x2b0
 ? simpledrm_simple_display_pipe_update+0x132/0x150
 ? drm_atomic_helper_commit_planes+0xb6/0x230
 ? drm_atomic_helper_commit_tail+0x44/0x80
 ? commit_tail+0xd7/0x130
 ? drm_atomic_helper_commit+0x126/0x150
 ? drm_atomic_commit+0xa4/0xe0
 ? drm_plane_get_damage_clips.cold+0x1c/0x1c
 ? drm_atomic_helper_dirtyfb+0x19e/0x280
 ? drm_mode_dirtyfb_ioctl+0x10f/0x1e0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? drm_ioctl_kernel+0xc4/0x150
 ? drm_ioctl+0x246/0x3f0
 ? drm_mode_getfb2_ioctl+0x2d0/0x2d0
 ? __x64_sys_ioctl+0x91/0xd0
 ? do_syscall_64+0x60/0xd0
 ? entry_SYSCALL_64_after_hwframe+0x4b/0xb5
 </TASK>

The problem was introduced by backporting commit 5e0137612430
("video/aperture: Disable and unregister sysfb devices via aperture
 helpers") to v6.0.3 and does not exist in the mainline branch.

Reported-by: Andreas Thalhammer <andreas.thalhammer-linux@xxxxxxx>
Reported-by: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>
Signed-off-by: Thomas Zimmermann <tzimmermann@xxxxxxx>
Fixes: cfecfc98a78d ("video/aperture: Disable and unregister sysfb devices via aperture helpers")
Cc: Thomas Zimmermann <tzimmermann@xxxxxxx>
Cc: Javier Martinez Canillas <javierm@xxxxxxxxxx>
Cc: Zack Rusin <zackr@xxxxxxxxxx>
Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
Cc: Daniel Vetter <daniel@xxxxxxxx>
Cc: Sam Ravnborg <sam@xxxxxxxxxxxx>
Cc: Helge Deller <deller@xxxxxx>
Cc: Alex Deucher <alexander.deucher@xxxxxxx>
Cc: Zhen Lei <thunder.leizhen@xxxxxxxxxx>
Cc: Changcheng Deng <deng.changcheng@xxxxxxxxxx>
Cc: Maarten Lankhorst <maarten.lankhorst@xxxxxxxxxxxxxxx>
Cc: Maxime Ripard <mripard@xxxxxxxxxx>
Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
Cc: Sasha Levin <sashal@xxxxxxxxxx>
Cc: linux-fbdev@xxxxxxxxxxxxxxx
Cc: <stable@xxxxxxxxxxxxxxx> # v6.0.3+
Link: https://lore.kernel.org/dri-devel/d6afe54b-f8d7-beb2-3609-186e566cbfac@xxxxxxx/T/#t
---
 drivers/video/aperture.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/video/aperture.c b/drivers/video/aperture.c
index d245826a9324d..cc6427a091bc7 100644
--- a/drivers/video/aperture.c
+++ b/drivers/video/aperture.c
@@ -338,6 +338,17 @@ int aperture_remove_conflicting_pci_devices(struct pci_dev *pdev, const char *na
 	resource_size_t base, size;
 	int bar, ret;
 
+	/*
+	 * If a driver asked to unregister a platform device registered by
+	 * sysfb, then can be assumed that this is a driver for a display
+	 * that is set up by the system firmware and has a generic driver.
+	 *
+	 * Drivers for devices that don't have a generic driver will never
+	 * ask for this, so let's assume that a real driver for the display
+	 * was already probed and prevent sysfb to register devices later.
+	 */
+	sysfb_disable();
+
 	/*
 	 * WARNING: Apparently we must kick fbdev drivers before vgacon,
 	 * otherwise the vga fbdev driver falls over.
-- 
2.38.0

Attachment: OpenPGP_signature
Description: OpenPGP digital signature


[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux