Hello,

> From: Zhuo, Qiuxu <qiuxu.zhuo@xxxxxxxxx>
> Sent: Friday, May 31, 2024 3:45 PM
> To: maarten.lankhorst@xxxxxxxxxxxxxxx; mripard@xxxxxxxxxx;
> tzimmermann@xxxxxxx; airlied@xxxxxxxxx; daniel@xxxxxxxx
> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Luck, Tony
> <tony.luck@xxxxxxxxx>; Zhuo, Qiuxu <qiuxu.zhuo@xxxxxxxxx>; Wang, Yudong
> <yudong.wang@xxxxxxxxx>
> Subject: [PATCH 1/1] drm/fb-helper: Don't schedule_work() to flush frame
> buffer during panic()
>
> Sometimes the system [1] hangs on x86 I/O machine checks. However, the
> expected behavior is to reboot the system, as the machine check handler
> ultimately triggers a panic(), initiating a reboot in the last step.
>
> The root cause is that sometimes the panic() is blocked when
> drm_fb_helper_damage() invokes schedule_work() to flush the frame buffer.
> This occurs during the process of flushing all messages to the frame buffer
> driver, as shown in the following call trace:
>
>   Machine check occurs [2]:
>     panic()
>       console_flush_on_panic()
>         console_flush_all()
>           console_emit_next_record()
>             con->write()
>               vt_console_print()
>                 hide_cursor()
>                   vc->vc_sw->con_cursor()
>                     fbcon_cursor()
>                       ops->cursor()
>                         bit_cursor()
>                           soft_cursor()
>                             info->fbops->fb_imageblit()
>                               drm_fbdev_generic_defio_imageblit()
>                                 drm_fb_helper_damage_area()
>                                   drm_fb_helper_damage()
>                                     schedule_work() // <--- blocked here
>       ...
>       emergency_restart() // wasn't invoked, so no reboot.
>
> During panic(), all CPUs except the panic CPU are stopped.
> In schedule_work(), the panic CPU requires the lock of worker_pool to queue
> the work on that pool, while the lock may have been taken by some other
> stopped CPU. So schedule_work() is blocked.
>
> Additionally, during a panic(), since there is no opportunity to execute any
> scheduled work, it's safe to fix this issue by skipping schedule_work() on
> 'oops_in_progress' in drm_fb_helper_damage().
>
> [1] Enable the kernel option CONFIG_FRAMEBUFFER_CONSOLE,
>     CONFIG_DRM_FBDEV_EMULATION, and boot with the 'console=tty0'
>     kernel command line parameter.
>
> [2] Set 'panic_timeout' to a non-zero value before calling panic().
>
> Reported-by: Yudong Wang <yudong.wang@xxxxxxxxx>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@xxxxxxxxx>
> ---
>  drivers/gpu/drm/drm_fb_helper.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
> index d612133e2cf7..6d7b6f038821 100644
> --- a/drivers/gpu/drm/drm_fb_helper.c
> +++ b/drivers/gpu/drm/drm_fb_helper.c
> @@ -628,6 +628,9 @@ static void drm_fb_helper_add_damage_clip(struct drm_fb_helper *helper, u32 x, u
>  static void drm_fb_helper_damage(struct drm_fb_helper *helper, u32 x, u32 y,
>  				 u32 width, u32 height)
>  {
> +	if (oops_in_progress)
> +		return;
> +
>  	drm_fb_helper_add_damage_clip(helper, x, y, width, height);
>
>  	schedule_work(&helper->damage_work);
> --

A gentle ping on this patch.

Updated with recent error injection test results:

  - Without the patch, we typically reproduced the issue [1] once in 100 cycles.
  - With the patch, we tested it on 3 systems and passed a total of 1500 cycles.

[1] The system got blocked at drm_fb_helper_damage() -> schedule_work() without
    reboot. For details, please see the commit message.

Thanks!
-Qiuxu
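
P.S. For anyone who wants to look at the guard in isolation, below is a minimal userspace sketch of the early return described in the commit message. It is not kernel code: `model_helper`, `model_schedule_work()` and `model_damage()` are hypothetical stand-ins made up for illustration, and `oops_in_progress` here is a plain local variable rather than the kernel's global. It only models the control flow (no locking, no real workqueue).

```c
/* Userspace model only: names mirror the kernel ones but are stand-ins. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's oops_in_progress flag (assumption: non-zero
 * while a panic/oops is being handled). */
static int oops_in_progress;

/* Hypothetical, simplified tracker; not the real struct drm_fb_helper. */
struct model_helper {
	unsigned int clips_added;	/* times a damage clip was recorded */
	unsigned int work_queued;	/* times the flush work was queued */
};

/* Stand-in for schedule_work(). In the kernel this is the call that needs
 * the worker_pool lock; here it only counts invocations. */
static void model_schedule_work(struct model_helper *helper)
{
	helper->work_queued++;
}

/* Mirrors the shape of the patched drm_fb_helper_damage(): return before
 * recording damage or queuing work while a panic is in progress. */
static void model_damage(struct model_helper *helper)
{
	assert(helper != NULL);

	if (oops_in_progress)
		return;

	helper->clips_added++;
	model_schedule_work(helper);
}
```

In this model, calling `model_damage()` with `oops_in_progress` set leaves both counters untouched, which corresponds to the patched function never reaching schedule_work() on the panic path. Since no queued work can run after the other CPUs are stopped, nothing is lost by skipping it.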