On 12/15/22 16:02, Alex Hung wrote:
>
>
> On 2022-12-15 08:17, Harry Wentland wrote:
>>
>>
>> On 12/15/22 05:29, Michel Dänzer wrote:
>>> On 12/15/22 09:09, Christian König wrote:
>>>> On 15.12.22 00:33, Alex Hung wrote:
>>>>> On 2022-12-14 16:06, Alex Deucher wrote:
>>>>>> On Wed, Dec 14, 2022 at 5:56 PM Alex Hung <alex.hung@xxxxxxx> wrote:
>>>>>>> On 2022-12-14 15:35, Alex Deucher wrote:
>>>>>>>> On Wed, Dec 14, 2022 at 5:25 PM Alex Hung <alex.hung@xxxxxxx> wrote:
>>>>>>>>> On 2022-12-14 14:54, Alex Deucher wrote:
>>>>>>>>>> On Wed, Dec 14, 2022 at 4:50 PM Alex Hung <alex.hung@xxxxxxx> wrote:
>>>>>>>>>>> On 2022-12-14 13:48, Alex Deucher wrote:
>>>>>>>>>>>> On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai
>>>>>>>>>>>> <aurabindo.pillai@xxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> From: Alex Hung <alex.hung@xxxxxxx>
>>>>>>>>>>>>> <snip>
>>
>> It can come through handle_hpd_rx_irq, but we're using a workqueue
>> to queue interrupt handling, so this shouldn't come from an atomic
>> context. I currently don't see where else it might be used in an
>> atomic context. Alex Hung, can you do a dump_stack() in this function
>> to see where the problematic call is coming from?
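For anyone reproducing this, the temporary instrumentation being asked for is just a dump_stack() dropped into the suspect function. A minimal sketch, assuming the dce110_blank_stream() signature at the time; this is a throwaway debug hunk, not a proposed patch:

```c
/* Sketch only, not a patch: dump_stack() prints the current call chain
 * to dmesg, showing exactly who reached dce110_blank_stream() when the
 * suspect msleep() runs.
 */
void dce110_blank_stream(struct pipe_ctx *pipe_ctx)
{
	/* Temporary: record the caller before the suspect delay. */
	dump_stack();

	/* ... existing blanking logic continues unchanged ... */
}
```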
>
>
> IGT's kms_bw executes as below (when passing)
>
> IGT-Version: 1.26-gf4067678 (x86_64) (Linux: 5.19.0-99-custom x86_64)
> Starting subtest: linear-tiling-1-displays-1920x1080p
> Subtest linear-tiling-1-displays-1920x1080p: SUCCESS (0.225s)
> Starting subtest: linear-tiling-1-displays-2560x1440p
> Subtest linear-tiling-1-displays-2560x1440p: SUCCESS (0.111s)
> Starting subtest: linear-tiling-1-displays-3840x2160p
> Subtest linear-tiling-1-displays-3840x2160p: SUCCESS (0.118s)
> Starting subtest: linear-tiling-2-displays-1920x1080p
> Subtest linear-tiling-2-displays-1920x1080p: SUCCESS (0.409s)
> Starting subtest: linear-tiling-2-displays-2560x1440p
> Subtest linear-tiling-2-displays-2560x1440p: SUCCESS (0.417s)
> Starting subtest: linear-tiling-2-displays-3840x2160p
> Subtest linear-tiling-2-displays-3840x2160p: SUCCESS (0.444s)
> Starting subtest: linear-tiling-3-displays-1920x1080p
> Subtest linear-tiling-3-displays-1920x1080p: SUCCESS (0.547s)
> Starting subtest: linear-tiling-3-displays-2560x1440p
> Subtest linear-tiling-3-displays-2560x1440p: SUCCESS (0.555s)
> Starting subtest: linear-tiling-3-displays-3840x2160p
> Subtest linear-tiling-3-displays-3840x2160p: SUCCESS (0.586s)
> Starting subtest: linear-tiling-4-displays-1920x1080p
> Subtest linear-tiling-4-displays-1920x1080p: SUCCESS (0.734s)
> Starting subtest: linear-tiling-4-displays-2560x1440p
> Subtest linear-tiling-4-displays-2560x1440p: SUCCESS (0.742s)
> Starting subtest: linear-tiling-4-displays-3840x2160p
> Subtest linear-tiling-4-displays-3840x2160p: SUCCESS (0.778s)
> Starting subtest: linear-tiling-5-displays-1920x1080p
> Subtest linear-tiling-5-displays-1920x1080p: SUCCESS (0.734s)
> Starting subtest: linear-tiling-5-displays-2560x1440p
> Subtest linear-tiling-5-displays-2560x1440p: SUCCESS (0.743s)
> Starting subtest: linear-tiling-5-displays-3840x2160p
> Subtest linear-tiling-5-displays-3840x2160p: SUCCESS (0.781s)
> Starting subtest: linear-tiling-6-displays-1920x1080p
> Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
> Test requirement: !(pipe > num_pipes)
> ASIC does not have 5 pipes

Does this IGT patch fix the !(pipe > num_pipes) issue?
https://gitlab.freedesktop.org/hwentland/igt-gpu-tools/-/commit/3bb9ee157642ec2433f01b51e09866da2b0f3dd8

> Subtest linear-tiling-6-displays-1920x1080p: SKIP (0.000s)
> Starting subtest: linear-tiling-6-displays-2560x1440p
> Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
> Test requirement: !(pipe > num_pipes)
> ASIC does not have 5 pipes
> Subtest linear-tiling-6-displays-2560x1440p: SKIP (0.000s)
> Starting subtest: linear-tiling-6-displays-3840x2160p
> Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
> Test requirement: !(pipe > num_pipes)
> ASIC does not have 5 pipes
> Subtest linear-tiling-6-displays-3840x2160p: SKIP (0.000s)
>
> The crash usually occurs when executing "linear-tiling-3-displays-1920x1080p", but it can also occur at "linear-tiling-3-displays-2560x1440p".
>
> ============
> This is the dump_stack right before the failing msleep.
>
> [IGT] kms_bw: starting subtest linear-tiling-3-displays-1920x1080p
> CPU: 1 PID: 76 Comm: kworker/1:1 Not tainted 5.19.0-99-custom #126
> Workqueue: events drm_mode_rmfb_work_fn [drm]
> Call Trace:
> <TASK>
> dump_stack_lvl+0x49/0x63
> dump_stack+0x10/0x16
> dce110_blank_stream.cold+0x5/0x14 [amdgpu]
> core_link_disable_stream+0xe0/0x6b0 [amdgpu]
> ? optc1_set_vtotal_min_max+0x6b/0x80 [amdgpu]
> dcn31_reset_hw_ctx_wrap+0x229/0x410 [amdgpu]
> dce110_apply_ctx_to_hw+0x6e/0x6c0 [amdgpu]
> ? dcn20_plane_atomic_disable+0xb2/0x160 [amdgpu]
> ? dcn20_disable_plane+0x2c/0x60 [amdgpu]
> ? dcn20_post_unlock_program_front_end+0x77/0x2c0 [amdgpu]
> dc_commit_state_no_check+0x39a/0xcd0 [amdgpu]
> ? dc_validate_global_state+0x2ba/0x330 [amdgpu]
> dc_commit_state+0x10f/0x150 [amdgpu]
> amdgpu_dm_atomic_commit_tail+0x631/0x2d30 [amdgpu]

This doesn't look like an atomic (i.e., high IRQ) context to me. So it's
more likely there is some race going on, like Alex Deucher suggested.

Harry

> ? dcn30_internal_validate_bw+0x349/0xa00 [amdgpu]
> ? slab_post_alloc_hook+0x53/0x270
> ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
> ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
> ? dc_validate_global_state+0x27a/0x330 [amdgpu]
> ? slab_post_alloc_hook+0x53/0x270
> ? __kmalloc+0x18c/0x300
> ? drm_dp_mst_atomic_setup_commit+0x8a/0x1a0 [drm_display_helper]
> ? preempt_count_add+0x7c/0xc0
> ? _raw_spin_unlock_irq+0x1f/0x40
> ? wait_for_completion_timeout+0x114/0x140
> ? preempt_count_add+0x7c/0xc0
> ? _raw_spin_unlock_irq+0x1f/0x40
> commit_tail+0x99/0x140 [drm_kms_helper]
> drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
> drm_atomic_commit+0x79/0x100 [drm]
> ? drm_plane_get_damage_clips.cold+0x1d/0x1d [drm]
> drm_framebuffer_remove+0x499/0x510 [drm]
> drm_mode_rmfb_work_fn+0x4b/0x90 [drm]
> process_one_work+0x21d/0x3f0
> worker_thread+0x1fa/0x3c0
> ? process_one_work+0x3f0/0x3f0
> kthread+0xff/0x130
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x22/0x30
> </TASK>
>
>
> ============
> If msleep is replaced by mdelay, the dump_stack is almost the same:
>
> $ diff mdelay.log msleep.log
> < dce110_blank_stream.cold+0x5/0x1f [amdgpu]
> ---
>> dce110_blank_stream.cold+0x5/0x14 [amdgpu]
>
>
> ============
> If the IGT fix is applied (i.e. no crashes when removing buffers[i] in reverse order with "igt_remove_fb"), the dump_stack outputs are the same:
>
> $ diff msleep_igt.log msleep.log
> 2c2
> < CPU: 6 PID: 78 Comm: kworker/6:1 Not tainted 5.19.0-99-custom #126
> ---
>> CPU: 1 PID: 76 Comm: kworker/1:1 Not tainted 5.19.0-99-custom #126
>
> ============
> Note the msleep doesn't always trigger crashes.
> One of the passing dump_stack outputs is as below:
>
> [IGT] kms_bw: starting subtest linear-tiling-2-displays-1920x1080p
> CPU: 6 PID: 78 Comm: kworker/6:1 Not tainted 5.19.0-99-custom #126
> Workqueue: events drm_mode_rmfb_work_fn [drm]
> Call Trace:
> <TASK>
> dump_stack_lvl+0x49/0x63
> dump_stack+0x10/0x16
> dce110_blank_stream.cold+0x5/0x14 [amdgpu]
> core_link_disable_stream+0xe0/0x6b0 [amdgpu]
> ? optc1_set_vtotal_min_max+0x6b/0x80 [amdgpu]
> dcn31_reset_hw_ctx_wrap+0x229/0x410 [amdgpu]
> dce110_apply_ctx_to_hw+0x6e/0x6c0 [amdgpu]
> ? dcn20_plane_atomic_disable+0xb2/0x160 [amdgpu]
> ? dcn20_disable_plane+0x2c/0x60 [amdgpu]
> ? dcn20_post_unlock_program_front_end+0x77/0x2c0 [amdgpu]
> dc_commit_state_no_check+0x39a/0xcd0 [amdgpu]
> ? dc_validate_global_state+0x2ba/0x330 [amdgpu]
> dc_commit_state+0x10f/0x150 [amdgpu]
> amdgpu_dm_atomic_commit_tail+0x631/0x2d30 [amdgpu]
> ? debug_smp_processor_id+0x17/0x20
> ? dc_fpu_end+0x4e/0xd0 [amdgpu]
> ? dcn316_populate_dml_pipes_from_context+0x69/0x2a0 [amdgpu]
> ? dcn31_calculate_wm_and_dlg_fp+0x50/0x530 [amdgpu]
> ? kernel_fpu_end+0x26/0x50
> ? dc_fpu_end+0x4e/0xd0 [amdgpu]
> ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
> ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
> ? dc_validate_global_state+0x27a/0x330 [amdgpu]
> ? slab_post_alloc_hook+0x53/0x270
> ? __kmalloc+0x18c/0x300
> ? drm_dp_mst_atomic_setup_commit+0x8a/0x1a0 [drm_display_helper]
> ? preempt_count_add+0x7c/0xc0
> ? _raw_spin_unlock_irq+0x1f/0x40
> ? wait_for_completion_timeout+0x114/0x140
> ? preempt_count_add+0x7c/0xc0
> ? _raw_spin_unlock_irq+0x1f/0x40
> commit_tail+0x99/0x140 [drm_kms_helper]
> drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
> drm_atomic_commit+0x79/0x100 [drm]
> ? drm_plane_get_damage_clips.cold+0x1d/0x1d [drm]
> drm_framebuffer_remove+0x499/0x510 [drm]
> drm_mode_rmfb_work_fn+0x4b/0x90 [drm]
> process_one_work+0x21d/0x3f0
> worker_thread+0x1fa/0x3c0
> ? process_one_work+0x3f0/0x3f0
> kthread+0xff/0x130
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x22/0x30
> </TASK>
>
>
>
>>
>> Fixing IGT will only mask the issue. Userspace should never be able
>> to put the system in a state where it stops responding entirely. This
>> will need some sort of fix in the kernel.
>>
>> Harry
>>
>>
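A side note on the msleep/mdelay experiment above: msleep() yields the CPU through the scheduler, so it is only legal in process context, while mdelay() busy-waits and is safe (if wasteful) anywhere. When the calling context is genuinely uncertain, kernel code sometimes falls back to a busy-wait; a minimal sketch of that pattern (the helper name is hypothetical, the APIs are real):

```c
/* Sketch of a context-aware delay (blank_delay_ms is a made-up name).
 * in_atomic()/irqs_disabled() are only heuristics: in_atomic() cannot
 * see held spinlocks on !CONFIG_PREEMPT kernels, which is why
 * CONFIG_DEBUG_ATOMIC_SLEEP exists to catch sleeping-in-atomic misuse.
 */
static void blank_delay_ms(unsigned int ms)
{
	if (in_atomic() || irqs_disabled())
		mdelay(ms);	/* busy-wait: legal anywhere, burns CPU */
	else
		msleep(ms);	/* sleeps: process context only */
}
```

That said, both traces above show drm_mode_rmfb_work_fn running in plain workqueue context, which supports the conclusion in this thread that the hang is a race rather than an illegal sleeping context.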