Patch "drm/sched: Check scheduler work queue before calling timeout handling" has been added to the 6.3-stable tree

This is a note to let you know that I've just added the patch titled

    drm/sched: Check scheduler work queue before calling timeout handling

to the 6.3-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     drm-sched-check-scheduler-work-queue-before-calling-.patch
and it can be found in the queue-6.3 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 71dd79df1cac03bdd90c094a3fc3deeb35fd74e6
Author: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
Date:   Wed May 10 09:51:11 2023 -0400

    drm/sched: Check scheduler work queue before calling timeout handling
    
    [ Upstream commit 2da5bffe9eaa5819a868e8eaaa11b3fd0f16a691 ]
    
    During an IGT GPU reset test we again see an oops, despite
    commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
    timeout handling).
    
    That commit uses the ready condition to decide whether to call
    drm_sched_fault, which unwinds the TDR and leads to a GPU reset.
    However, it looks like the ready condition is overloaded with other
    meanings. For example, the following stack, which is related to GPU reset:
    
    0  gfx_v9_0_cp_gfx_start
    1  gfx_v9_0_cp_gfx_resume
    2  gfx_v9_0_cp_resume
    3  gfx_v9_0_hw_init
    4  gfx_v9_0_resume
    5  amdgpu_device_ip_resume_phase2
    
    does the following:
            /* start the ring */
            gfx_v9_0_cp_gfx_start(adev);
            ring->sched.ready = true;
    
    The same approach is used for other ASICs as well:
    gfx_v8_0_cp_gfx_resume
    gfx_v10_0_kiq_resume, etc...
    
    As a result, our GPU reset test causes a GPU fault which unconditionally
    calls gfx_v9_0_fault and then drm_sched_fault. The outcome now depends on
    timing. If the interrupt service routine drm_sched_fault runs after
    gfx_v9_0_cp_gfx_start has completed, which sets the ready field of the
    scheduler to true even for uninitialized schedulers, an oops occurs. If
    the ISR drm_sched_fault completes before gfx_v9_0_cp_gfx_start, there is
    no fault and the NULL pointer dereference does not occur.
    
    Use the field timeout_wq to prevent the oops for uninitialized schedulers.
    The field may be initialized with the work queue of the reset domain.
    
    v1: Corrections to commit message (Luben)
    
    Fixes: 11b3b9f461c5c4 ("drm/sched: Check scheduler ready before calling timeout handling")
    Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
    Link: https://lore.kernel.org/r/20230510135111.58631-1-vitaly.prosyak@xxxxxxx
    Reviewed-by: Luben Tuikov <luben.tuikov@xxxxxxx>
    Signed-off-by: Luben Tuikov <luben.tuikov@xxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1e08cc5a17029..78c959eaef0c5 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
  */
 void drm_sched_fault(struct drm_gpu_scheduler *sched)
 {
-	if (sched->ready)
+	if (sched->timeout_wq)
 		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
 }
 EXPORT_SYMBOL(drm_sched_fault);
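
The effect of the one-line change can be illustrated with a small user-space
model. This is only a sketch, not kernel code: struct model_wq,
struct model_sched, enum fault_result, and the model_* functions are
hypothetical stand-ins invented for illustration, and the "would oops"
result simply represents the NULL pointer dereference described in the
commit message. It assumes that an uninitialized scheduler has a NULL
timeout_wq, which is what the new check relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel workqueue; it only counts queued work. */
struct model_wq {
	int queued;
};

/* Hypothetical model of the two scheduler fields discussed in the commit. */
struct model_sched {
	bool ready;                  /* set to true by the hardware resume path */
	struct model_wq *timeout_wq; /* assumed NULL until the scheduler is initialized */
};

enum fault_result { FAULT_IGNORED, FAULT_QUEUED, FAULT_WOULD_OOPS };

/* Pre-patch behaviour: the decision is gated on `ready`. */
static enum fault_result model_fault_old(struct model_sched *sched)
{
	if (!sched->ready)
		return FAULT_IGNORED;
	if (!sched->timeout_wq)
		return FAULT_WOULD_OOPS; /* models the NULL dereference from the report */
	sched->timeout_wq->queued++;
	return FAULT_QUEUED;
}

/* Post-patch behaviour: the decision is gated on `timeout_wq`. */
static enum fault_result model_fault_new(struct model_sched *sched)
{
	if (!sched->timeout_wq)
		return FAULT_IGNORED;
	sched->timeout_wq->queued++;
	return FAULT_QUEUED;
}

/* Walks through the uninitialized and initialized cases with both checks. */
static void model_selfcheck(void)
{
	struct model_wq wq = { 0 };
	/* Resume path already set ready, but the scheduler was never initialized. */
	struct model_sched uninit = { .ready = true, .timeout_wq = NULL };
	struct model_sched inited = { .ready = true, .timeout_wq = &wq };

	assert(model_fault_old(&uninit) == FAULT_WOULD_OOPS);
	assert(model_fault_new(&uninit) == FAULT_IGNORED);
	assert(model_fault_old(&inited) == FAULT_QUEUED);
	assert(model_fault_new(&inited) == FAULT_QUEUED);
	assert(wq.queued == 2);
}
```

In this model the two checks differ only for the scheduler whose ready flag
was set by the resume path while timeout_wq is still NULL, which is the case
the commit message describes.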
