Re: [PATCH] drm/amdgpu: Check pending job finished or not to identify has bad job

Christian König <christian.koenig@xxxxxxx> · Wed, 13 Nov 2024 10:22:43 +0100



    Hi guys,

    
    can you please explain to me why it's always you guys who come up
    with such nonsense?

    
    If you need to find the number of ongoing hardware submissions,
    please use the amdgpu_fence_count_emitted() function instead of
    messing with scheduler internals.
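
A rough user-space sketch of that suggestion, not the driver code: it asks each ring how many fences are still outstanding instead of looking at the scheduler's pending list. `struct mock_ring`, `mock_fence_count_emitted()` and `mock_has_job_running()` are hypothetical stand-ins invented for illustration, and modelling amdgpu_fence_count_emitted() as "last emitted sequence number minus last signalled sequence number" is an assumption about its behaviour. The ring skip condition is also assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-ring fence bookkeeping. */
struct mock_ring {
    bool     ready;         /* ring is initialised and usable (assumed check) */
    uint32_t last_seq;      /* last fence sequence number signalled by hardware */
    uint32_t sync_seq;      /* last fence sequence number emitted to hardware */
    unsigned pending_jobs;  /* jobs still on the scheduler pending list, not yet freed */
};

/*
 * Assumed model of amdgpu_fence_count_emitted(): fences emitted but not yet
 * signalled. Unsigned 32-bit subtraction keeps the result correct across
 * sequence number wrap-around.
 */
static unsigned int mock_fence_count_emitted(const struct mock_ring *ring)
{
    return (uint32_t)(ring->sync_seq - ring->last_seq);
}

/*
 * Returns true if any usable ring still has work outstanding in hardware.
 * pending_jobs is deliberately ignored: a job that has finished in hardware
 * but has not been freed yet does not count as running.
 */
static bool mock_has_job_running(struct mock_ring *rings[], size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        struct mock_ring *ring = rings[i];

        if (!ring || !ring->ready)
            continue;

        if (mock_fence_count_emitted(ring) != 0)
            return true;
    }
    return false;
}
```

In this model, a ring with `last_seq == sync_seq` is reported as idle even while `pending_jobs` is non-zero, which is the case the quoted patch tries to handle by waiting on each pending job.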

    
    This patch here is a clear NAK from my side.

    
    Regards,

    Christian.

    
    On 13.11.24 at 09:46, Fan, Shikang wrote:

    
        [AMD Official Use Only - AMD Internal Distribution Only]

      
          +@Koenig, Christian

          
          Hi Christian,

          
          Could you please help review this patch? Thank you.

          
          Regards,
        
          Shikang
        
        From: Shikang Fan <shikang.fan@xxxxxxx>
        Sent: Wednesday, November 13, 2024 11:14 AM
        To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
        Cc: Fan, Shikang <Shikang.Fan@xxxxxxx>; Liu01, Tong (Esther) <Tong.Liu01@xxxxxxx>; Deng, Emily <Emily.Deng@xxxxxxx>
        Subject: [PATCH] drm/amdgpu: Check pending job finished or not to identify has bad job
           
        
                drm_sched_free_job_work is a queued work function, so even
                after a job has finished in hardware, it still takes some time
                for drm_sched_free_job_work to remove it from the pending list.

                Iterate over the pending job list and wait for each job to
                finish within a specified timeout (1s by default), so that jobs
                which have not been cleaned up yet or are about to finish are
                not reported as running.

                If the wait times out, return true.

                
                Signed-off-by: Tong Liu01 <Tong.Liu01@xxxxxxx>
                Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
                Signed-off-by: Shikang Fan <shikang.fan@xxxxxxx>
                ---
                 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 ++++++++++++++++-----
                 1 file changed, 16 insertions(+), 5 deletions(-)

                diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                index 071d3d9b345d..da2a22618f42 100644
                --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
                @@ -100,6 +100,7 @@ MODULE_FIRMWARE("amdgpu/navi12_gpu_info.bin");
                 #define AMDGPU_PCIE_INDEX_FALLBACK (0x38 >> 2)
                 #define AMDGPU_PCIE_INDEX_HI_FALLBACK (0x44 >> 2)
                 #define AMDGPU_PCIE_DATA_FALLBACK (0x3C >> 2)
                +#define AMDGPU_PENDING_JOB_TIMEOUT     msecs_to_jiffies(1000)
                 
                 static const struct drm_driver amdgpu_kms_driver;
                 
                @@ -5224,7 +5225,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
                 bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
                 {
                         int i;
                -       struct drm_sched_job *job;
                +       struct drm_sched_job *job, *tmp;
                +       long r;
                 
                         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
                                 struct amdgpu_ring *ring = adev->rings[i];
                @@ -5233,11 +5235,20 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
                                         continue;
                 
                                 spin_lock(&ring->sched.job_list_lock);
                -               job = list_first_entry_or_null(&ring->sched.pending_list,
                -                                              struct drm_sched_job, list);
                +
                +               /* iterates over the pending job list
                +                * wait for each job to finish within timeout (1s by default)
                +                * if wait timeout, return true
                +                */
                +               list_for_each_entry_safe(job, tmp, &ring->sched.pending_list, list) {
                +                       r = dma_fence_wait_timeout(&job->s_fence->finished,
                +                                                               false, AMDGPU_PENDING_JOB_TIMEOUT);
                +                       if (r <= 0) {
                +                               spin_unlock(&ring->sched.job_list_lock);
                +                               return true;
                +                       }
                +               }
                                 spin_unlock(&ring->sched.job_list_lock);
                -               if (job)
                -                       return true;
                         }
                         return false;
                 }
                -- 
                2.34.1