Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Thu, 19 May 2022 09:58:05 +0200



    Am 18.05.22 um 16:24 schrieb Andrey
      Grodzovsky:

    
      On 2022-05-18 02:07, Christian König
        wrote:

      
      Am
        17.05.22 um 21:20 schrieb Andrey Grodzovsky: 

        Problem: 

          During hive reset caused by command timing out on a ring 

          extra resets are generated by triggered by KFD which is 

          unable to accesses registers on the resetting ASIC. 

          
          Fix: Rework GPU reset to actively stop any pending reset 

          works while another in progress. 

          
          v2: Switch from generic list as was in v1[1] to eplicit 

          stopping of each reset request from each reset source 

          per each request submitter. 

        
        Looks mostly good to me. 

        
        Apart from the naming nit pick on patch #1 the only thing I
        couldn't of hand figure out is why you are using a delayed work
        everywhere instead of a just a work item. 

        
        That needs a bit further explanation what's happening here. 

        
        Christian. 

      
      Check APIs for cancelling work vs. delayed work -
      For work_struct the only public API is this - https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214
        - blocking cancel.
      For delayed_work we have both blocking and non blocking public
        APIs - 

      
      https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
      https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
      I prefer not to go now into convincing core kernel people of
        exposing another interface for our own sake - from my past
        experience API changes in core code has slim chances and a lot
        of time spent on back and forth arguments.
      "If the mountain will
          not come to Muhammad, then Muhammad must go to the mountain"
          ;)

        
    Ah, good point. The cancel_work() function was removed a few years
    ago:

    
    commit 6417250d3f894e66a68ba1cd93676143f2376a6f

    Author: Stephen Hemminger <stephen@xxxxxxxxxxxxxxxxxx>

    Date:   Tue Mar 6 19:34:42 2018 -0800

    
        workqueue: remove unused cancel_work()

        
        Found this by accident.

        There are no usages of bare cancel_work() in current kernel
    source.

        
        Signed-off-by: Stephen Hemminger
    <stephen@xxxxxxxxxxxxxxxxxx>

        Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>

    
    Maybe just revert that patch, export the function and use it. I
    think there is plenty of justification for this.

    
    Thanks,

    Christian.

    
      Andrey
       

          [1] -
          https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@xxxxxxx/

          
          Andrey Grodzovsky (7): 

             drm/amdgpu: Cache result of last reset at reset domain
          level. 

             drm/amdgpu: Switch to delayed work from work_struct. 

             drm/admgpu: Serialize RAS recovery work directly into reset
          domain 

               queue. 

             drm/amdgpu: Add delayed work for GPU reset from debugfs 

             drm/amdgpu: Add delayed work for GPU reset from kfd. 

             drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to 

               amdgpu_device_gpu_recover 

             drm/amdgpu: Stop any pending reset if another in progress.
          

            drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +- 

            drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++- 

            drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 + 

            drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62
          +++++++++++----------- 

            drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++- 

            drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +- 

            drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++-- 

            drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +- 

            drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 + 

            drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +- 

            drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +- 

            drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +-- 

            drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +-- 

            drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +-- 

            14 files changed, 87 insertions(+), 54 deletions(-)