Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Tue, 18 Jun 2024 17:30:11 +0530

On 6/18/2024 4:51 PM, Chai, Thomas wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> -----------------
> Best Regards,
> Thomas
> 
> -----Original Message-----
> From: Chai, Thomas
> Sent: Tuesday, June 18, 2024 7:09 PM
> To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>
> Subject: RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete
> 
> 
> 
> 
> -----------------
> Best Regards,
> Thomas
> 
> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> Sent: Tuesday, June 18, 2024 6:09 PM
> To: Chai, Thomas <YiPeng.Chai@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>
> Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete
> 
> 
> 
> On 6/18/2024 12:03 PM, YiPeng Chai wrote:
>> Add completion to wait for ras reset to complete.
>>
>> Signed-off-by: YiPeng Chai <YiPeng.Chai@xxxxxxx>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index 898889600771..7f8e6ca07957 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct ras_common_if
>> *ras_block)
>>
>>  #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
>>
>> +#define MAX_RAS_RECOVERY_COMPLETION_TIME  120000 //ms
>> +
>>  enum amdgpu_ras_retire_page_reservation {
>>       AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>>       AMDGPU_RAS_RETIRE_PAGE_PENDING,
>> @@ -2518,6 +2520,8 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>>               atomic_set(&hive->ras_recovery, 0);
>>               amdgpu_put_xgmi_hive(hive);
>>       }
>> +
>> +     complete_all(&ras->ras_recovery_completion);
>>  }
>>
>>  /* alloc/realloc bps array */
>> @@ -2911,10 +2915,16 @@ static int
>> amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>>
>>               flush_delayed_work(&con->page_retirement_dwork);
>>
>> +             reinit_completion(&con->ras_recovery_completion);
>> +
>>               con->gpu_reset_flags |= reset;
>>               amdgpu_ras_reset_gpu(adev);
>>
>>               *gpu_reset = reset;
>> +             if (!wait_for_completion_timeout(&con->ras_recovery_completion,
>> +                             msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
>> +                     dev_err(adev->dev, "Waiting for GPU to complete ras reset timeout! reset:0x%x\n",
>> +                             reset);
> 
>> If a mode-1 reset gets to execute first due to job timeout/hws detect cases in poison timeout, then the ras handler will never get executed.
>> Why this wait is required?
> 
>> Thanks,
>> Lijo
> 
> [Thomas]  "[PATCH 5/5] drm/amdgpu: add gpu reset check and exception handling" add the check before ras gpu reset.
>                 Poison ras reset is different from reset triggered by other fatal errors, and all poison RAS resets are triggered from here,
>              in order to distinguish other gpu resets and facilitate subsequent  code processing, so add wait for gpu ras reset here.
> 

Reset mechanism resets the GPU state - whether it's triggered due to
poison or fatal errors. As soon as the device is reset successfully, GPU
operations can continue. So why there needs to be a special wait for
poison triggred reset alone? Why not wait on the RAS recovery work
object rather than another completion notification?

Thanks,
Lijo

>>       }
>>
>>       return 0;
>> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
>>               }
>>       }
>>
>> +     init_completion(&con->ras_recovery_completion);
>>       mutex_init(&con->page_rsv_lock);
>>       INIT_KFIFO(con->poison_fifo);
>>       mutex_init(&con->page_retirement_lock);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> index 91daf48be03a..b47f03edac87 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> @@ -537,6 +537,7 @@ struct amdgpu_ras {
>>       DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
>>       struct ras_ecc_log_info  umc_ecc_log;
>>       struct delayed_work page_retirement_dwork;
>> +     struct completion ras_recovery_completion;
>>
>>       /* Fatal error detected flag */
>>       atomic_t fed;