[AMD Official Use Only - General] >-----Original Message----- >From: Sharma, Shashank <Shashank.Sharma@xxxxxxx> >Sent: Wednesday, April 3, 2024 3:19 PM >To: Yu, Lang <Lang.Yu@xxxxxxx>; Christian König ><ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian ><Christian.Koenig@xxxxxxx> >Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback > >Hey Lang, > >On 03/04/2024 08:51, Yu, Lang wrote: >> [AMD Official Use Only - General] >> >>> -----Original Message----- >>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> >>> Sent: Tuesday, April 2, 2024 9:38 PM >>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian >>> <Christian.Koenig@xxxxxxx>; Sharma, Shashank >>> <Shashank.Sharma@xxxxxxx> >>> Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback >>> >>> Am 28.03.24 um 05:40 schrieb Lang Yu: >>>> There are use cases which need full GPU functionality (e.g., VM >>>> update, TLB inavildate) when doing a GPU reset. >>>> >>>> Especially, the mes/umsch self tests which help validate the hw >>>> state after hw init like ring/ib tests. >>> I noted that before but just to repeat it once more: We can't do any >>> MES or UMSCH validation while doing a GPU reset! >> Yes, we can just easily disable it if it doesn't work well. >> But it doesn't take too much effort to make it work. >> It can expose issues as soon as possible and is useful for debugging >purpose. >IMO, its not that useful for debugging as well. In case of a problem, It will just >timeout waiting for MES packet write and we will still have to find out the >actual problem which caused MES to go into bad state in the last GPU reset. >> >>> The ring and IB tests use some pre-allocated memory we put aside for >>> the task during driver load and so can execute during GPU reset as well. >> If user space can create a VM and allocate memory during GPU reset, it >> makes no sense to prevent kernel space from doing that. > >I think the objection here is mostly about why to do it at all, when it is not >helpful. It would be just a maintenance overhead. If you think it is not helpful, why doing ring/ib tests? I don't think such sanity test is not helpful. I only talk UMSCH test(different with MES test) here, I don't think it has a maintenance overhead. Regards, Lang >- Shashank > >>> But for the MES/UMSCH we need a full blown environment with VM and >>> submission infrastructure and setting that up isn't possible here. >> At least for UMSCH test, it only uses VM mapping functionality (if we >> can create a VM with cpu update mode, that's enough), it doesn't use >> other submission functionality. >> It is actually a compute context. >> >> >> Regards, >> Lang >> >>> Adding Shashank as well, but I think we should probably just >>> completely remove those from the kernel. >>> >>> Regards, >>> Christian. >>> >>>> Add a post reset IP callback to handle such use cases which will be >>>> executed after GPU reset succeeds. >>>> >>>> Signed-off-by: Lang Yu <Lang.Yu@xxxxxxx> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 >>> ++++++++++++++++++++++ >>>> drivers/gpu/drm/amd/include/amd_shared.h | 3 +++ >>>> 2 files changed, 27 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> index 12dc71a6b5db..feeab9397aab 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> @@ -5556,6 +5556,27 @@ static int >amdgpu_device_health_check(struct >>> list_head *device_list_handle) >>>> return ret; >>>> } >>>> >>>> +static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev) { >>>> + uint32_t i; >>>> + int r; >>>> + >>>> + for (i = 0; i < adev->num_ip_blocks; i++) { >>>> + if (!adev->ip_blocks[i].status.valid || >>>> + !adev->ip_blocks[i].version->funcs->post_reset) >>>> + continue; >>>> + >>>> + r = adev->ip_blocks[i].version->funcs->post_reset(adev); >>>> + if (r) { >>>> + DRM_ERROR("post reset of IP block <%s> >>> failed %d\n", >>>> + adev->ip_blocks[i].version->funcs->name, r); >>>> + return r; >>>> + } >>>> + } >>>> + >>>> + return r; >>>> +} >>>> + >>>> /** >>>> * amdgpu_device_gpu_recover - reset the asic and recover scheduler >>>> * >>>> @@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct >>> amdgpu_device *adev, >>>> amdgpu_put_xgmi_hive(hive); >>>> } >>>> >>>> + if (!r && !job_signaled) >>>> + r = amdgpu_device_ip_post_reset(adev); >>>> + >>>> if (r) >>>> dev_info(adev->dev, "GPU reset end with ret = %d\n", >>>> r); >>>> >>>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h >>>> b/drivers/gpu/drm/amd/include/amd_shared.h >>>> index b0a6256e89f4..33ce30a8e3ab 100644 >>>> --- a/drivers/gpu/drm/amd/include/amd_shared.h >>>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h >>>> @@ -287,6 +287,7 @@ enum amd_dpm_forced_level; >>>> * @pre_soft_reset: pre soft reset the IP block >>>> * @soft_reset: soft reset the IP block >>>> * @post_soft_reset: post soft reset the IP block >>>> + * @post_reset: handles IP specific post reset stuff(e.g., self >>>> + test) >>>> * @set_clockgating_state: enable/disable cg for the IP block >>>> * @set_powergating_state: enable/disable pg for the IP block >>>> * @get_clockgating_state: get current clockgating status @@ >>>> -316,11 >>>> +317,13 @@ struct amd_ip_funcs { >>>> int (*pre_soft_reset)(void *handle); >>>> int (*soft_reset)(void *handle); >>>> int (*post_soft_reset)(void *handle); >>>> + int (*post_reset)(void *handle); >>>> int (*set_clockgating_state)(void *handle, >>>> enum amd_clockgating_state state); >>>> int (*set_powergating_state)(void *handle, >>>> enum amd_powergating_state state); >>>> void (*get_clockgating_state)(void *handle, u64 *flags); >>>> + >>>> }; >>>> >>>>