RE: [PATCH] drm/amdgpu: add post reset IP callback

"Yu, Lang" <Lang.Yu@xxxxxxx> · Wed, 3 Apr 2024 07:31:04 +0000

[AMD Official Use Only - General]

>-----Original Message-----
>From: Sharma, Shashank <Shashank.Sharma@xxxxxxx>
>Sent: Wednesday, April 3, 2024 3:19 PM
>To: Yu, Lang <Lang.Yu@xxxxxxx>; Christian König
><ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian
><Christian.Koenig@xxxxxxx>
>Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>
>Hey Lang,
>
>On 03/04/2024 08:51, Yu, Lang wrote:
>> [AMD Official Use Only - General]
>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>> Sent: Tuesday, April 2, 2024 9:38 PM
>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian
>>> <Christian.Koenig@xxxxxxx>; Sharma, Shashank
>>> <Shashank.Sharma@xxxxxxx>
>>> Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>>>
>>> Am 28.03.24 um 05:40 schrieb Lang Yu:
>>>> There are use cases which need full GPU functionality (e.g., VM
>>>> update, TLB inavildate) when doing a GPU reset.
>>>>
>>>> Especially, the mes/umsch self tests which help validate the hw
>>>> state after hw init like ring/ib tests.
>>> I noted that before but just to repeat it once more: We can't do any
>>> MES or UMSCH validation while doing a GPU reset!
>> Yes, we can just easily disable it if it doesn't work well.
>> But it doesn't take too much effort to make it work.
>> It can expose issues as soon as possible and is useful for debugging
>purpose.
>IMO, its not that useful for debugging as well. In case of a problem, It will just
>timeout waiting for MES packet write and we will still have to find out the
>actual problem which caused MES to go into bad state in the last GPU reset.
>>
>>> The ring and IB tests use some pre-allocated memory we put aside for
>>> the task during driver load and so can execute during GPU reset as well.
>> If user space can create a VM and allocate memory during GPU reset, it
>> makes no sense to prevent kernel space from doing that.
>
>I think the objection here is mostly about why to do it at all, when it is not
>helpful. It would be just a maintenance overhead.

If you think it is not helpful,  why doing ring/ib tests?
I don't think such sanity test is not helpful.

I only talk UMSCH test(different with MES test) here,
I don't think it has a maintenance overhead.

Regards,
Lang

>- Shashank
>
>>> But for the MES/UMSCH we need a full blown environment with VM and
>>> submission infrastructure and setting that up isn't possible here.
>> At least for UMSCH test, it only uses VM mapping functionality (if we
>> can create a VM with cpu update mode, that's enough), it doesn't use
>> other submission functionality.
>> It is actually a compute context.
>>
>>
>> Regards,
>> Lang
>>
>>> Adding Shashank as well, but I think we should probably just
>>> completely remove those from the kernel.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Add a post reset IP callback to handle such use cases which will be
>>>> executed after GPU reset succeeds.
>>>>
>>>> Signed-off-by: Lang Yu <Lang.Yu@xxxxxxx>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24
>>> ++++++++++++++++++++++
>>>>    drivers/gpu/drm/amd/include/amd_shared.h   |  3 +++
>>>>    2 files changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 12dc71a6b5db..feeab9397aab 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -5556,6 +5556,27 @@ static int
>amdgpu_device_health_check(struct
>>> list_head *device_list_handle)
>>>>       return ret;
>>>>    }
>>>>
>>>> +static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev) {
>>>> +    uint32_t i;
>>>> +    int r;
>>>> +
>>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>>> +            if (!adev->ip_blocks[i].status.valid ||
>>>> +                !adev->ip_blocks[i].version->funcs->post_reset)
>>>> +                    continue;
>>>> +
>>>> +            r = adev->ip_blocks[i].version->funcs->post_reset(adev);
>>>> +            if (r) {
>>>> +                    DRM_ERROR("post reset of IP block <%s>
>>> failed %d\n",
>>>> +                              adev->ip_blocks[i].version->funcs->name, r);
>>>> +                    return r;
>>>> +            }
>>>> +    }
>>>> +
>>>> +    return r;
>>>> +}
>>>> +
>>>>    /**
>>>>     * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>>>>     *
>>>> @@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct
>>> amdgpu_device *adev,
>>>>               amdgpu_put_xgmi_hive(hive);
>>>>       }
>>>>
>>>> +    if (!r && !job_signaled)
>>>> +            r = amdgpu_device_ip_post_reset(adev);
>>>> +
>>>>       if (r)
>>>>               dev_info(adev->dev, "GPU reset end with ret = %d\n",
>>>> r);
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h
>>>> b/drivers/gpu/drm/amd/include/amd_shared.h
>>>> index b0a6256e89f4..33ce30a8e3ab 100644
>>>> --- a/drivers/gpu/drm/amd/include/amd_shared.h
>>>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>>>> @@ -287,6 +287,7 @@ enum amd_dpm_forced_level;
>>>>     * @pre_soft_reset: pre soft reset the IP block
>>>>     * @soft_reset: soft reset the IP block
>>>>     * @post_soft_reset: post soft reset the IP block
>>>> + * @post_reset: handles IP specific post reset stuff(e.g., self
>>>> + test)
>>>>     * @set_clockgating_state: enable/disable cg for the IP block
>>>>     * @set_powergating_state: enable/disable pg for the IP block
>>>>     * @get_clockgating_state: get current clockgating status @@
>>>> -316,11
>>>> +317,13 @@ struct amd_ip_funcs {
>>>>       int (*pre_soft_reset)(void *handle);
>>>>       int (*soft_reset)(void *handle);
>>>>       int (*post_soft_reset)(void *handle);
>>>> +    int (*post_reset)(void *handle);
>>>>       int (*set_clockgating_state)(void *handle,
>>>>                                    enum amd_clockgating_state state);
>>>>       int (*set_powergating_state)(void *handle,
>>>>                                    enum amd_powergating_state state);
>>>>       void (*get_clockgating_state)(void *handle, u64 *flags);
>>>> +
>>>>    };
>>>>
>>>>