RE: [PATCH] drm/amdgpu: add post reset IP callback

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[AMD Official Use Only - General]

>-----Original Message-----
>From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>Sent: Tuesday, April 2, 2024 9:38 PM
>To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian
><Christian.Koenig@xxxxxxx>; Sharma, Shashank
><Shashank.Sharma@xxxxxxx>
>Subject: Re: [PATCH] drm/amdgpu: add post reset IP callback
>
>Am 28.03.24 um 05:40 schrieb Lang Yu:
>> There are use cases which need full GPU functionality (e.g., VM
>> update, TLB inavildate) when doing a GPU reset.
>>
>> Especially, the mes/umsch self tests which help validate the hw state
>> after hw init like ring/ib tests.
>
>I noted that before but just to repeat it once more: We can't do any MES or
>UMSCH validation while doing a GPU reset!

Yes, we can just easily disable it if it doesn't work well.
But it doesn't take too much effort to make it work.
It can expose issues as soon as possible and is useful for debugging purpose.

>The ring and IB tests use some pre-allocated memory we put aside for the
>task during driver load and so can execute during GPU reset as well.

If user space can create a VM and allocate memory during GPU reset,
it makes no sense to prevent kernel space from doing that.

>But for the MES/UMSCH we need a full blown environment with VM and
>submission infrastructure and setting that up isn't possible here.

At least for UMSCH test, it only uses VM mapping functionality
(if we can create a VM with cpu update mode, that's enough),
it doesn't use other submission functionality.
It is actually a compute context.


Regards,
Lang

>Adding Shashank as well, but I think we should probably just completely
>remove those from the kernel.
>
>Regards,
>Christian.
>
>>
>> Add a post reset IP callback to handle such use cases which will be
>> executed after GPU reset succeeds.
>>
>> Signed-off-by: Lang Yu <Lang.Yu@xxxxxxx>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24
>++++++++++++++++++++++
>>   drivers/gpu/drm/amd/include/amd_shared.h   |  3 +++
>>   2 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 12dc71a6b5db..feeab9397aab 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5556,6 +5556,27 @@ static int amdgpu_device_health_check(struct
>list_head *device_list_handle)
>>      return ret;
>>   }
>>
>> +static int amdgpu_device_ip_post_reset(struct amdgpu_device *adev) {
>> +    uint32_t i;
>> +    int r;
>> +
>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>> +            if (!adev->ip_blocks[i].status.valid ||
>> +                !adev->ip_blocks[i].version->funcs->post_reset)
>> +                    continue;
>> +
>> +            r = adev->ip_blocks[i].version->funcs->post_reset(adev);
>> +            if (r) {
>> +                    DRM_ERROR("post reset of IP block <%s>
>failed %d\n",
>> +                              adev->ip_blocks[i].version->funcs->name, r);
>> +                    return r;
>> +            }
>> +    }
>> +
>> +    return r;
>> +}
>> +
>>   /**
>>    * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>>    *
>> @@ -5805,6 +5826,9 @@ int amdgpu_device_gpu_recover(struct
>amdgpu_device *adev,
>>              amdgpu_put_xgmi_hive(hive);
>>      }
>>
>> +    if (!r && !job_signaled)
>> +            r = amdgpu_device_ip_post_reset(adev);
>> +
>>      if (r)
>>              dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>
>> diff --git a/drivers/gpu/drm/amd/include/amd_shared.h
>> b/drivers/gpu/drm/amd/include/amd_shared.h
>> index b0a6256e89f4..33ce30a8e3ab 100644
>> --- a/drivers/gpu/drm/amd/include/amd_shared.h
>> +++ b/drivers/gpu/drm/amd/include/amd_shared.h
>> @@ -287,6 +287,7 @@ enum amd_dpm_forced_level;
>>    * @pre_soft_reset: pre soft reset the IP block
>>    * @soft_reset: soft reset the IP block
>>    * @post_soft_reset: post soft reset the IP block
>> + * @post_reset: handles IP specific post reset stuff(e.g., self test)
>>    * @set_clockgating_state: enable/disable cg for the IP block
>>    * @set_powergating_state: enable/disable pg for the IP block
>>    * @get_clockgating_state: get current clockgating status @@ -316,11
>> +317,13 @@ struct amd_ip_funcs {
>>      int (*pre_soft_reset)(void *handle);
>>      int (*soft_reset)(void *handle);
>>      int (*post_soft_reset)(void *handle);
>> +    int (*post_reset)(void *handle);
>>      int (*set_clockgating_state)(void *handle,
>>                                   enum amd_clockgating_state state);
>>      int (*set_powergating_state)(void *handle,
>>                                   enum amd_powergating_state state);
>>      void (*get_clockgating_state)(void *handle, u64 *flags);
>> +
>>   };
>>
>>





[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux