RE: [PATCH] drm/amdgpu: Clear VCN cache when hw_init

"Chen, Horace" <Horace.Chen@xxxxxxx> · Thu, 6 Jul 2023 08:56:25 +0000

[AMD Official Use Only - General]

Hi Christian & Leo,

Sorry, missed Leo's email.

Currently SR-IOV VF doesn't have suspend/resume use case and this code is in vcn_v4_0_start_sriov() which will only called by SR-IOV code path. So currently we will not hit the suspend/resume issue even with this patch.

About why this patch is necessary:
    Because SR-IOV uses FLR instead of mode 1 reset as the default reset method, and FLR will not reset VRAM during the common reset. So if there is some bad data in the cache, it will cause VF continues to fail after reset.  We have some test case(such as data poison) which may hit this issue.

If want to be safer in case we may add suspend/resume case in the future:
    how about add a "if (amdgpu_in_reset(adev))" check to make sure we only clear cache in the reset domain?

Thanks & Regards,
Horace.

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Thursday, July 6, 2023 3:24 PM
To: Liu, Leo <Leo.Liu@xxxxxxx>; Chen, Horace <Horace.Chen@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Quan, Evan <Evan.Quan@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Xiao, Jack <Jack.Xiao@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Xu, Feifei <Feifei.Xu@xxxxxxx>; Chang, HaiJun <HaiJun.Chang@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Clear VCN cache when hw_init

Since this patch was already pushed please revert it until we have a technical consent on why this should be necessary.

Regards,
Christian.

Am 05.07.23 um 21:57 schrieb Leo Liu:
> What Christian says is correct, esp. during the playback or encode,
> when suspend/resume happens, it will save the FW context, and after
> resume, it will continue the job to where it left during the suspend.
> Will this apply to SRIOV case? Since the changes only within the SRIOV
> code, please make sure that also please specify the SRIOV from your
> patch subject and commit message.
>
> Regards,
>
> Leo
>
>
> On 2023-06-30 07:38, Christian König wrote:
>> Am 20.06.23 um 15:29 schrieb Horace Chen:
>>> [Why]
>>> VCN will use some framebuffer space as its cache. It needs to be
>>> reset when reset happens, such as FLR. Otherwise some error may be
>>> kept after the reset.
>>
>> Well this doesn't make sense at all.
>>
>> The full content of adev->vcn.inst[i].cpu_addr is saved and restored
>> during suspend/resume and IIRC GPU resets as well.
>>
>> See functions amdgpu_vcn_suspend() and amdgpu_vcn_resume().
>>
>> Please let Leo's team take a look at this and review the change
>> before it is committed.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Horace Chen <horace.chen@xxxxxxx>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>> index b48bb5212488..2db73a964031 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>> @@ -1292,6 +1292,7 @@ static int vcn_v4_0_start_sriov(struct
>>> amdgpu_device *adev)
>>>               cache_size);
>>>             cache_addr = adev->vcn.inst[i].gpu_addr + offset;
>>> +        memset(adev->vcn.inst[i].cpu_addr + offset, 0,
>>> AMDGPU_VCN_STACK_SIZE);
>>>           MMSCH_V4_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCN, i,
>>>               regUVD_LMI_VCPU_CACHE1_64BIT_BAR_LOW),
>>>               lower_32_bits(cache_addr)); @@ -1307,6 +1308,8 @@
>>> static int vcn_v4_0_start_sriov(struct amdgpu_device *adev)
>>>             cache_addr = adev->vcn.inst[i].gpu_addr + offset +
>>>               AMDGPU_VCN_STACK_SIZE;
>>> +        memset(adev->vcn.inst[i].cpu_addr + offset +
>>> AMDGPU_VCN_STACK_SIZE, 0,
>>> +            AMDGPU_VCN_STACK_SIZE);
>>>           MMSCH_V4_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCN, i,
>>>               regUVD_LMI_VCPU_CACHE2_64BIT_BAR_LOW),
>>>               lower_32_bits(cache_addr));
>>