[AMD Official Use Only - General]

Hi Andrey,

The problem with status.hang is that it is set in amdgpu_device_ip_check_soft_reset, which is not implemented for nv or gfx10. Those implementations would have to be added properly first.

Another option I considered is to mark status.hang, or to add a flag to amdgpu_gfx, when a job timeout is reported on a gfx/compute ring. That would require some logic to map rings to their IP blocks, so it does not look good either.

Thanks,
Victor

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Sent: Wednesday, July 27, 2022 12:57 AM
To: Zhao, Victor <Victor.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Deng, Emily <Emily.Deng@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH 5/5] drm/amdgpu: reduce reset time

On 2022-07-26 05:40, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Hi Andrey,
>
> Reply inline.
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
> Sent: Tuesday, July 26, 2022 5:18 AM
> To: Zhao, Victor <Victor.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Deng, Emily <Emily.Deng@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
> Subject: Re: [PATCH 5/5] drm/amdgpu: reduce reset time
>
>
> On 2022-07-22 03:34, Victor Zhao wrote:
>> In multi container use case, reset time is important, so skip ring
>> tests and cp halt wait during ip suspending for reset as they are
>> going to fail and cost more time on reset
>
> Why are they failing in this case? Skipping ring tests is not the best idea, as you lose an important indicator of the system's sanity. Is there any way to make them work?
>
> [Victor]: I've seen the gfx ring test fail every time after a gfx engine hang. I thought that was expected, as gfx is in a bad state. Do you know the reason we run ring tests before reset, given that we are going to reset the ASIC anyway?
> Another approach could be to make the skip mode2-only, or to reduce the wait time here.

I dug through the history, and according to commit 'drm/amdgpu:unmap KCQ in gfx hw_fini(v2)' you need the write to the scratch register to confirm completion of the queue unmap operation, so you definitely don't want to just skip it. I agree it is pointless when the ring is hung, but remember that a GPU reset can happen not only for a hung ring but for other reasons (RAS, manual reset, etc.), in which case you probably want to shut down gracefully here. I see we have the adev->ip_blocks[i].status.hang flag, which you could maybe use here instead?

>
>
>> Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 26 +++++++++++++++----------
>>   2 files changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index 222d3d7ea076..f872495ccc3a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -477,7 +477,7 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device *adev)
>>   		kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.compute_ring[i],
>>   					   RESET_QUEUES, 0, 0);
>>
>> -	if (adev->gfx.kiq.ring.sched.ready)
>> +	if (adev->gfx.kiq.ring.sched.ready && !amdgpu_in_reset(adev))
>>   		r = amdgpu_ring_test_helper(kiq_ring);
>>   	spin_unlock(&adev->gfx.kiq.ring_lock);
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> index fafbad3cf08d..9ae29023e38f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>> @@ -5971,16 +5971,19 @@ static int gfx_v10_0_cp_gfx_enable(struct amdgpu_device *adev, bool enable)
>>   		WREG32_SOC15(GC, 0, mmCP_ME_CNTL, tmp);
>>   	}
>>
>> -	for (i = 0; i < adev->usec_timeout; i++) {
>> -		if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
>> -			break;
>> -		udelay(1);
>> -	}
>> -
>> -	if (i >= adev->usec_timeout)
>> -		DRM_ERROR("failed to %s cp gfx\n", enable ? "unhalt" : "halt");
>> +	if (!amdgpu_in_reset(adev)) {
>> +		for (i = 0; i < adev->usec_timeout; i++) {
>> +			if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
>> +				break;
>> +			udelay(1);
>> +		}
>>
>> +		if (i >= adev->usec_timeout)
>> +			DRM_ERROR("failed to %s cp gfx\n",
>> +				  enable ? "unhalt" : "halt");
>> +	}
>>   	return 0;
>> +
>>   }
>
> This change has impact beyond the container case, no? We had no issues with this code during regular reset cases, so why would we give up on code that confirms the CP is idle? What is the side effect of skipping this during all GPU resets?
>
> Andrey
>
> [Victor]: I see "failed to halt cp gfx" in regular reset cases as well when doing a gfx hang test using quark. I haven't seen a side effect with Mode1 reset yet, but maybe shortening the wait time would be better?

Same as above, I guess: it would indeed time out for a hung ring, but GPU resets happen not only because of hung rings but for other reasons as well.

Andrey

>
>>
>>   static int gfx_v10_0_cp_gfx_load_pfp_microcode(struct amdgpu_device *adev)
>> @@ -7569,8 +7572,10 @@ static int gfx_v10_0_kiq_disable_kgq(struct amdgpu_device *adev)
>>   	for (i = 0; i < adev->gfx.num_gfx_rings; i++)
>>   		kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.gfx_ring[i],
>>   					   PREEMPT_QUEUES, 0, 0);
>> -
>> -	return amdgpu_ring_test_helper(kiq_ring);
>> +	if (!amdgpu_in_reset(adev))
>> +		return amdgpu_ring_test_helper(kiq_ring);
>> +	else
>> +		return 0;
>>   }
>>   #endif
>>
>> @@ -7610,6 +7615,7 @@ static int gfx_v10_0_hw_fini(void *handle)
>>
>>   		return 0;
>>   	}
>> +
>>   	gfx_v10_0_cp_enable(adev, false);
>>   	gfx_v10_0_enable_gui_idle_interrupt(adev, false);
>>