TDR and VRAM lost handling in KMD (v2)

christian.koenig@xxxxxxx (Christian König) · Thu, 12 Oct 2017 12:13:05 +0200

Am 12.10.2017 um 11:44 schrieb Nicolai HÃ¤hnle:
> On 12.10.2017 11:35, Michel DÃ¤nzer wrote:
>> On 12/10/17 11:23 AM, Christian KÃ¶nig wrote:
>>> Am 12.10.2017 um 11:10 schrieb Nicolai HÃ¤hnle:
>>>> On 12.10.2017 10:49, Christian KÃ¶nig wrote:
>>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter does
>>>>>> not imply that the context was lost.
>>>>>>
>>>>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
>>>>>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>>>>
>>>>>> As far as I understand it, the case of !guilty && ctx->reset_counter
>>>>>> != adev->reset_counter && ctx->vram_lost_counter ==
>>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>>>>>> GPU reset occurred, but it didn't affect our context.
>>>>> I disagree on that.
>>>>>
>>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>>>>> was a reset but we haven't been causing it.
>>>>>
>>>>> That the OpenGL extension is specified otherwise is unfortunate, but
>>>>> I think we shouldn't use that for the kernel interface here.
>>>> Two counterpoints:
>>>>
>>>> 1. Why should any application care that there was a reset while it was
>>>> idle? The OpenGL behavior is what makes sense.
>>>
>>> The application is certainly not interest if a reset happened or not,
>>> but I though that the driver stack might be.
>>>
>>>>
>>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>>>> because we never return it :)
>>>>
>>>
>>> Good point.
>>>
>>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is
>>>> in line with the OpenGL spec: we're conservatively returning that a
>>>> reset happened because we don't know whether the context was affected,
>>>> and we return UNKNOWN because we also don't know whether the context
>>>> was guilty or not.
>>>>
>>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>>>> refinement and improvement of the current, overly conservative 
>>>> behavior.
>>>
>>> Ok let's reenumerate what I think the different return values should 
>>> mean:
>>>
>>> * AMDGPU_CTX_GUILTY_RESET
>>>
>>> guilty is set to true for this context.
>>>
>>> * AMDGPU_CTX_INNOCENT_RESET
>>>
>>> guilty is not set and vram_lost_counter has changed.
>>>
>>> * AMDGPU_CTX_UNKNOWN_RESET
>>>
>>> guilty is not set and vram_lost_counter has not changed, but
>>> gpu_reset_counter has changed.
>>
>> I don't understand the distinction you're proposing between
>> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
>> cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET,
>> if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if 
>> it's not.
>
> I think it'd make more sense if it was called 
> "AMDGPU_CTX_UNAFFECTED_RESET".
>
> So:
> - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and 
> we know that it's the context's fault
> - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset, 
> and we know that it *wasn't* the context's fault (no context job active)
> - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset, 
> and we don't know who's responsible (this could be returned in the 
> unlikely case where context A's gfx job has not yet finished, but 
> context B's gfx job has already started; it could be the fault of A, 
> it could be the fault of B -- which somehow manages to hang a part of 
> the hardware that then prevents A's job from finishing -- or it could 
> be both; but it's a bit academic)
> - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context 
> wasn't affected
>
> This last value would currently just be discarded by Mesa (because we 
> should only disturb applications when we have to), but perhaps 
> somebody else could find it useful?

Yes, that's what I had in mind as well.

Cause otherwise we would return AMDGPU_CTX_NO_RESET while there actually 
was a reset and that certainly doesn't sound correct to me.

Regards,
Christian.

>
> Cheers,
> Nicolai