Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle: > On 12.10.2017 11:35, Michel Dänzer wrote: >> On 12/10/17 11:23 AM, Christian König wrote: >>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle: >>>> On 12.10.2017 10:49, Christian König wrote: >>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter does >>>>>> not imply that the context was lost. >>>>>> >>>>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET >>>>>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter. >>>>>> >>>>>> As far as I understand it, the case of !guilty && ctx->reset_counter >>>>>> != adev->reset_counter && ctx->vram_lost_counter == >>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a >>>>>> GPU reset occurred, but it didn't affect our context. >>>>> I disagree on that. >>>>> >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there >>>>> was a reset but we haven't been causing it. >>>>> >>>>> That the OpenGL extension is specified otherwise is unfortunate, but >>>>> I think we shouldn't use that for the kernel interface here. >>>> Two counterpoints: >>>> >>>> 1. Why should any application care that there was a reset while it was >>>> idle? The OpenGL behavior is what makes sense. >>> >>> The application is certainly not interest if a reset happened or not, >>> but I though that the driver stack might be. >>> >>>> >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today >>>> because we never return it :) >>>> >>> >>> Good point. >>> >>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is >>>> in line with the OpenGL spec: we're conservatively returning that a >>>> reset happened because we don't know whether the context was affected, >>>> and we return UNKNOWN because we also don't know whether the context >>>> was guilty or not. >>>> >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty && >>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a >>>> refinement and improvement of the current, overly conservative >>>> behavior. >>> >>> Ok let's reenumerate what I think the different return values should >>> mean: >>> >>> * AMDGPU_CTX_GUILTY_RESET >>> >>> guilty is set to true for this context. >>> >>> * AMDGPU_CTX_INNOCENT_RESET >>> >>> guilty is not set and vram_lost_counter has changed. >>> >>> * AMDGPU_CTX_UNKNOWN_RESET >>> >>> guilty is not set and vram_lost_counter has not changed, but >>> gpu_reset_counter has changed. >> >> I don't understand the distinction you're proposing between >> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both >> cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET, >> if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if >> it's not. > > I think it'd make more sense if it was called > "AMDGPU_CTX_UNAFFECTED_RESET". > > So: > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and > we know that it's the context's fault > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset, > and we know that it *wasn't* the context's fault (no context job active) > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset, > and we don't know who's responsible (this could be returned in the > unlikely case where context A's gfx job has not yet finished, but > context B's gfx job has already started; it could be the fault of A, > it could be the fault of B -- which somehow manages to hang a part of > the hardware that then prevents A's job from finishing -- or it could > be both; but it's a bit academic) > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context > wasn't affected > > This last value would currently just be discarded by Mesa (because we > should only disturb applications when we have to), but perhaps > somebody else could find it useful? Yes, that's what I had in mind as well. Cause otherwise we would return AMDGPU_CTX_NO_RESET while there actually was a reset and that certainly doesn't sound correct to me. Regards, Christian. > > Cheers, > Nicolai