Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Michel Dänzer <michel@xxxxxxxxxxx> · Mon, 15 Jan 2024 18:09:48 +0100

On 2024-01-15 17:46, Joshua Ashton wrote:
> On 1/15/24 16:34, Michel Dänzer wrote:
>> On 2024-01-15 17:19, Friedrich Vock wrote:
>>> On 15.01.24 16:43, Joshua Ashton wrote:
>>>> On 1/15/24 15:25, Michel Dänzer wrote:
>>>>> On 2024-01-15 14:17, Christian König wrote:
>>>>>> Am 15.01.24 um 12:37 schrieb Joshua Ashton:
>>>>>>> On 1/15/24 09:40, Christian König wrote:
>>>>>>>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>>>>>>>>
>>>>>>>>> Without this feedback, the application may keep pushing through
>>>>>>>>> the soft
>>>>>>>>> recoveries, continually hanging the system with jobs that timeout.
>>>>>>>>
>>>>>>>> Well, that is intentional behavior. Marek is voting for making
>>>>>>>> soft recovered errors fatal as well while Michel is voting for
>>>>>>>> better ignoring them.
>>>>>>>>
>>>>>>>> I'm not really sure what to do. If you guys think that soft
>>>>>>>> recovered hangs should be fatal as well then we can certainly do
>>>>>>>> this.
>>>>>
>>>>> A possible compromise might be making soft resets fatal if they
>>>>> happen repeatedly (within a certain period of time?).
>>>>
>>>> No, no and no. Aside from introducing issues by side effects not
>>>> surfacing and all of the stuff I mentioned about descriptor buffers,
>>>> bda, draw indirect and stuff just resulting in more faults and hangs...
>>>>
>>>> You are proposing we throw out every promise we made to an application
>>>> on the API contract level because it "might work". That's just wrong!
>>>>
>>>> Let me put this in explicit terms: What you are proposing is in direct
>>>> violation of the GL and Vulkan specification.
>>>>
>>>> You can't just chose to break these contracts because you think it
>>>> 'might' be a better user experience.
>>>
>>> Is the original issue that motivated soft resets to be non-fatal even an
>>> issue anymore?
>>>
>>> If I read that old thread correctly, the rationale for that was that
>>> assigning guilt to a context was more broken than not doing it, because
>>> the compositor/Xwayland process would also crash despite being unrelated
>>> to the hang.
>>> With Joshua's Mesa fixes, this is not the case anymore, so I don't think
>>> keeping soft resets non-fatal provides any benefit to the user experience.
>>> The potential detriments to user experience have been outlined multiple
>>> times in this thread already.
>>>
>>> (I suppose if the compositor itself faults it might still bring down a
>>> session, but I've literally never seen that, and it's not like a
>>> compositor triggering segfaults on CPU stays alive either.)
>>
>> That's indeed what happened for me, multiple times. And each time the session continued running fine for days after the soft reset.
>>
>> But apparently my experience isn't valid somehow, and I should have been forced to log in again to please the GL gods...
>>
>>
>> Conversely, I can't remember hitting a case where an app kept running into soft resets. It's almost as if different people may have different experiences! ;)
> 
> Your anecdote of whatever application coincidentally managing to soldier through being hung is really not relevant.

But yours is, got it.

> It looks like Mutter already has some handling for GL robustness anyway...

That's just for suspend/resume with the nvidia driver. It can't recover from GPU hangs yet.

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer