Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

Christian König <christian.koenig@xxxxxxx> · Tue, 16 Jan 2024 16:05:45 +0100

Am 16.01.24 um 14:48 schrieb Joshua Ashton:

[SNIP]
Going to adjust the implementation accordingly.

Awesome, please CC me know when you have something.

Sure, going to keep that in mind.

In the short-term I have changed if (r && r != -ENODATA) to if (r) 
and that seems to work nicely for me.

One problem with solely relying on the CS submission return value 
from userspace is cancelled syncobj waits.

For example, if we have an application with one thread that makes a 
submission, and then kicks off a vkWaitSemaphores to wait on a 
semaphore on another thread and that submission hangs, the syncobj 
relating to the vkWaitSemaphores should be signalled which is fine, 
but we need to return VK_ERROR_DEVICE_LOST if the context loss 
resulted in the signal for the VkSemaphore.

The way this was previously integrated was with the query thing, 
which as we have established does not provide correct information 
regarding soft recovery at the moment.

Unless you have an alternative for us to get some error out of the 
syncobj (eg. -ENODATA), then right now we still require the query.

I think fixing the -ENODATA reporting back for submit is a good 
step, but I believe we still need the query to report the same 
information as we would have gotten from that here.

Yeah, exactly that was one of the reasons why the guilty handling 
approach didn't solved the problem sufficiently.

If I remember correctly at least for the syncobj used on Android you can 
actually query the result of the execution after waiting for them to signal.

So not only the issuer of a submission can get the result, but also 
everybody waiting for the result. In other words Wayland, X, etc... can 
implement graceful handling should an application send them nonsense.

I don't think we ever implemented something similar for drm_syncobj and 
we might need error forwarding in the dma_fence_chain container, but 
stuff like that is trivial to implement should the requirement for this 
ever come up.

Hmmm, actually the spec states that VK_SUCCESS is valid in this 
situation:

Commands that wait indefinitely for device execution (namely 
vkDeviceWaitIdle, vkQueueWaitIdle, vkWaitForFences with a maximum 
timeout, and vkGetQueryPoolResults with the VK_QUERY_RESULT_WAIT_BIT 
bit set in flags) must return in finite time even in the case of a 
lost device, and return either VK_SUCCESS or VK_ERROR_DEVICE_LOST.

...

Once a device is lost, command execution may fail, and certain 
commands that return a VkResult may return VK_ERROR_DEVICE_LOST.

I guess for now disregard last email regarding us potentially needing 
the query, it does seem that returning SUCCESS is completely valid.

Well how you interpret the information the kernel gives you in userspace 
is your thing. But we should at least make sure that the general 
approach inside the kernel has a design which can handle those requirements.

Christian.

- Joshie 🐸✨

- Joshie 🐸✨

Thanks

- Joshie 🐸✨

- Joshie 🐸✨

Christian.

Marek