Re: [PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 1/22/24 13:35, Christian König wrote:
Am 22.01.24 um 11:45 schrieb Friedrich Vock:
On 22.01.24 11:21, Friedrich Vock wrote:
On 22.01.24 11:10, Christian König wrote:
Am 19.01.24 um 20:18 schrieb Felix Kuehling:
On 2024-01-18 07:07, Christian König wrote:
Am 18.01.24 um 00:44 schrieb Friedrich Vock:
On 18.01.24 00:00, Alex Deucher wrote:
[SNIP]
No, amdgpu.noretry=1 does not change anything.

Well the good news first the hw engineer answered rather quickly.
The bad news is that the hardware really doesn't work as documented
in multiple ways.

First of all the CLEAR bit is a level and not a trigger, so the
intention to clear it is indeed correct. For now please modify this
patch so that the CLEAR bit is set and cleared directly after
setting it, this way we should be able to detect further overflows
immediately.

Then the APU the Steam Deck uses simply doesn't have the filter
function for page faults in the hardware, the really bad news is it
also doesn't have the extra IH rings where we could re-route the
faults to prevent overflows.

That full explains the behavior you have been seeing, but doesn't
really provide a doable solution to mitigate this problem.

I'm going to dig deeper into the hw documentation and specification
to see if we can use a different feature to avoid the overflow.

If we're not enabling retry faults, then each wave front should
generate at most one fault. You should be able to avoid overflows by
making the IH ring large enough to accommodate one fault per wave
front.

That is the exact same argument our HW engineers came up with when we
asked why the APU is missing all those nice IH ring overflow avoidance
features the dGPUs have :)

I can reproduce IH overflows on my RX 6700 XT dGPU as well FWIW.

Interesting data point. We have probably looked to much into the faults on MI* products and never checked Navi.

Can you try to just setting WPTR_OVERFLOW_ENABLE to 0? At least in theory that should disable IH overflows altogether on Navi without causing loss of IVs.


The only problem with this approach is that on Navi when a wave is
blocked by waiting on a fault you can't kill it using soft recovery
any more (at least when my understanding is correct).

Killing page-faulted waves via soft recovery works. From my testing on
Deck, it seems to take a bit of time, but if you try for long enough
soft recovery eventually succeeds.

Ok that is massively strange. We had tons of discussions about that shader can't be interrupted while they wait for a fault on Navi.

Maybe killing them is still possible, need to double check that as well.



On second thought, could it be that this is the critical flaw in the "at
most one fault per wave" thinking?

Well completely agree that this. That rational to leave out the new IH features on APUs is rather weak.


Most work submissions in practice submit more waves than the number of
wave slots the GPU has.
As far as I understand soft recovery, the only thing it does is kill all
active waves. This frees up the CUs so more waves are launched, which
can fault again, and that leads to potentially lots of faults for a
single wave slot in the end.

Exactly that, but killing each wave takes a moment since we do that in a loop with a bit delay in there.

So the interrupt handler should at least in theory have time to catch up.

I don't think there is any delay in that loop is there?

	while (!dma_fence_is_signaled(fence) &&
	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
		ring->funcs->soft_recovery(ring, vmid);

(soft_recovery function does not have a delay/sleep/whatever either)

FWIW, two other changes we did in SteamOS to make recovery more reliable on VANGOGH was:

1) Move the timeout determination after the spinlock setting the fence error.

2) Raise the timeout from 0.1s to 1s.

- Joshie 🐸✨



Regards,
Christian.


Regards,
Friedrich




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux