Re: [PATCH 1/5] drm/amdgpu: handle IH ring1 overflow

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Thu, 11 Nov 2021 14:43:01 +0100

Am 11.11.21 um 13:13 schrieb Felix Kuehling:
Am 2021-11-11 um 2:00 a.m. schrieb Christian König:
Am 11.11.21 um 00:36 schrieb Felix Kuehling:
On 2021-11-10 9:31 a.m., Christian König wrote:
[SNIP]
Aren't we processing interrupts out-of-order in this case. We're
processing newer ones before older ones. Is that the root of the
problem because it confuses our interrupt draining function?
Good point.

Maybe we need to detect overflows in the interrupt draining function
to make it wait longer in that case.
Ideally we should use something which is completely separate from all
those implementation details.

Like for example using the timestamp or a separate indicator/counter
instead.
Even a timestamp will be broken if the interrupt processing function
handles interrupts out of order.

We can easily detect if the timestamp is going backwards and so filter 
out stale entries.

I think we need a different amdgpu_ih_process function for IH ring 1 the
way we set it up to handle overflows. Because IH is just overwriting
older entries, and we can't read entire IH entries atomically, we have
to use a watermark. If IH WPTR passes the watermark, we have to consider
the ring overflowed and reset our RPTR. We have to set RPTR far enough
"ahead" of the current WPTR to make sure WPTR is under the watermark.
And the watermark needs to be low enough to ensure amdgpu_irq_dispatch
can read out the next IH entry before the WPTR catches up with the RPTR.

Since we don't read the WPTR on every iteration, and out page fault
handling code can take quite a while to process one fault, the watermark
needs to provide a lot of buffer. Maybe we also need to read the WPTR
register more frequently if the last read was more than a jiffy ago.

I think trying to solve that with the IH code or hardware is the 
completely wrong approach.

As I said before we need to something more robust and using the 
timestamp sounds like the most logical approach to me.

The only alternative I can see would be to have a hardware assisted flag 
which tells you if you had an overflow or not like we have for IH ring 0.

E.g. something like the following:
1. Copy the next N IV from the RPTR location.
2. Get the current WPTR.
3. If the overflow bit in the WPTR is set update the RPTR to something 
like WPTR+window, clear the overflow bit and repeat at #1.
4. Process the valid IVs.

The down side is that we are loosing a lot of IVs with that. That is 
probably ok for the current approach, but most likely a bad idea if we 
enable the CAM.

Regards,
Christian.

Whenever an overflow (over the watermark) is detected, we can set a
sticky overflow bit that our page fault handling code can use to clean
up. E.g. once we start using the CAM filter, we'll have to invalidate
all CAM entries when this happens (although I'd expect overflows to
become impossible once we enable the CAM filter).

Thanks,
   Felix