Re: [PATCH 1/5] drm/amdgpu: handle IH ring1 overflow

philip yang <yangp@xxxxxxx> · Thu, 11 Nov 2021 16:31:20 -0500

    On 2021-11-11 8:57 a.m., Felix Kuehling
      wrote:

      Am 2021-11-11 um 8:43 a.m. schrieb Christian König:

        Am 11.11.21 um 13:13 schrieb Felix Kuehling:

          Am 2021-11-11 um 2:00 a.m. schrieb Christian König:

            Am 11.11.21 um 00:36 schrieb Felix Kuehling:

              On 2021-11-10 9:31 a.m., Christian König wrote:
[SNIP]
Aren't we processing interrupts out-of-order in this case. We're
processing newer ones before older ones. Is that the root of the
problem because it confuses our interrupt draining function?

            Good point.

              Maybe we need to detect overflows in the interrupt draining function
to make it wait longer in that case.

            Ideally we should use something which is completely separate from all
those implementation details.

Like for example using the timestamp or a separate indicator/counter
instead.

          Even a timestamp will be broken if the interrupt processing function
handles interrupts out of order.

        We can easily detect if the timestamp is going backwards and so filter
out stale entries.

          I think we need a different amdgpu_ih_process function for IH ring 1 the
way we set it up to handle overflows. Because IH is just overwriting
older entries, and we can't read entire IH entries atomically, we have
to use a watermark. If IH WPTR passes the watermark, we have to consider
the ring overflowed and reset our RPTR. We have to set RPTR far enough
"ahead" of the current WPTR to make sure WPTR is under the watermark.
And the watermark needs to be low enough to ensure amdgpu_irq_dispatch
can read out the next IH entry before the WPTR catches up with the RPTR.

Since we don't read the WPTR on every iteration, and out page fault
handling code can take quite a while to process one fault, the watermark
needs to provide a lot of buffer. Maybe we also need to read the WPTR
register more frequently if the last read was more than a jiffy ago.

        I think trying to solve that with the IH code or hardware is the
completely wrong approach.

As I said before we need to something more robust and using the
timestamp sounds like the most logical approach to me.

The only alternative I can see would be to have a hardware assisted
flag which tells you if you had an overflow or not like we have for IH
ring 0.

      The *_ih_get_wptr functions already do some overflow handling. I think
we'll need a function to read the overflow bit that amdgpu_ih_process
can call separately, after reading IH entries.

    Tried to increase ring1 buf size from 4KB to 256KB, overflow
      still happens, seems watermark is not feasible as recover fault
      takes longer period sometime. We already have 48bit timestamp in
      IV, I will try use it to check overflow, and update rptr to try
      catch up

        E.g. something like the following:
1. Copy the next N IV from the RPTR location.
2. Get the current WPTR.
3. If the overflow bit in the WPTR is set update the RPTR to something
like WPTR+window, clear the overflow bit and repeat at #1.
4. Process the valid IVs.

      OK. Current amdgpu_irq_dispatch reads directly from the IH ring. I think
you're proposing to move reading of the ring into amdgpu_ih_process
where we can discard the entries if an overflow is detected.

Then let amdgpu_irq_dispatch use a copy that's guaranteed to be consistent.

    In amdgpu_ih_process (may add new function for ring1), after
      reading wptr, check if wptr overflow and update rptr

    if (ring[rptr - 1].timestamp > ring[rptr].timestamp)
         rptr = wptr + 1
    This may still process retry fault out of order, but drain fault
      will finish correctly with condition rptr >= checkpoint_wptr,
      we will not process stale fault after range is freed.

    Regards,
    Philip

        The down side is that we are loosing a lot of IVs with that. That is
probably ok for the current approach, but most likely a bad idea if we
enable the CAM.

      Right. Once we use the CAM we cannot afford to lose faults. If we do, we
need to clear the CAM.

Regards,
  Felix

        Regards,
Christian.

          Whenever an overflow (over the watermark) is detected, we can set a
sticky overflow bit that our page fault handling code can use to clean
up. E.g. once we start using the CAM filter, we'll have to invalidate
all CAM entries when this happens (although I'd expect overflows to
become impossible once we enable the CAM filter).

Thanks,
   Felix