Am 2021-11-11 um 8:43 a.m. schrieb Christian König: > Am 11.11.21 um 13:13 schrieb Felix Kuehling: >> Am 2021-11-11 um 2:00 a.m. schrieb Christian König: >>> Am 11.11.21 um 00:36 schrieb Felix Kuehling: >>>> On 2021-11-10 9:31 a.m., Christian König wrote: >>>> [SNIP] >>>> Aren't we processing interrupts out-of-order in this case. We're >>>> processing newer ones before older ones. Is that the root of the >>>> problem because it confuses our interrupt draining function? >>> Good point. >>> >>>> Maybe we need to detect overflows in the interrupt draining function >>>> to make it wait longer in that case. >>> Ideally we should use something which is completely separate from all >>> those implementation details. >>> >>> Like for example using the timestamp or a separate indicator/counter >>> instead. >> Even a timestamp will be broken if the interrupt processing function >> handles interrupts out of order. > > We can easily detect if the timestamp is going backwards and so filter > out stale entries. > >> I think we need a different amdgpu_ih_process function for IH ring 1 the >> way we set it up to handle overflows. Because IH is just overwriting >> older entries, and we can't read entire IH entries atomically, we have >> to use a watermark. If IH WPTR passes the watermark, we have to consider >> the ring overflowed and reset our RPTR. We have to set RPTR far enough >> "ahead" of the current WPTR to make sure WPTR is under the watermark. >> And the watermark needs to be low enough to ensure amdgpu_irq_dispatch >> can read out the next IH entry before the WPTR catches up with the RPTR. >> >> Since we don't read the WPTR on every iteration, and out page fault >> handling code can take quite a while to process one fault, the watermark >> needs to provide a lot of buffer. Maybe we also need to read the WPTR >> register more frequently if the last read was more than a jiffy ago. > > I think trying to solve that with the IH code or hardware is the > completely wrong approach. > > As I said before we need to something more robust and using the > timestamp sounds like the most logical approach to me. > > The only alternative I can see would be to have a hardware assisted > flag which tells you if you had an overflow or not like we have for IH > ring 0. The *_ih_get_wptr functions already do some overflow handling. I think we'll need a function to read the overflow bit that amdgpu_ih_process can call separately, after reading IH entries. > > E.g. something like the following: > 1. Copy the next N IV from the RPTR location. > 2. Get the current WPTR. > 3. If the overflow bit in the WPTR is set update the RPTR to something > like WPTR+window, clear the overflow bit and repeat at #1. > 4. Process the valid IVs. OK. Current amdgpu_irq_dispatch reads directly from the IH ring. I think you're proposing to move reading of the ring into amdgpu_ih_process where we can discard the entries if an overflow is detected. Then let amdgpu_irq_dispatch use a copy that's guaranteed to be consistent. > > The down side is that we are loosing a lot of IVs with that. That is > probably ok for the current approach, but most likely a bad idea if we > enable the CAM. Right. Once we use the CAM we cannot afford to lose faults. If we do, we need to clear the CAM. Regards, Felix > > Regards, > Christian. > >> >> Whenever an overflow (over the watermark) is detected, we can set a >> sticky overflow bit that our page fault handling code can use to clean >> up. E.g. once we start using the CAM filter, we'll have to invalidate >> all CAM entries when this happens (although I'd expect overflows to >> become impossible once we enable the CAM filter). >> >> Thanks, >> Felix >> >