Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: > On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl: > Forcibly evict stale csb entries") where, presumably, due to a missing > Global Observation Point synchronisation, the write pointer of the CSB > ringbuffer is updated _prior_ to the contents of the ringbuffer. That is > we see the GPU report more context-switch entries for us to parse, but > those entries have not been written, leading us to process stale events, > and eventually report a hung GPU. > > However, this effect appears to be much more severe than we previously > saw on Icelake (though it might be best if we try the same approach > there as well and measure), and Bruce suggested the good idea of resetting > the CSB entry after use so that we can detect when it has been updated by > the GPU. By instrumenting how long that may be, we can set a reliable > upper bound for how long we should wait for: > > 513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns) > > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045 > References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries") References: HSDES#1508287568 > Suggested-by: Bruce Chang <yu.bruce.chang@xxxxxxxxx> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > Cc: Bruce Chang <yu.bruce.chang@xxxxxxxxx> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx # v5.4 > --- > drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++--- > 1 file changed, 18 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c > index d6e0f62337b4..d75712a503b7 100644 > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c > @@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last) > */ > static inline bool gen12_csb_parse(const u64 *csb) > { > - u64 entry = READ_ONCE(*csb); > - bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry)); > - bool new_queue = > + bool ctx_away_valid; > + bool new_queue; > + u64 entry; > + > + /* HSD#22011248461 */ s/220011248461/1508287568 > + entry = READ_ONCE(*csb); > + if (unlikely(entry == -1)) { > + preempt_disable(); As this seems to rather rare, consider falling back to mmio read for this entry without the poll? -Mika > + if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50)) > + GEM_WARN_ON("50us CSB timeout"); > + preempt_enable(); > + } > + WRITE_ONCE(*(u64 *)csb, -1); > + > + ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry)); > + new_queue = > lower_32_bits(entry) & GEN12_CTX_STATUS_SWITCHED_TO_NEW_QUEUE; > > /* > @@ -4004,6 +4017,8 @@ static void reset_csb_pointers(struct intel_engine_cs *engine) > WRITE_ONCE(*execlists->csb_write, reset_value); > wmb(); /* Make sure this is visible to HW (paranoia?) */ > > + /* Check that the GPU does indeed update the CSB entries! */ > + memset(execlists->csb_status, -1, (reset_value + 1) * sizeof(u64)); > invalidate_csb_entries(&execlists->csb_status[0], > &execlists->csb_status[reset_value]); > > -- > 2.20.1