Re: [PATCH] drm/i915/dg2: Catch and log more unexpected values in DG1_MSTR_TILE_INTR

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Mon, 6 Jun 2022 12:55:20 +0100

On 27/05/2022 19:42, Matt Roper wrote:
On Thu, May 26, 2022 at 11:18:17AM +0100, Tvrtko Ursulin wrote:
On 25/05/2022 19:05, Matt Roper wrote:
On Wed, May 25, 2022 at 05:03:13PM +0100, Tvrtko Ursulin wrote:

On 24/05/2022 18:51, Matt Roper wrote:
On Tue, May 24, 2022 at 10:43:39AM +0100, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>

Catch and log any garbage in the register, including no tiles marked, or
multiple tiles marked.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Cc: Matt Roper <matthew.d.roper@xxxxxxxxx>
---
We caught garbage in DG1_MSTR_TILE_INTR with DG2 (actual value 0xF9D2C008)
during glmark and more badness. So I thought lets log all possible failure
modes from here and also use per device logging.
---
    drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++-----------
    drivers/gpu/drm/i915/i915_reg.h |  1 +
    2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 73cebc6aa650..79853d3fc1ed 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2778,24 +2778,30 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
    	u32 gu_misc_iir;
    	if (!intel_irqs_enabled(i915))
-		return IRQ_NONE;
+		goto none;
    	master_tile_ctl = dg1_master_intr_disable(regs);
-	if (!master_tile_ctl) {
-		dg1_master_intr_enable(regs);
-		return IRQ_NONE;
+	if (!master_tile_ctl)
+		goto enable_none;
+
+	if (master_tile_ctl & ~(DG1_MSTR_IRQ | DG1_MSTR_TILE_MASK)) {
+		drm_warn(&i915->drm, "Garbage in master_tile_ctl: 0x%08x!\n",
+			 master_tile_ctl);

I know we have a bunch of them already, but shouldn't we be avoiding
printk-based stuff like this inside interrupt handlers?  Should we be
migrating all these error messages over to trace_printk or something
similar that's safer to use?

Not sure - I kind of think some really unexpected and worrying situations
should be loud and on by default. Risk is then spam if not ratelimited.
Maybe we should instead ratelimit most errors/warnings coming for irq
handlers?

It's not the risk of spam that's the problem, but rather that
printk-based stuff eventually calls into the console code to flush its
buffers.  That's way more overhead than you want in an interrupt handler
so it's bad on its own, but if you're using something slow like a serial
console, it becomes even more of a problem.

Is it a problem for messages which we never expect to see?

Kind of.  While not as catastrophic, it's the same argument for why we
don't use BUG() anymore...when the impossible does manage to happen
there's unnecessary collateral damage on things outside of graphics.  If
we're adding huge delays inside an interrupt handler (while other
interrupts are disabled) that impacts the system-wide usability, not
just our own driver.

I'd also argue that these messages actually are semi-expected.  Random
bits being set shouldn't happen, but in the world of dgpu's, we do
occasionally see cases where the PCI link itself goes down for reasons
outside our control and then all registers read back as 0xFFFFFFFF,
which will probably trigger error messages here (as well as a bunch of
other places).

Could you expand a bit on what is semi-expected and when? I mean the 
circumstances of PCI link going down. We certainly don't have any code 
to survive that.

While the unexpected bits in the master tile register are strange and
may point to a bigger problem somewhere else, they're also harmless on
their own since we should just ignore those bits and only process the
valid tiles.

Yes, I was expecting that a patch belonging to multi-tile enablement would
be incoming soon, which would be changing:

+	if (REG_FIELD_GET(DG1_MSTR_TILE_MASK, master_tile_ctl) !=
+	    DG1_MSTR_TILE(0)) {
+		drm_warn(&i915->drm, "Unexpected irq from tile %u!\n",
+			 ilog2(REG_FIELD_GET(DG1_MSTR_TILE_MASK,
+					     master_tile_ctl)));
+		goto enable_none;
  	}

 From this patch, into something completely different like walking bit by
bit, handling the present tiles, and warning on unexpected ones. What should
remain though is warning on no tiles signaled (which what we saw, together
with garbage in reserved bits).

Yeah.  Although I still feel the interrupt handler should really just be
flagging the errors so that the actual prints themselves can happen
outside the interrupt.


In this particular case at least DRM_ERROR with no device info is the odd
one out in the entire file so I'd suggest changing at least that, if the
rest of my changes is of questionable benefit.

Changing DRM_ERROR -> drm_err would probably be fine in the short term
since it doesn't really make us any worse off.  Changing to drm_warn
might not be great since we're generating a lot more lines of output and

Sorry I don't follow - why does replacing drm_err with drm_warn generate (a
lot) more lines of output?

Sorry, my mistake; I had it in my mind that we were talking about a
drm_WARN_ON rather than just drm_warn (i.e., including a big stacktrace
and such).  DRM_ERROR -> drm_warn alone shouldn't have any extra
negative impact.


But it can be drm_err for all I care, I don't think we really have
consistent story between errors and warnings in this area.

probably multiplying the already bad overhead that shouldn't be
happening in an interrupt handler.  But if we could update the interrupt
handler to just save away the details and do the actual drm_warn later,
outside the interrupt handler code, that would be okay.  We should
probably work toward something like that for all of our interrupt
handler warning/error messages.

Not sure I agree - for messages which we don't expect to see it doesn't
really matter that there will be overhead when they are hit. Presumably bad
things are already happening there so spending effort to optimise those path
is questionable.

Something bad is happening to graphics is we hit one of these cases.
But if we start doing prints while interrupts are disabled, we start
having more of a negative impact on the rest of the system too.

Truly for the case of this particular patch I don't think we should 
care. Rate limiting should be all that is needed in the short term to 
strike a balance between effort and benefit. But lets first clarify the 
PCI link going down problem.

Regards,

Tvrtko