Re: [PATCH] drm/i915/dg2: Catch and log more unexpected values in DG1_MSTR_TILE_INTR

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, May 25, 2022 at 05:03:13PM +0100, Tvrtko Ursulin wrote:
> 
> On 24/05/2022 18:51, Matt Roper wrote:
> > On Tue, May 24, 2022 at 10:43:39AM +0100, Tvrtko Ursulin wrote:
> > > From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > > 
> > > Catch and log any garbage in the register, including no tiles marked, or
> > > multiple tiles marked.
> > > 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > > Cc: Matt Roper <matthew.d.roper@xxxxxxxxx>
> > > ---
> > > We caught garbage in DG1_MSTR_TILE_INTR with DG2 (actual value 0xF9D2C008)
> > > during glmark and more badness. So I thought lets log all possible failure
> > > modes from here and also use per device logging.
> > > ---
> > >   drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++-----------
> > >   drivers/gpu/drm/i915/i915_reg.h |  1 +
> > >   2 files changed, 23 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > > index 73cebc6aa650..79853d3fc1ed 100644
> > > --- a/drivers/gpu/drm/i915/i915_irq.c
> > > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > > @@ -2778,24 +2778,30 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
> > >   	u32 gu_misc_iir;
> > >   	if (!intel_irqs_enabled(i915))
> > > -		return IRQ_NONE;
> > > +		goto none;
> > >   	master_tile_ctl = dg1_master_intr_disable(regs);
> > > -	if (!master_tile_ctl) {
> > > -		dg1_master_intr_enable(regs);
> > > -		return IRQ_NONE;
> > > +	if (!master_tile_ctl)
> > > +		goto enable_none;
> > > +
> > > +	if (master_tile_ctl & ~(DG1_MSTR_IRQ | DG1_MSTR_TILE_MASK)) {
> > > +		drm_warn(&i915->drm, "Garbage in master_tile_ctl: 0x%08x!\n",
> > > +			 master_tile_ctl);
> > 
> > I know we have a bunch of them already, but shouldn't we be avoiding
> > printk-based stuff like this inside interrupt handlers?  Should we be
> > migrating all these error messages over to trace_printk or something
> > similar that's safer to use?
> 
> Not sure - I kind of think some really unexpected and worrying situations
> should be loud and on by default. Risk is then spam if not ratelimited.
> Maybe we should instead ratelimit most errors/warnings coming for irq
> handlers?

It's not the risk of spam that's the problem, but rather that
printk-based stuff eventually calls into the console code to flush its
buffers.  That's way more overhead than you want in an interrupt handler
so it's bad on its own, but if you're using something slow like a serial
console, it becomes even more of a problem.

While the unexpected bits in the master tile register are strange and
may point to a bigger problem somewhere else, they're also harmless on
their own since we should just ignore those bits and only process the
valid tiles.

> 
> In this particular case at least DRM_ERROR with no device info is the odd
> one out in the entire file so I'd suggest changing at least that, if the
> rest of my changes is of questionable benefit.

Changing DRM_ERROR -> drm_err would probably be fine in the short term
since it doesn't really make us any worse off.  Changing to drm_warn
might not be great since we're generating a lot more lines of output and
probably multiplying the already bad overhead that shouldn't be
happening in an interrupt handler.  But if we could update the interrupt
handler to just save away the details and do the actual drm_warn later,
outside the interrupt handler code, that would be okay.  We should
probably work toward something like that for all of our interrupt
handler warning/error messages.


Matt

> 
> Regards,
> 
> Tvrtko
> 
> > 
> > 
> > Matt
> > 
> > > +		goto enable_none;
> > >   	}
> > >   	/* FIXME: we only support tile 0 for now. */
> > > -	if (master_tile_ctl & DG1_MSTR_TILE(0)) {
> > > -		master_ctl = raw_reg_read(regs, GEN11_GFX_MSTR_IRQ);
> > > -		raw_reg_write(regs, GEN11_GFX_MSTR_IRQ, master_ctl);
> > > -	} else {
> > > -		DRM_ERROR("Tile not supported: 0x%08x\n", master_tile_ctl);
> > > -		dg1_master_intr_enable(regs);
> > > -		return IRQ_NONE;
> > > +	if (REG_FIELD_GET(DG1_MSTR_TILE_MASK, master_tile_ctl) !=
> > > +	    DG1_MSTR_TILE(0)) {
> > > +		drm_warn(&i915->drm, "Unexpected irq from tile %u!\n",
> > > +			 ilog2(REG_FIELD_GET(DG1_MSTR_TILE_MASK,
> > > +					     master_tile_ctl)));
> > > +		goto enable_none;
> > >   	}
> > > +	master_ctl = raw_reg_read(regs, GEN11_GFX_MSTR_IRQ);
> > > +	raw_reg_write(regs, GEN11_GFX_MSTR_IRQ, master_ctl);
> > > +
> > >   	gen11_gt_irq_handler(gt, master_ctl);
> > >   	if (master_ctl & GEN11_DISPLAY_IRQ)
> > > @@ -2810,6 +2816,11 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
> > >   	pmu_irq_stats(i915, IRQ_HANDLED);
> > >   	return IRQ_HANDLED;
> > > +
> > > +enable_none:
> > > +	dg1_master_intr_enable(regs);
> > > +none:
> > > +	return IRQ_NONE;
> > >   }
> > >   /* Called from drm generic code, passed 'crtc' which
> > > diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> > > index d8579ab9384c..eefa301c6430 100644
> > > --- a/drivers/gpu/drm/i915/i915_reg.h
> > > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > > @@ -5774,6 +5774,7 @@
> > >   #define DG1_MSTR_TILE_INTR		_MMIO(0x190008)
> > >   #define   DG1_MSTR_IRQ			REG_BIT(31)
> > > +#define   DG1_MSTR_TILE_MASK		REG_GENMASK(3, 0)
> > >   #define   DG1_MSTR_TILE(t)		REG_BIT(t)
> > >   #define GEN11_DISPLAY_INT_CTL		_MMIO(0x44200)
> > > -- 
> > > 2.32.0
> > > 
> > 

-- 
Matt Roper
Graphics Software Engineer
VTT-OSGC Platform Enablement
Intel Corporation



[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux