On 25/05/2022 19:14, Lucas De Marchi wrote:
On Wed, May 25, 2022 at 05:03:13PM +0100, Tvrtko Ursulin wrote:
On 24/05/2022 18:51, Matt Roper wrote:
On Tue, May 24, 2022 at 10:43:39AM +0100, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Catch and log any garbage in the register, including no tiles
marked, or
multiple tiles marked.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Cc: Matt Roper <matthew.d.roper@xxxxxxxxx>
---
We caught garbage in DG1_MSTR_TILE_INTR with DG2 (actual value
0xF9D2C008)
during glmark and more badness. So I thought lets log all possible
failure
modes from here and also use per device logging.
---
drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++-----------
drivers/gpu/drm/i915/i915_reg.h | 1 +
2 files changed, 23 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_irq.c
b/drivers/gpu/drm/i915/i915_irq.c
index 73cebc6aa650..79853d3fc1ed 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2778,24 +2778,30 @@ static irqreturn_t dg1_irq_handler(int irq,
void *arg)
u32 gu_misc_iir;
if (!intel_irqs_enabled(i915))
- return IRQ_NONE;
+ goto none;
master_tile_ctl = dg1_master_intr_disable(regs);
- if (!master_tile_ctl) {
- dg1_master_intr_enable(regs);
- return IRQ_NONE;
+ if (!master_tile_ctl)
+ goto enable_none;
+
+ if (master_tile_ctl & ~(DG1_MSTR_IRQ | DG1_MSTR_TILE_MASK)) {
+ drm_warn(&i915->drm, "Garbage in master_tile_ctl: 0x%08x!\n",
+ master_tile_ctl);
I know we have a bunch of them already, but shouldn't we be avoiding
printk-based stuff like this inside interrupt handlers? Should we be
migrating all these error messages over to trace_printk or something
similar that's safer to use?
Not sure - I kind of think some really unexpected and worrying
situations should be loud and on by default. Risk is then spam if not
ratelimited. Maybe we should instead ratelimit most errors/warnings
coming for irq handlers?
In this particular case at least DRM_ERROR with no device info is the
odd one out in the entire file so I'd suggest changing at least that,
if the rest of my changes is of questionable benefit.
I'd rather remove the printk's from irq rather than adding more. At the
very
least, they should be the _once variant or ratelimited. One of the few
cases to even deserve a unlikely(), even to document this shouldn't
really be happening.
I would support ratelimited for all the unexpected bits set, no bits
set, or similar conditions. I wouldn't remove such printks to
micro-optimize things. That would potentially lose important feedback in
cases when we hit truly unexpected situations.
But annotating them as unlikely would be a good thing.
Our irq handlers (particularly on dgfx and multi-gt) are already too
long running... I don't like making them any onger or slower.
How come? I mean which interrupts are a problem there? Isn't GuC
supposed to be taking on that load on itself, isn't that one of the main
selling points?
Regards,
Tvrtko