Quoting Mika Kuoppala (2019-02-19 11:22:32)
> > diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> > index 1911e00d2581..bae88a4ea924 100644
> > --- a/drivers/gpu/drm/i915/i915_reset.c
> > +++ b/drivers/gpu/drm/i915/i915_reset.c
> > @@ -59,24 +59,29 @@ static void client_mark_guilty(struct drm_i915_file_private *file_priv,
> >
> >  static bool context_mark_guilty(struct i915_gem_context *ctx)
> >  {
> > -        unsigned int score;
> > -        bool banned, bannable;
> > +        unsigned long prev_hang;
> > +        bool banned;
> > +        int i;
> >
> >          atomic_inc(&ctx->guilty_count);
> >
> > -        bannable = i915_gem_context_is_bannable(ctx);
> > -        score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
> > -        banned = (!i915_gem_context_is_recoverable(ctx) ||
> > -                  score >= CONTEXT_SCORE_BAN_THRESHOLD);
> > -
> >          /* Cool contexts don't accumulate client ban score */
>
> This comment becomes misleading and can be removed.
> /* Cool contexts are too cool to be banned */
>
> > -        if (!bannable)
> > +        if (!i915_gem_context_is_bannable(ctx))
> >                  return false;
> >
> > +        /* Record the timestamp for the last N hangs */
> > +        prev_hang = ctx->hang_timestamp[0];
> > +        for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp) - 1; i++)
> > +                ctx->hang_timestamp[i] = ctx->hang_timestamp[i + 1];
> > +        ctx->hang_timestamp[i] = jiffies;
> > +
> > +        /* If we have hung N+1 times in rapid succession, we ban the context! */
> > +        banned = !i915_gem_context_is_recoverable(ctx);
> > +        if (time_before(jiffies, prev_hang + CONTEXT_FAST_HANG_JIFFIES))
> > +                banned = true;
>
> Ok, the initialization to jiffies - CONTEXT_FAST_HANG_JIFFIES guarantees
> that it can't be banned even if it manages to hang immediately, which is
> in itself a difficult feat to accomplish.
>
> >          if (banned) {
> > -                DRM_DEBUG_DRIVER("context %s: guilty %d, score %u, banned\n",
> > -                                 ctx->name, atomic_read(&ctx->guilty_count),
> > -                                 score);
> > +                DRM_DEBUG_DRIVER("context %s: guilty %d, banned\n",
> > +                                 ctx->name,
> > +                                 atomic_read(&ctx->guilty_count));
>
> Now that we know when it previously hanged, we could improve the
> logging a bit by saying how long ago. But I'm not sure that is
> worth the trouble. ~o~

Not sure yet if that will be helpful. The secondary hangs do not get
captured in the error state, so for post-mortem debugging we don't know
whether the context is banned (yet). For clients trying to run, the
first they know about being banned is -EIO from gem_execbuf. Looking
back at dmesg, there's just a string of GPU resets; by then it is
already too late to add drm.debug=0x2.

-EIO means either the client is banned or the driver is wedged. In
either case, the first course of action is to debug the initial hang.
(Or, in the case of the mesa recovery patches, to recreate the context
and try again.) From the perspective of the current patches, we are
teaching clients how to handle such events without "disturbing" the
user, which is even less incentive to pour a large amount of debugging
into the secondary banning; fix the primary hang!

Yeah, I'm thinking that we don't need much more than some driver
debugging to clarify that the context was banned; everything else
should be discoverable through the uapi, and the real debugging is
always of the first hang.
-Chris
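
To make the window logic in the hunk above concrete, here is a minimal
userspace sketch, not the driver code itself: NUM_HANGS and
FAST_HANG_TICKS are illustrative stand-ins for
ARRAY_SIZE(ctx->hang_timestamp) and CONTEXT_FAST_HANG_JIFFIES, and
time_before() is redefined locally so it compiles outside the kernel.
It also demonstrates Mika's point that seeding the timestamps one full
window in the past means the very first hang can never trip the ban.

/*
 * Standalone sketch of the fast-hang ban window; the constants are
 * made up for illustration, not taken from i915.
 */
#include <stdio.h>

#define NUM_HANGS       2      /* remember the last N hangs */
#define FAST_HANG_TICKS 30000  /* ban window, in fake jiffies */

/* wrap-safe jiffies comparison, same shape as the kernel macro */
#define time_before(a, b) ((long)((a) - (b)) < 0)

static unsigned long hang_timestamp[NUM_HANGS];

static void init_context(unsigned long now)
{
        int i;

        /*
         * Seeding the history to "one full window ago" is the point
         * raised above: the first hang can never land inside the
         * fast-hang window, however quickly the context hangs.
         */
        for (i = 0; i < NUM_HANGS; i++)
                hang_timestamp[i] = now - FAST_HANG_TICKS;
}

static int mark_hang(unsigned long now)
{
        unsigned long prev_hang = hang_timestamp[0];
        int i;

        /* shift out the oldest stamp, record the newest */
        for (i = 0; i < NUM_HANGS - 1; i++)
                hang_timestamp[i] = hang_timestamp[i + 1];
        hang_timestamp[i] = now;

        /* banned once N+1 hangs fit inside a single window */
        return time_before(now, prev_hang + FAST_HANG_TICKS);
}

int main(void)
{
        unsigned long now = 100000;

        init_context(now);
        printf("hang 1 banned? %d\n", mark_hang(now));      /* 0: seeded history */
        printf("hang 2 banned? %d\n", mark_hang(now + 10)); /* 0: prev stamp still seeded */
        printf("hang 3 banned? %d\n", mark_hang(now + 20)); /* 1: three hangs in 20 ticks */
        return 0;
}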
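
The client-side recovery flow described in the reply (-EIO means banned
or wedged; recreate the context and retry, as the mesa recovery patches
do) can be sketched the same way. This is an illustration under stated
assumptions, not real API usage: create_context(), destroy_context()
and execbuf() are hypothetical stand-ins for the i915 context and
execbuffer ioctls, with stub bodies that fake a banned context so the
program runs.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* stubs simulating the kernel: context 1 is banned, new contexts work */
static uint32_t next_ctx = 1;

static uint32_t create_context(int fd)
{
        (void)fd;
        return ++next_ctx;
}

static void destroy_context(int fd, uint32_t ctx)
{
        (void)fd;
        (void)ctx;
}

static int execbuf(int fd, uint32_t ctx, const void *batch)
{
        (void)fd;
        (void)batch;
        return ctx == 1 ? -EIO : 0; /* pretend context 1 was banned */
}

static int submit_with_recovery(int fd, uint32_t *ctx, const void *batch)
{
        int err = execbuf(fd, *ctx, batch);

        if (err != -EIO)
                return err;

        /*
         * Recreate the context and retry once. A fresh context works
         * after a ban; if it also gets -EIO, the driver is wedged and
         * there is nothing more a client can do.
         */
        destroy_context(fd, *ctx);
        *ctx = create_context(fd);

        err = execbuf(fd, *ctx, batch);
        if (err == -EIO)
                fprintf(stderr, "GPU wedged; giving up\n");
        return err;
}

int main(void)
{
        uint32_t ctx = 1; /* pretend this context has been banned */
        int err = submit_with_recovery(-1, &ctx, "batch");

        printf("submit returned %d with context %u\n", err, ctx);
        return err ? 1 : 0;
}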