Quoting Bloomfield, Jon (2019-08-26 14:39:55) > > -----Original Message----- > > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > + /* > > + * Send a "high priority pulse" down the engine to cause the > > + * current request to be momentarily preempted. (If it fails to > > + * be preempted, it will be reset). As we have marked our context > > + * as banned, any incomplete request, including any running, will > > + * be skipped following the preemption. > > + */ > > + reset = 0; > > + for_each_engine_masked(engine, gt->i915, active, tmp) > > + if (intel_engine_pulse(engine)) > > + reset |= engine->mask; > > + > > + /* > > + * If we are unable to send a preemptive pulse to bump > > + * the context from the GPU, we have to resort to a full > > + * reset. We hope the collateral damage is worth it. > > + */ > > + if (reset) > > + intel_gt_handle_error(gt, reset, 0, > > + "context closure in %s", ctx->name); > > This seems inconsistent with the policy not to allow non-persistence without pre-emption, since if we can't pre-empt we nuke anyway. So, the only way to get here is if i915.enable_hangcheck=0 on older hardware; the user has forced ourselves into a situation we do not like. Having recognised that i915.enable_hangcheck=0 is a trivial way to accidentally dos (as opposed to the deliberate dos that is the expected behaviour), this is our mitigation. > But this feels unsafe to me - How does intel_gt_handle_error prevent us from nuking a following context, instead of the target? Ideally we would: > 1) Unqueue any context currently behind the target context > 2) Reset engine only if the target context is running (it could complete during 1) > 3) Requeue other contexts > > If the above is viable (?) we don't even need to attempt pre-emption. That's exactly what the safe strategy attempts before we fallback to the reset path. (With a bit of handwaving over gen8, it could do with a bit of refinement as it has kernel preemption, but not user preemption.) > > +static int > > +set_persistence(struct i915_gem_context *ctx, > > + const struct drm_i915_gem_context_param *args) > > +{ > > + if (args->size) > > + return -EINVAL; > > + > > + if (args->value) { > > + i915_gem_context_set_persistence(ctx); > > + return 0; > > + } > > + > > + /* To cancel a context we use "preempt-to-idle" */ > > + if (!(ctx->i915->caps.scheduler & > > Why do we need to give up on older devices? If you fail to preempt you reset the context anyway, so can't we just use the reset fallback path? The fallback path is to reset the entire gpu with no regard as to what is actually running. So we only allow the context parameter if we can safely kill the context on closure. -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx