Quoting Bloomfield, Jon (2019-10-02 14:52:32) > > > > -----Original Message----- > > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > Sent: Wednesday, October 2, 2019 4:20 AM > > To: intel-gfx@xxxxxxxxxxxxxxxxxxxxx > > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>; Joonas Lahtinen > > <joonas.lahtinen@xxxxxxxxxxxxxxx>; Winiarski, Michal > > <michal.winiarski@xxxxxxxxx>; Bloomfield, Jon <jon.bloomfield@xxxxxxxxx> > > Subject: [PATCH 20/30] drm/i915: Cancel non-persistent contexts on close > > > > Normally, we rely on our hangcheck to prevent persistent batches from > > hogging the GPU. However, if the user disables hangcheck, this mechanism > > breaks down. Despite our insistence that this is unsafe, the users are > > equally insistent that they want to use endless batches and will disable > > the hangcheck mechanism. We are looking at perhaps replacing hangcheck > > with a softer mechanism, that sends a pulse down the engine to check if > > it is well. We can use the same preemptive pulse to flush an active > > persistent context off the GPU upon context close, preventing resources > > being lost and unkillable requests remaining on the GPU after process > > termination. To avoid changing the ABI and accidentally breaking > > existing userspace, we make the persistence of a context explicit and > > enable it by default (matching current ABI). Userspace can opt out of > > persistent mode (forcing requests to be cancelled when the context is > > closed by process termination or explicitly) by a context parameter. To > > facilitate existing use-cases of disabling hangcheck, if the modparam is > > disabled (i915.enable_hangcheck=0), we disable persistence mode by > > default. (Note, one of the outcomes for supporting endless mode will be > > the removal of hangchecking, at which point opting into persistent mode > > will be mandatory, or maybe the default perhaps controlled by cgroups.) > > > > Testcase: igt/gem_ctx_persistence > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx> > > Cc: Michał Winiarski <michal.winiarski@xxxxxxxxx> > > Cc: Jon Bloomfield <jon.bloomfield@xxxxxxxxx> > > Reviewed-by: Jon Bloomfield <jon.bloomfield@xxxxxxxxx> > > --- > > drivers/gpu/drm/i915/Makefile | 1 + > > drivers/gpu/drm/i915/gem/i915_gem_context.c | 122 ++++++++++++++++++ > > drivers/gpu/drm/i915/gem/i915_gem_context.h | 15 +++ > > .../gpu/drm/i915/gem/i915_gem_context_types.h | 1 + > > .../gpu/drm/i915/gem/selftests/mock_context.c | 2 + > > .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 56 ++++++++ > > .../gpu/drm/i915/gt/intel_engine_heartbeat.h | 14 ++ > > drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 +- > > drivers/gpu/drm/i915/i915_priolist_types.h | 1 + > > include/uapi/drm/i915_drm.h | 15 +++ > > 10 files changed, 228 insertions(+), 1 deletion(-) > > create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c > > create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h > > > > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile > > index 1f87c70a2842..561fa2bb3006 100644 > > --- a/drivers/gpu/drm/i915/Makefile > > +++ b/drivers/gpu/drm/i915/Makefile > > @@ -76,6 +76,7 @@ gt-y += \ > > gt/intel_breadcrumbs.o \ > > gt/intel_context.o \ > > gt/intel_engine_cs.o \ > > + gt/intel_engine_heartbeat.o \ > > gt/intel_engine_pm.o \ > > gt/intel_engine_pool.o \ > > gt/intel_engine_sysfs.o \ > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c > > b/drivers/gpu/drm/i915/gem/i915_gem_context.c > > index 0ab416887fc2..e832238a72e5 100644 > > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c > > <SNIP> > > > +static int > > +set_persistence(struct i915_gem_context *ctx, > > + const struct drm_i915_gem_context_param *args) > > +{ > > + if (args->size) > > + return -EINVAL; > > + > > + if (args->value) { > > + i915_gem_context_set_persistence(ctx); > > + return 0; > > + } > > + > > + /* To cancel a context we use "preempt-to-idle" */ > > + if (!(ctx->i915->caps.scheduler & > > I915_SCHEDULER_CAP_PREEMPTION)) > > + return -ENODEV; > > + > > + i915_gem_context_clear_persistence(ctx); > > + return 0; > > +} > > Hmmn. Given that disabling hangcheck is an explicit operation, and we already change the default setting, can't we make it a hard requirement that persistence requires hangcheck? You should not really be able to opt back in to persistence if hangcheck is disabled. In fact you could just test for hangcheck when deciding whether to kill the context, and force-kill if it is off - that way if hangcheck is disabled after a context starts it will get cleaned up. I hear the toctou argument, but I really, really want to avoid coupling in the modparam as much possible (I'd rather kill off the modparam entirely for being too coarse). But certainly having a check here saying you cannot re-enable persistence without hangcheck seems valid. Let's see how that looks. (That takes a bit of lalala to ignore the toctou.) -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx