Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Wed, 07 Aug 2019 15:14:18 +0100

Quoting Bloomfield, Jon (2019-08-07 15:04:16)
> > -----Original Message-----
> > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Sent: Wednesday, August 7, 2019 6:23 AM
> > To: intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>; Winiarski, Michal
> > <michal.winiarski@xxxxxxxxx>; Bloomfield, Jon <jon.bloomfield@xxxxxxxxx>
> > Subject: Re: [PATCH 5/5] drm/i915: Cancel non-persistent contexts on close
> > 
> > Quoting Chris Wilson (2019-08-06 14:47:25)
> > > @@ -433,6 +482,8 @@ __create_context(struct drm_i915_private *i915)
> > >
> > >         i915_gem_context_set_bannable(ctx);
> > >         i915_gem_context_set_recoverable(ctx);
> > > +       if (i915_modparams.enable_hangcheck)
> > > +               i915_gem_context_set_persistence(ctx);
> > 
> > I am not fond of this, but from a pragmatic point of view, this does
> > prevent the issue Jon raised: HW resources being pinned indefinitely
> > past process termination.
> > 
> > I don't like it because we cannot perform the operation cleanly
> > everywhere, it requires preemption first and foremost (with a cooperating
> > submission backend) and per-engine reset. The alternative is to try and
> > do a full GPU reset if the context is still active. For the sake of the
> > issue raised, I think that (full reset on older HW) is required.
> > 
> > That we are baking in a change of ABI due to an unsafe modparam is meh.
> > There are a few more corner cases to deal with before endless just
> > works. But since it is being used in the wild, I'm not sure we can wait
> > for userspace to opt-in, or wait for cgroups. However, since users are
> > being encouraged to disable hangcheck, should we extend the concept of
> > persistence to also mean "survives hangcheck"? -- though it should be a
> > separate parameter, and I'm not sure how exactly to protect it from the
> > heartbeat reset without giving gross privileges to the context. (CPU
> > isolation is nicer from the pov where we can just give up and not even
> > worry about the engine. If userspace can request isolation, it has the
> > privilege to screw up.)
> 
> Ok, so your concern is supporting non-persistence on older non-preempting, engine-reset capable, hardware. Why is that strictly required? Can't we simply make it dependent on the features needed to do it well, and if your hardware cannot, then the advice is not to disable hangcheck? I'm doubtful that anyone would attempt a HPC type workload on n IVB.

Our advice was not to disable hangcheck :)

With the cat out of the bag, my concern is dotting the Is and crossing
the Ts. Fixing up the error handling path to the reset isn't all that
bad. But I'm not going to advertise the persistence-parameter support
unless we have a clean solution, and we can advise that compute
workloads are better handled with new hardware.

> I'm not sure I understand your "survives hangcheck" idea. You mean instead of simply disabling hangcheck, just opt in to persistence and have that also prevent hangcheck? Isn't that the wrong way around, since persistence is what is happening today?

Persistence is the clear-and-present danger. I'm just trying to sketch a
path for endless support, trying to ask myself questions such as: Is the
persistence parameter still required? What other parameters make sense?
Does anything less than CPU-esque isolation make sense? :)
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx