Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Tue, 3 Jan 2017 13:25:22 +0000



On Tue, Jan 03, 2017 at 01:17:19PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/01/2017 12:38, Chris Wilson wrote:
> >On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 03/01/2017 12:13, Chris Wilson wrote:
> >>>On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
> >>>>
> >>>>On 03/01/2017 11:46, Chris Wilson wrote:
> >>>>>On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
> >>>>>>
> >>>>>>On 03/01/2017 11:05, Chris Wilson wrote:
> >>>>>>>The struct dma_fence carries a status field exposed to userspace by
> >>>>>>>sync_file. This is inspected after the fence is signaled and can convey
> >>>>>>>whether or not the request completed successfully, or in our case if we
> >>>>>>>detected a hang during the request (signaled via -EIO in
> >>>>>>>SYNC_IOC_FILE_INFO).
> >>>>>>>
> >>>>>>>Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>>>>>>Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>>>>>>Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> >>>>>>>---
> >>>>>>>drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
> >>>>>>>1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>>>index 204c4a673bf3..bc99c0e292d8 100644
> >>>>>>>--- a/drivers/gpu/drm/i915/i915_gem.c
> >>>>>>>+++ b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>>>@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
> >>>>>>>		ring_hung = false;
> >>>>>>>	}
> >>>>>>>
> >>>>>>>-	if (ring_hung)
> >>>>>>>+	if (ring_hung) {
> >>>>>>>		i915_gem_context_mark_guilty(request->ctx);
> >>>>>>>-	else
> >>>>>>>+		request->fence.status = -EIO;
> >>>>>>>+	} else {
> >>>>>>>		i915_gem_context_mark_innocent(request->ctx);
> >>>>>>>+	}
> >>>>>>>
> >>>>>>>	if (!ring_hung)
> >>>>>>>		return;
> >>>>>>>
> >>>>>>
> >>>>>>Reading what happens later in this function, should we set the
> >>>>>>status of all the other requests we are about to clear?
> >>>>>>
> >>>>>>However one thing I don't understand is how this scheme interacts
> >>>>>>with the current userspace. We will clear/no-nop some of the
> >>>>>>submitted requests since the state is corrupt. But how will
> >>>>>>userspace notice this before it submits more requets?
> >>>>>
> >>>>>There is no mechanism currently for user space to be able to detect a
> >>>>>hung request. (It can use the uevent for async notification of the
> >>>>>hang/reset, but that will not tell you who caused the hang.) Userspace
> >>>>>can track the number of hangs it caused, but the delay makes any
> >>>>>roundtripping impractical (i.e. you have to synchronise before all
> >>>>>rendering if you must detect the event immediately). Note also that we
> >>>>>do not want to give out interprocess information (i.e. to allow one
> >>>>>client to spy on another), which makes things harder to get right.
> >>>>
> >>>>So idea is to clear already submitted requests _if_ the userspace is
> >>>>synchronising before all rendering and looking at reset stats, to
> >>>>make it theoretically possible to detect the corrupt state?
> >>>
> >>>No, I'm just don't see a way that userspace can detect the hang without
> >>>testing after seeing the request signaled (either by waiting on the
> >>>batch or by waiting on the fence), i.e. by being completely synchronous
> >>>(or at least chosing its synchronous points very carefully, such as
> >>>around IPC). It can either poll reset-count or use sync_file (which
> >>>requires fence exporting).
> >>>
> >>>The current robustness interfaces is a basic query on whether any reset
> >>>occurred within the context, not when.
> >>
> >>Why do we bother with clearing the submitted requests then?
> >
> >The same reason we ban processes from submitting new requests if they
> >cause repeated hangs. If before we ban that client, it has already
> >submitted 1000 hanging requests, it has successfully locked the machine
> >up for a couple of hours.
> 
> So we would need to gate clearing on the transition to banned state
> I think. Because currently it does in unconditionally.

Yes, see the other patch :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx