On Tue, Jan 03, 2017 at 01:17:19PM +0000, Tvrtko Ursulin wrote: > > On 03/01/2017 12:38, Chris Wilson wrote: > >On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote: > >> > >>On 03/01/2017 12:13, Chris Wilson wrote: > >>>On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote: > >>>> > >>>>On 03/01/2017 11:46, Chris Wilson wrote: > >>>>>On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote: > >>>>>> > >>>>>>On 03/01/2017 11:05, Chris Wilson wrote: > >>>>>>>The struct dma_fence carries a status field exposed to userspace by > >>>>>>>sync_file. This is inspected after the fence is signaled and can convey > >>>>>>>whether or not the request completed successfully, or in our case if we > >>>>>>>detected a hang during the request (signaled via -EIO in > >>>>>>>SYNC_IOC_FILE_INFO). > >>>>>>> > >>>>>>>Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > >>>>>>>Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx> > >>>>>>>Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> > >>>>>>>--- > >>>>>>>drivers/gpu/drm/i915/i915_gem.c | 6 ++++-- > >>>>>>>1 file changed, 4 insertions(+), 2 deletions(-) > >>>>>>> > >>>>>>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > >>>>>>>index 204c4a673bf3..bc99c0e292d8 100644 > >>>>>>>--- a/drivers/gpu/drm/i915/i915_gem.c > >>>>>>>+++ b/drivers/gpu/drm/i915/i915_gem.c > >>>>>>>@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine) > >>>>>>> ring_hung = false; > >>>>>>> } > >>>>>>> > >>>>>>>- if (ring_hung) > >>>>>>>+ if (ring_hung) { > >>>>>>> i915_gem_context_mark_guilty(request->ctx); > >>>>>>>- else > >>>>>>>+ request->fence.status = -EIO; > >>>>>>>+ } else { > >>>>>>> i915_gem_context_mark_innocent(request->ctx); > >>>>>>>+ } > >>>>>>> > >>>>>>> if (!ring_hung) > >>>>>>> return; > >>>>>>> > >>>>>> > >>>>>>Reading what happens later in this function, should we set the > >>>>>>status of all the other requests we are about to clear? > >>>>>> > >>>>>>However one thing I don't understand is how this scheme interacts > >>>>>>with the current userspace. We will clear/no-nop some of the > >>>>>>submitted requests since the state is corrupt. But how will > >>>>>>userspace notice this before it submits more requets? > >>>>> > >>>>>There is no mechanism currently for user space to be able to detect a > >>>>>hung request. (It can use the uevent for async notification of the > >>>>>hang/reset, but that will not tell you who caused the hang.) Userspace > >>>>>can track the number of hangs it caused, but the delay makes any > >>>>>roundtripping impractical (i.e. you have to synchronise before all > >>>>>rendering if you must detect the event immediately). Note also that we > >>>>>do not want to give out interprocess information (i.e. to allow one > >>>>>client to spy on another), which makes things harder to get right. > >>>> > >>>>So idea is to clear already submitted requests _if_ the userspace is > >>>>synchronising before all rendering and looking at reset stats, to > >>>>make it theoretically possible to detect the corrupt state? > >>> > >>>No, I'm just don't see a way that userspace can detect the hang without > >>>testing after seeing the request signaled (either by waiting on the > >>>batch or by waiting on the fence), i.e. by being completely synchronous > >>>(or at least chosing its synchronous points very carefully, such as > >>>around IPC). It can either poll reset-count or use sync_file (which > >>>requires fence exporting). > >>> > >>>The current robustness interfaces is a basic query on whether any reset > >>>occurred within the context, not when. > >> > >>Why do we bother with clearing the submitted requests then? > > > >The same reason we ban processes from submitting new requests if they > >cause repeated hangs. If before we ban that client, it has already > >submitted 1000 hanging requests, it has successfully locked the machine > >up for a couple of hours. > > So we would need to gate clearing on the transition to banned state > I think. Because currently it does in unconditionally. Yes, see the other patch :) -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx