On Wed, Mar 24, 2021 at 12:13:28PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
>
> "Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
> post of a somewhat controversial feature, now upgraded to patch status.
>
> I quote the "watchdog" because in classical sense watchdog would allow userspace
> to ping it and so remain alive.
>
> I quote "restoring hangcheck" because this series, contrary to the old
> hangcheck, is not looking at whether the workload is making any progress from
> the kernel side either. (Although disclaimer my memory may be leaky - Daniel
> suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
> But apart from being prone to both false negatives and false positives I can't
> remember that myself.)
>
> Short version - ask is to fail any user submissions after a set time period. In
> this RFC that time is twelve seconds.
>
> Time counts from the moment user submission is "runnable" (implicit and explicit
> dependencies have been cleared) and keeps counting regardless of the GPU
> contention caused by other users of the system.
>
> So semantics are really a bit weak, but again, I understand this is really
> really wanted by the DRM core even if I am not convinced it is a good idea.
>
> There are some dangers with doing this - text borrowed from a patch in the
> series:
>
>   This can have an effect that workloads which used to work fine will
>   suddenly start failing. Even workloads comprised of short batches but in
>   long dependency chains can be terminated.
>
>   And because of lack of agreement on usefulness and safety of fence error
>   propagation this partial execution can be invisible to userspace even if
>   it is "listening" to returned fence status.
>
>   Another interaction is with hangcheck where care needs to be taken timeout
>   is not set lower or close to three times the heartbeat interval.
>   Otherwise a hang in any application can cause complete termination of all
>   submissions from unrelated clients. Any users modifying the per engine
>   heartbeat intervals therefore need to be aware of this potential denial of
>   service to avoid inadvertently enabling it.
>
> Given all this I am personally not convinced the scheme is a good idea.
> Intuitively it feels object importers would be better positioned to
> enforce the time they are willing to wait for something to complete.
>
> v2:
>  * Dropped context param.
>  * Improved commit messages and Kconfig text.
>
> v3:
>  * Log timeouts.
>  * Bump timeout to 20s to see if it helps Tigerlake.

I think 20s is a bit much, and it seems like the problem is still there in
igt. I think we need to look at that and figure out what to do with it. And
then go back down with the timeout somewhat again, since 20s is quite a long
time. Irrespective of all the additional gaps/opens around watchdog timeout.
-Daniel

>  * Fix sentinel assert.
>
> v4:
>  * A round of review feedback applied.
>
> Chris Wilson (1):
>   drm/i915: Individual request cancellation
>
> Tvrtko Ursulin (6):
>   drm/i915: Extract active lookup engine to a helper
>   drm/i915: Restrict sentinel requests further
>   drm/i915: Handle async cancellation in sentinel assert
>   drm/i915: Request watchdog infrastructure
>   drm/i915: Fail too long user submissions by default
>   drm/i915: Allow configuring default request expiry via modparam
>
>  drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
>  .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
>  drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
>  .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
>  .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
>  .../drm/i915/gt/intel_execlists_submission.h  |   2 +
>  drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
>  drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
>  drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
>  drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
>  drivers/gpu/drm/i915/i915_params.c            |   5 +
>  drivers/gpu/drm/i915/i915_params.h            |   1 +
>  drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
>  drivers/gpu/drm/i915/i915_request.h           |  16 +-
>  drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
>  17 files changed, 479 insertions(+), 45 deletions(-)
>
> --
> 2.27.0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch