Re: [Intel-gfx] [PATCH 13/20] drm/i915/guc: Relax CTB response timeout

Daniel Vetter <daniel@xxxxxxxx> · Wed, 9 Jun 2021 15:24:00 +0200



On Fri, Jun 04, 2021 at 11:35:39AM -0700, Matthew Brost wrote:
> On Fri, Jun 04, 2021 at 10:33:07AM +0200, Daniel Vetter wrote:
> > On Wed, Jun 02, 2021 at 10:16:23PM -0700, Matthew Brost wrote:
> > > From: Michal Wajdeczko <michal.wajdeczko@xxxxxxxxx>
> > > 
> > > In an upcoming patch we will allow more CTB requests to be sent in
> > > parallel to the GuC for processing, so we shouldn't assume any more
> > > that the GuC will always reply within 10ms.
> > > 
> > > Use a bigger value from CONFIG_DRM_I915_GUC_CTB_TIMEOUT instead.
> > > 
> > > v2: Add CONFIG_DRM_I915_GUC_CTB_TIMEOUT config option
> > > 
> > > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@xxxxxxxxx>
> > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > Reviewed-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > 
> > So this is a rant, but for upstream we really need to do better than
> > internal:
> > 
> > - The driver must work by default in the optimal configuration.
> > 
> > - Any config change that we haven't validated _must_ taint the kernel
> >   (this is especially for module options, but also for config settings)
> > 
> > - Config options need a real reason beyond "was useful for bring-up".
> > 
> > Our internal tree is an absolute disaster right now, with multi-line
> > kernel configs (different on each platform) and bespoke kernel config or
> > the driver just fails. We're the experts on our own hw, we should know how
> > it works, not offload that to users essentially asking them "how shitty do
> > you think Intel hw is in responding timely".
> > 
> > Yes I know there's a lot of these there already, they don't make a lot of
> > sense either.
> > 
> > Unless there's a real reason for this (aside from us just offloading
> > testing to our users instead of doing it ourselves properly) I think we
> > should hardcode this, with a comment explaining why. Maybe with a switch
> > between the PF/VF case once that's landed.
> > 
> > > ---
> > >  drivers/gpu/drm/i915/Kconfig.profile      | 10 ++++++++++
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c |  5 ++++-
> > >  2 files changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > > index 39328567c200..0d5475b5f28a 100644
> > > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > > @@ -38,6 +38,16 @@ config DRM_I915_USERFAULT_AUTOSUSPEND
> > >  	  May be 0 to disable the extra delay and solely use the device level
> > >  	  runtime pm autosuspend delay tunable.
> > >  
> > > +config DRM_I915_GUC_CTB_TIMEOUT
> > > +	int "How long to wait for the GuC to make forward progress on CTBs (ms)"
> > > +	default 1500 # milliseconds
> > > +	range 10 60000
> > 
> > Also range is definitely off, drm/scheduler will probably nuke you
> > beforehand :-)
> > 
> > That's kinda another issue I have with all these kconfig knobs: Maybe we
> > need a knob for "relax with reset attempts, my workloads overload my gpus
> > routinely", which then scales _all_ timeouts proportionally. But letting
> > the user set them all, with silly combinations like resetting the
> > workload before heartbeat or stuff like that doesn't make much sense.
> >
> 
> Yes, with the code as is the user could do some wacky stuff that doesn't make
> sense at all.
>  
> > Anyway, tiny patch so hopefully I can leave this one out for now until
> > we've closed this.
> 
> No issue leaving this out as blocking CTBs are never really used anyways
> until SRIOV aside from setup / debugging. That being said, we might
> still want a higher hardcoded value in the meantime, perhaps around a
> second. I can follow up on that if needed.

Yeah, just a patch with an updated hardcoded value sounds good to me.
-Daniel
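
For reference, a minimal userspace sketch of what "hardcode it, with a
comment explaining why" could look like for the two-stage wait in the
quoted patch (brief busy-poll, then a sleeping poll). This is not the i915
code: wait_two_stage(), wait_for_reply(), the callbacks and the 1000 ms
value are all hypothetical names and numbers, the latter chosen only
because "around a second" was floated in this thread. The 10 us spin
matches the wait_for_us(done, 10) call in the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical hardcoded values (assumptions, not taken from the driver):
 * replies usually arrive fast, so spin briefly first; with more requests
 * in flight a reply may take much longer, so allow about a second after.
 */
#define CTB_SPIN_TIMEOUT_US   10
#define CTB_SLEEP_TIMEOUT_MS  1000

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Returns 0 once done(arg) is true, -1 if both phases time out. */
static int wait_two_stage(bool (*done)(void *), void *arg,
			  uint32_t spin_us, uint32_t sleep_ms)
{
	const struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000 };
	uint64_t deadline;

	assert(done);

	/* Phase 1: busy-poll for the common fast reply. */
	deadline = now_ns() + (uint64_t)spin_us * 1000ull;
	do {
		if (done(arg))
			return 0;
	} while (now_ns() < deadline);

	/* Phase 2: sleep 1 ms between polls until the longer deadline. */
	deadline = now_ns() + (uint64_t)sleep_ms * 1000000ull;
	do {
		if (done(arg))
			return 0;
		nanosleep(&nap, NULL);
	} while (now_ns() < deadline);

	return done(arg) ? 0 : -1;
}

/* Wrapper that bakes in the hardcoded values, no config knob involved. */
static int wait_for_reply(bool (*done)(void *), void *arg)
{
	return wait_two_stage(done, arg, CTB_SPIN_TIMEOUT_US,
			      CTB_SLEEP_TIMEOUT_MS);
}

/* Example completion checks, for illustration only. */
static bool flag_is_set(void *arg)
{
	return *(bool *)arg;
}

static bool countdown_done(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

Callers would only ever see wait_for_reply(); the explicit-timeout variant
exists here so the timeout path can be exercised quickly.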

> 
> Matt
> 
> > -Daniel
> > 
> > > +	help
> > > +	  Configures the default timeout waiting for GuC the to make forward
> > > +	  progress on CTBs. e.g. Waiting for a response to a requeset.
> > > +
> > > +	  A range of 10 ms to 60000 ms is allowed.
> > > +
> > >  config DRM_I915_HEARTBEAT_INTERVAL
> > >  	int "Interval between heartbeat pulses (ms)"
> > >  	default 2500 # milliseconds
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > index 916c2b80c841..cf1fb09ef766 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > @@ -436,6 +436,7 @@ static int ct_write(struct intel_guc_ct *ct,
> > >   */
> > >  static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> > >  {
> > > +	long timeout;
> > >  	int err;
> > >  
> > >  	/*
> > > @@ -443,10 +444,12 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> > >  	 * up to that length of time, then switch to a slower sleep-wait loop.
> > >  	 * No GuC command should ever take longer than 10ms.
> > >  	 */
> > > +	timeout = CONFIG_DRM_I915_GUC_CTB_TIMEOUT;
> > > +
> > >  #define done INTEL_GUC_MSG_IS_RESPONSE(READ_ONCE(req->status))
> > >  	err = wait_for_us(done, 10);
> > >  	if (err)
> > > -		err = wait_for(done, 10);
> > > +		err = wait_for(done, timeout);
> > >  #undef done
> > >  
> > >  	if (unlikely(err))
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

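
The "one knob that scales _all_ timeouts proportionally" idea quoted above
could be sketched roughly as follows. Everything here is an assumption for
illustration: struct gpu_timeouts, scale_timeouts(), timeouts_ordered(),
the 1..8 factor range and the reset value are hypothetical. Only the
2500 ms heartbeat default appears in the quoted Kconfig, and the 1000 ms
CTB value is the "around a second" figure from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bundle of timeouts that must stay mutually consistent. */
struct gpu_timeouts {
	uint32_t ctb_reply_ms;
	uint32_t heartbeat_ms;
	uint32_t reset_ms;
};

static const struct gpu_timeouts base_timeouts = {
	.ctb_reply_ms = 1000,	/* "around a second", per this thread */
	.heartbeat_ms = 2500,	/* DRM_I915_HEARTBEAT_INTERVAL default */
	.reset_ms     = 7500,	/* made-up value, illustration only */
};

/*
 * Assumed invariant: CTB replies time out before a heartbeat, and a
 * heartbeat fires before a reset. The second half is the "resetting the
 * workload before heartbeat" case called out as silly above.
 */
static bool timeouts_ordered(const struct gpu_timeouts *t)
{
	return t->ctb_reply_ms < t->heartbeat_ms &&
	       t->heartbeat_ms < t->reset_ms;
}

/* Scale every timeout by the same factor so the ordering is preserved. */
static struct gpu_timeouts scale_timeouts(const struct gpu_timeouts *base,
					  uint32_t factor)
{
	struct gpu_timeouts scaled;

	assert(factor >= 1 && factor <= 8);	/* hypothetical range */

	scaled.ctb_reply_ms = base->ctb_reply_ms * factor;
	scaled.heartbeat_ms = base->heartbeat_ms * factor;
	scaled.reset_ms     = base->reset_ms * factor;
	return scaled;
}
```

Because a single factor multiplies every field, a user can relax the
whole set but cannot produce an inconsistent combination.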