Re: [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

John Harrison <john.c.harrison@xxxxxxxxx> · Tue, 22 Aug 2023 11:53:24 -0700

On 8/11/2023 11:20, Zhanjun Dong wrote:

    
      This attempts to avoid circular locking dependency between flush delayed
work and intel_gt_reset.
When intel_gt_reset was called, task will hold a lock.
To cacel delayed work here, the _sync version will also acquire a lock,
which might trigger the possible cirular locking dependency warning.
When intel_gt_reset called, reset_in_progress flag will be set, add code
to check the flag, call async verion if reset is in progress.

Signed-off-by: Zhanjun Dong <zhanjun.dong@xxxxxxxxx>
Cc: John Harrison <John.C.Harrison@xxxxxxxxx>
Cc: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
Cc: Daniel Vetter <daniel@xxxxxxxx>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d2..600388c849f7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
 
 static void guc_cancel_busyness_worker(struct intel_guc *guc)
 {
-	cancel_delayed_work_sync(&guc->timestamp.work);
+	/*
+	 * When intel_gt_reset was called, task will hold a lock.
+	 * To cacel delayed work here, the _sync version will also acquire a lock, which might
+	 * trigger the possible cirular locking dependency warning.
+	 * Check the reset_in_progress flag, call async verion if reset is in progress.
+	 */
    
    This needs to explain in much more detail what is going on and why
    it is not a problem. E.g.:

    The busyness worker needs to be cancelled. In general
      that means using the synchronous cancel version to ensure that an
      in-progress worker will not keep executing beyond whatever is
      happening that needs the cancel. E.g. suspend, driver unload, etc.
      However, in the case of a reset, the synchronous version is not
      required and can trigger a false deadlock detection warning.

      

      The business worker takes the reset mutex to protect against
      resets interfering with it. However, it does a trylock and bails
      out if the reset lock is already acquired. Thus there is no actual
      deadlock or other concern with the worker running concurrently
      with a reset. So an asynchronous cancel is safe in the case of a
      reset rather than a driver unload or suspend type operation. On
      the other hand, if the cancel_sync version is used when a reset is
      in progress then the mutex deadlock detection sees the mutex being
      acquired through multiple paths and complains.

      

      So just don't bother. That keeps the detection code happy and is
      safe because of the trylock code described above.

    
    

    John.

    

    

    
      +	if (guc_to_gt(guc)->uc.reset_in_progress)
+		cancel_delayed_work(&guc->timestamp.work);
+	else
+		cancel_delayed_work_sync(&guc->timestamp.work);
 }
 
 static void __reset_guc_busyness_stats(struct intel_guc *guc)