Re: [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

John Harrison <john.c.harrison@xxxxxxxxx> · Mon, 28 Aug 2023 16:01:38 -0700

On 8/23/2023 10:37, John Harrison wrote:
On 8/23/2023 09:00, Daniel Vetter wrote:
On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote:
On 8/11/2023 11:20, Zhanjun Dong wrote:
This attempts to avoid a circular locking dependency between flushing
delayed work and intel_gt_reset.
When intel_gt_reset is called, the task holds a lock.
To cancel the delayed work here, the _sync version also acquires a lock,
which might trigger the possible circular locking dependency warning.
When intel_gt_reset is called, the reset_in_progress flag is set, so add
code to check the flag and call the async version if a reset is in progress.

Signed-off-by: Zhanjun Dong<zhanjun.dong@xxxxxxxxx>
Cc: John Harrison<John.C.Harrison@xxxxxxxxx>
Cc: Andi Shyti<andi.shyti@xxxxxxxxxxxxxxx>
Cc: Daniel Vetter<daniel@xxxxxxxx>
---
   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++++++++++-
   1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index a0e3ef1c65d2..600388c849f7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
   static void guc_cancel_busyness_worker(struct intel_guc *guc)
   {
-    cancel_delayed_work_sync(&guc->timestamp.work);
+    /*
+     * When intel_gt_reset was called, task will hold a lock.
+     * To cancel delayed work here, the _sync version will also acquire a lock, which might
+     * trigger the possible circular locking dependency warning.
+     * Check the reset_in_progress flag, call async version if reset is in progress.
+     */
This needs to explain in much more detail what is going on and why it is not
a problem. E.g.:

    The busyness worker needs to be cancelled. In general that means
    using the synchronous cancel version to ensure that an in-progress
    worker will not keep executing beyond whatever is happening that
    needs the cancel. E.g. suspend, driver unload, etc. However, in the
    case of a reset, the synchronous version is not required and can
    trigger a false deadlock detection warning.

    The busyness worker takes the reset mutex to protect against resets
    interfering with it. However, it does a trylock and bails out if the
    reset lock is already acquired. Thus there is no actual deadlock or
    other concern with the worker running concurrently with a reset. So
    an asynchronous cancel is safe in the case of a reset rather than a
    driver unload or suspend type operation. On the other hand, if the
    cancel_sync version is used when a reset is in progress then the
    mutex deadlock detection sees the mutex being acquired through
    multiple paths and complains.

    So just don't bother. That keeps the detection code happy and is
    safe because of the trylock code described above.
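
As a rough illustration of the trylock behaviour described above, here is a
minimal single-threaded userspace sketch. It is not the i915 code: the
atomic_flag merely stands in for the reset mutex, and reset_trylock(),
reset_unlock(), busyness_worker_once() and worker_runs_during_reset() are
hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reset mutex (not the i915 structure). */
static atomic_flag reset_lock = ATOMIC_FLAG_INIT;
static int samples_taken;

/* Returns true if the lock was free and is now held by the caller. */
static bool reset_trylock(void)
{
	return !atomic_flag_test_and_set(&reset_lock);
}

static void reset_unlock(void)
{
	atomic_flag_clear(&reset_lock);
}

/*
 * Worker body: try to take the reset lock and bail out if it is already
 * held. Returns true only if a sample was actually taken.
 */
static bool busyness_worker_once(void)
{
	if (!reset_trylock())
		return false;	/* reset in progress: skip, never block */

	samples_taken++;
	reset_unlock();
	return true;
}

/* Model a reset that holds the lock while the worker happens to run. */
static bool worker_runs_during_reset(void)
{
	bool locked = reset_trylock();
	bool ran;

	assert(locked);	/* single-threaded sketch: the lock is free here */
	(void)locked;

	ran = busyness_worker_once();
	reset_unlock();
	return ran;
}
```

In this model the worker never waits on the lock, so running it while the
"reset" holds the lock just skips one sample instead of blocking.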
So why do we even need to cancel anything if it doesn't do anything while
the reset is in progress?
It still needs to be cancelled. The worker only aborts if it is 
actively executing concurrently with the reset. It might not start to 
execute until after the reset has completed. And there is presumably a 
reason why the cancel is being called, a reason not necessarily 
related to resets at all. Leaving the worker to run arbitrarily after 
the driver is expecting it to be stopped will lead to much worse 
things than a fake lockdep splat, e.g. a use after free pointer deref.
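
To make the point about a worker that has not started yet more concrete, here
is a toy single-threaded model of a delayed work item. It is only a sketch
under stated assumptions: struct toy_delayed_work and the toy_*() helpers are
hypothetical and are not the kernel workqueue implementation. The only
behaviour borrowed from the real API is that an asynchronous cancel drops a
pending item and reports whether it did so.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a delayed work item (not the kernel structure). */
struct toy_delayed_work {
	bool pending;		/* queued, timer not yet fired */
	void (*fn)(void *);
	void *data;
};

static int worker_runs;

static void toy_worker(void *data)
{
	(void)data;
	worker_runs++;
}

static void toy_queue(struct toy_delayed_work *w)
{
	w->pending = true;
}

/* Asynchronous cancel: drop a pending item, return true if one was dropped. */
static bool toy_cancel(struct toy_delayed_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/* Timer expiry: the worker only runs if the item is still pending. */
static void toy_timer_fires(struct toy_delayed_work *w)
{
	assert(w->fn != NULL);
	if (!w->pending)
		return;

	w->pending = false;
	w->fn(w->data);
}
```

Without the toy_cancel() call, a later toy_timer_fires() would still run the
worker, which is the case the trylock alone does not cover.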

John.
@Daniel Vetter - ping? Is this explanation sufficient? Are you okay with 
this change now?

John.



Just remove the cancel from the reset path as unneeded instead, and explain
why that's OK? Because that's de facto what replacing cancel_work_sync with
cancel_work in a potential deadlock scenario does: you either don't
need it at all, or the replacement creates a bug.
-Daniel


John.


+    if (guc_to_gt(guc)->uc.reset_in_progress)
+        cancel_delayed_work(&guc->timestamp.work);
+    else
+        cancel_delayed_work_sync(&guc->timestamp.work);
   }
   static void __reset_guc_busyness_stats(struct intel_guc *guc)