Hi Zhanjun, On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote: > On 8/11/2023 11:20, Zhanjun Dong wrote: > > This attempts to avoid circular locking dependency between flush delayed > work and intel_gt_reset. > When intel_gt_reset was called, task will hold a lock. > To cacel delayed work here, the _sync version will also acquire a lock, > which might trigger the possible cirular locking dependency warning. > When intel_gt_reset called, reset_in_progress flag will be set, add code > to check the flag, call async verion if reset is in progress. I liked the previous commit, it just needed to be wrapped (not in the dmesg copy-paste part). > Signed-off-by: Zhanjun Dong <zhanjun.dong@xxxxxxxxx> > Cc: John Harrison <John.C.Harrison@xxxxxxxxx> > Cc: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx> > Cc: Daniel Vetter <daniel@xxxxxxxx> > --- > drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++++++++++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > index a0e3ef1c65d2..600388c849f7 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc) > > static void guc_cancel_busyness_worker(struct intel_guc *guc) > { > - cancel_delayed_work_sync(&guc->timestamp.work); > + /* > + * When intel_gt_reset was called, task will hold a lock. > + * To cacel delayed work here, the _sync version will also acquire a lock, which might > + * trigger the possible cirular locking dependency warning. > + * Check the reset_in_progress flag, call async verion if reset is in progress. > + */ Indeed the commit message is a bit misleading and it raises some alarms if explained it this way. > This needs to explain in much more detail what is going on and why it is not a > problem. E.g.: > > The busyness worker needs to be cancelled. In general that means using the > synchronous cancel version to ensure that an in-progress worker will not > keep executing beyond whatever is happening that needs the cancel. E.g. > suspend, driver unload, etc. However, in the case of a reset, the > synchronous version is not required and can trigger a false deadlock > detection warning. > > The business worker takes the reset mutex to protect against resets > interfering with it. However, it does a trylock and bails out if the reset > lock is already acquired. Thus there is no actual deadlock or other concern > with the worker running concurrently with a reset. So an asynchronous > cancel is safe in the case of a reset rather than a driver unload or > suspend type operation. On the other hand, if the cancel_sync version is > used when a reset is in progress then the mutex deadlock detection sees the > mutex being acquired through multiple paths and complains. > > So just don't bother. That keeps the detection code happy and is safe > because of the trylock code described above. Can you please update the commit message with John's suggestion? Is there any further question on this? Andi