Re: [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

John Harrison <john.c.harrison@xxxxxxxxx> · Wed, 6 Sep 2023 11:49:37 -0700

On 9/6/2023 02:17, Andi Shyti wrote:
Hi John,

     static void guc_cancel_busyness_worker(struct intel_guc *guc)
     {
-	cancel_delayed_work_sync(&guc->timestamp.work);
+	/*
+	 * When intel_gt_reset was called, task will hold a lock.
+	 * To cancel delayed work here, the _sync version will also acquire a lock, which might
+	 * trigger the possible circular locking dependency warning.
+	 * Check the reset_in_progress flag, call async version if reset is in progress.
+	 */
This needs to explain in much more detail what is going on and why it is not
a problem. E.g.:

      The busyness worker needs to be cancelled. In general that means
      using the synchronous cancel version to ensure that an in-progress
      worker will not keep executing beyond whatever is happening that
      needs the cancel. E.g. suspend, driver unload, etc. However, in the
      case of a reset, the synchronous version is not required and can
      trigger a false deadlock detection warning.

      The busyness worker takes the reset mutex to protect against resets
      interfering with it. However, it does a trylock and bails out if the
      reset lock is already acquired. Thus there is no actual deadlock or
      other concern with the worker running concurrently with a reset. So
      an asynchronous cancel is safe in the case of a reset rather than a
      driver unload or suspend type operation. On the other hand, if the
      cancel_sync version is used when a reset is in progress then the
      mutex deadlock detection sees the mutex being acquired through
      multiple paths and complains.

      So just don't bother. That keeps the detection code happy and is
      safe because of the trylock code described above.
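
For illustration only, here is a minimal userspace sketch of the behaviour the
suggested comment describes. It is not the i915 code: reset_mutex,
busyness_worker(), busyness_updates and pick_cancel() are hypothetical
stand-ins, and a pthread mutex replaces the kernel mutex. It only models the
two points made above: the worker bails out when the reset lock is already
held, and the cancel variant is chosen by whether a reset is in progress.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the GT reset mutex (not the real i915 object). */
static pthread_mutex_t reset_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Counts how many times the modelled worker actually did its update. */
static int busyness_updates;

/*
 * Model of the busyness worker body: trylock the reset mutex and bail out
 * if it is already held. Returns true only if the update ran.
 */
static bool busyness_worker(void)
{
	if (pthread_mutex_trylock(&reset_mutex) != 0)
		return false;	/* reset in progress: skip, never block */

	busyness_updates++;	/* placeholder for the real stats update */
	pthread_mutex_unlock(&reset_mutex);
	return true;
}

/* Which cancel variant the caller would use (names are illustrative). */
enum cancel_kind { CANCEL_ASYNC, CANCEL_SYNC };

/*
 * During a reset the worker cannot block on the reset lock, so the
 * asynchronous cancel is enough. Otherwise (suspend, unload, ...) the
 * synchronous cancel is needed so the worker has finished on return.
 */
static enum cancel_kind pick_cancel(bool reset_in_progress)
{
	return reset_in_progress ? CANCEL_ASYNC : CANCEL_SYNC;
}
```

With reset_mutex held by the caller, busyness_worker() returns false without
blocking, which is why there is no real deadlock in the reset path. In the
driver the two outcomes of pick_cancel() would presumably map to
cancel_delayed_work() and cancel_delayed_work_sync() respectively.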
So why do we even need to cancel anything if it doesn't do anything while
the reset is in progress?
It still needs to be cancelled. The worker only aborts if it is actively
executing concurrently with the reset. It might not start to execute until
after the reset has completed. And there is presumably a reason why the
cancel is being called, a reason not necessarily related to resets at all.
Leaving the worker to run arbitrarily after the driver is expecting it to be
stopped will lead to much worse things than a fake lockdep splat, e.g. a use
after free pointer deref.
I was actually thinking why not leave things as they are and just
disable lockdep from CI. This doesn't look like a relevant report
to me.

Andi
Disable lockdep? The whole of lockdep? We absolutely do not want to disable
an extremely important deadlock testing infrastructure in our test
framework. That would be defeating the whole point of CI.

Potentially we could annotate this one particular scenario to suppress this
one particular error.  But it seems simpler and safer to just update the
code to not hit that scenario in the first place.
yes... lockdep is a debug tool and might provide false reports...
We need to have a great willingness to start fixing and hunting
debug lockdep's false positives (like this one, for instance).
That is how lockdep works. It's like a compiler warning. You have to fix 
them even if you think they don't matter. Because otherwise, when 
someone tries to turn warnings on, they drown in a sea of other people's 
unrelated garbage that they did not bother to fix. If lockdep is to be 
of any use at all then it must be run regularly as part of a CI type 
system and any issues it finds must be fixed up by the developers who
own the relevant code. Where fixing means either fixing genuine bugs, 
re-working the code to not hit a false positive or annotating the code 
to explain to lockdep why it is a safe operation.

It's even more annoying to reduce our CI pass rates, especially
when in BAT tests, with such false deadlocks.
Maybe. But it is even more annoying when you have a genuine locking 
issue that you don't notice because you have disabled lockdep and just 
have some random hang issue that is impossible to reproduce or debug.

It's the developer's responsibility to test their code with
debug_lockdep and fix all the potential deadlocks and ignore the
false ones.
You seem to have this backwards. Developers are not expected to run 
every possible test on every possible platform in every possible 
configuration. That is the job of CI.

John.

I sent a patch for this[*] already.

Andi

[*] https://gitlab.freedesktop.org/gfx-ci/i915-infra/-/merge_requests/128