Re: [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10/4/2023 13:09, Andi Shyti wrote:
Hi John,

The MCR steering semaphore is a shared lock entry between i915
and various firmware components.

Getting the lock might sinchronize on some shared resources.
Sometimes though, it might happen that the firmware forgets to
unlock causing unnecessary noise in the driver which keeps doing
what was supposed to do, ignoring the problem.

Do not consider this failure as an error, but just print a debug
message stating that the MCR locking has been skipped.

On the driver side we still have spinlocks that make sure that
the access to the resources is serialized.

Signed-off-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
Cc: Jonathan Cavitt <jonathan.cavitt@xxxxxxxxx>
Cc: Matt Roper <matthew.d.roper@xxxxxxxxx>
Cc: Nirmoy Das <nirmoy.das@xxxxxxxxx>
---
     drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++----
     1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index 326c2ed1d99b..51eb693df39b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags)
     	 * would indicate some hardware/firmware is misbehaving and not
     	 * releasing it properly.
     	 */
-	if (err == -ETIMEDOUT) {
-		gt_err_ratelimited(gt, "hardware MCR steering semaphore timed out");
-		add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now unreliable */
-	}
+	if (err == -ETIMEDOUT)
+		gt_dbg(gt, "hardware MCR steering semaphore timed out");
     }
     /**
Are we sure this does not warrant a level higher than dbg, such as
notice/warn?
We might make it a warn, but this doesn't change much the economy
of the driver as we will keep doing what we were supposed to do.

Because how can we be sure the two entities will not stomp on
each other toes if we failed to obtain lock?
So far, in all the research I've done, no one looks like using
MCR lock, but yet someone is stuck in it.
If someone has the lock then that someone thinks they are using it. Just
because you can't see what someone piece of IFWI is doing doesn't mean it
isn't doing it. And if it is a genuinely missing unlock then it needs to be
tracked down and fixed with an IFWI update otherwise the system is going to
be unstable from that point on.
But I'm not changing here the behavior of the driver. The driver
will keep doing what was doing before.
And what it is doing is dangerous and can lead to unpredictable results
because a critical section resource access is no longer wrapped with a
critical section lock. Hence there is an error message to say Here Be
Dragons.
hehe!

What are you suggesting, then? Should we reset the GT after
hitting the MCR lock issue?
No. I'm suggesting that you don't hide the issue by removing the error message. It is there for a reason. Just because that reason is being hit doesn't mean you should remove the message.


We could, but I rather prefer to work with what's available.

Because this most probably won't be noticed by the user, then I
don't see why it should shout out loud that the system is
unusable when most probably it is.
Just because a race condition is hard to hit doesn't mean it won't be hit.
We are hitting it, constantly, but it's not producing any effect.
Even when raising the MCR wait timeout. Which means that most
probably someone is having a nap on the lock.
No. You are hitting the lock contention problem constantly and so far are not seeing any *visible* effect.

The point is that there is still a potential race condition which you haven't hit yet which could lead to data corruption, page faults, crashes, etc. (because a TLB invalidation access went to the wrong target) or the CPU/GPU melting itself of the board (because a power management access went to the wrong target).


The point of shouting out loud is that we know for a fact a problem has
occurred. That problem might lead to nothing or it might lead to subtle data
corruption, gross crashes or anything in between.
yes, agree... the message still stays, though, with a bit of a
lower catastrophy.

BTW, at my understanding this is not an IFWI problem. We checked
with different version of IFWI and the issue is the same.
Which implies it is a driver bug after all? In which case you absolutely
should not be papering over it in the driver.
This section is serialized by a spinlock and besides I haven't
found any place else except for the TLB invalidation and the
resume where we can incur a locking situation.
There is a bug somewhere. Either it is IFWI or it is KMD. You can't say "I can't find a problem therefore there is no problem".

John.



Thanks a lot for your inputs and discussion!
Andi




[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux