Re: Regression on drm-tip

Lucas De Marchi <lucas.demarchi@xxxxxxxxx> · Sat, 22 Mar 2025 15:59:44 -0500

On Mon, Mar 17, 2025 at 12:04:40PM +0800, Baolu Lu wrote:
On 3/16/25 18:01, Borah, Chaitanya Kumar wrote:

-----Original Message-----
From: Baolu Lu<baolu.lu@xxxxxxxxxxxxxxx>
Sent: Sunday, March 16, 2025 1:33 PM
To: Borah, Chaitanya Kumar<chaitanya.kumar.borah@xxxxxxxxx>
Cc:intel-gfx@xxxxxxxxxxxxxxxxxxxxx;intel-xe@xxxxxxxxxxxxxxxxxxxxx;
iommu@xxxxxxxxxxxxxxx; Kurmi, Suresh Kumar
<suresh.kumar.kurmi@xxxxxxxxx>; Saarinen, Jani<jani.saarinen@xxxxxxxxx>;
De Marchi, Lucas<lucas.demarchi@xxxxxxxxx>
Subject: Re: Regression on drm-tip

On 3/16/25 15:27, Borah, Chaitanya Kumar wrote:
-----Original Message-----
From: Baolu Lu<baolu.lu@xxxxxxxxxxxxxxx>
Sent: Sunday, March 16, 2025 8:04 AM
To: Borah, Chaitanya Kumar<chaitanya.kumar.borah@xxxxxxxxx>
Cc:intel-gfx@xxxxxxxxxxxxxxxxxxxxx;intel-xe@xxxxxxxxxxxxxxxxxxxxx;
iommu@xxxxxxxxxxxxxxx
Subject: Re: Regression on drm-tip

On 3/14/25 17:04, Borah, Chaitanya Kumar wrote:
-----Original Message-----
From: Baolu Lu<baolu.lu@xxxxxxxxxxxxxxx>
Sent: Thursday, March 13, 2025 7:53 PM
To: Borah, Chaitanya Kumar<chaitanya.kumar.borah@xxxxxxxxx>
Cc:baolu.lu@xxxxxxxxxxxxxxx;intel-gfx@xxxxxxxxxxxxxxxxxxxxx;
intel-xe@xxxxxxxxxxxxxxxxxxxxx;iommu@xxxxxxxxxxxxxxx
Subject: Re: Regression on drm-tip

On 2025/3/13 16:51, Borah, Chaitanya Kumar wrote:
Hello Lu,

Hope you are doing well. I am Chaitanya from the Linux graphics team at
Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on
the drm-tip repository.
``````````````````````````````````````````````````````````````````
<4>[    2.856622] WARNING: possible circular locking dependency detected
<4>[    2.856631] 6.14.0-rc5-CI_DRM_16217-gc55ef90b69d3+ #1 Tainted: G          I
<4>[    2.856642] ------------------------------------------------------
<4>[    2.856650] swapper/0/1 is trying to acquire lock:
<4>[    2.856657] ffffffff8360ecc8 (iommu_probe_device_lock){+.+.}-{3:3}, at: iommu_probe_device+0x1d/0x70
<4>[    2.856679]
                     but task is already holding lock:
<4>[    2.856686] ffff888102ab6fa8 (&device->physical_node_lock){+.+.}-{3:3}, at: intel_iommu_init+0xea1/0x1220
``````````````````````````````````````````````````````````````````
The detailed log can be found in [2].

After bisecting the tree, the following patch [3] seems to be the first
"bad" commit:

``````````````````````````````````````````````````````````````````
commit b150654f74bf0df8e6a7936d5ec51400d9ec06d8
Author: Lu Baolu <baolu.lu@xxxxxxxxxxxxxxx>
Date:   Fri Feb 28 18:27:26 2025 +0800

    iommu/vt-d: Fix suspicious RCU usage
``````````````````````````````````````````````````````````````````

We also verified that if we revert the patch the issue is not seen.

Could you please check why the patch causes this regression and provide
a fix if necessary?

Can you please run a quick test to check whether the following fix works?

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index e540092d664d..06debeaec643 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -2051,8 +2051,13 @@ int enable_drhd_fault_handling(unsigned int cpu)
                   if (iommu->irq || iommu->node != cpu_to_node(cpu))
                           continue;

+               /*
+                * Call dmar_alloc_hwirq() with dmar_global_lock held,
+                * could cause possible lock race condition.
+                */
+               up_read(&dmar_global_lock);
                   ret = dmar_set_interrupt(iommu);
-
+               down_read(&dmar_global_lock);
                   if (ret) {
                           pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
                                  (unsigned long long)drhd->reg_base_addr, ret);

Thanks,
baolu

We still see the issue with this change.

I am attempting to reproduce this issue with my MTL machine. I pulled
the test branch from:

https://anongit.freedesktop.org/git/drm-tip.git

and built the test kernel image using the configuration file from:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_16217/kconfig.txt

But I did not observe the lockdep splat mentioned above after booting.

Is there anything I might have missed?

+Suresh, Jani, Lucas

We are seeing this only on the Skylake and Kaby Lake machines in our CI runs.

If so, will the change below make any difference?

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 85aa66ef4d61..ec2f385ae25b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3049,6 +3049,7 @@ static int __init probe_acpi_namespace_devices(void)
                         if (dev->bus != &acpi_bus_type)
                                 continue;

+                       up_read(&dmar_global_lock);
                         adev = to_acpi_device(dev);
                         mutex_lock(&adev->physical_node_lock);
                         list_for_each_entry(pn,
@@ -3058,6 +3059,7 @@ static int __init probe_acpi_namespace_devices(void)
                                         break;
                         }
                         mutex_unlock(&adev->physical_node_lock);
+                       down_read(&dmar_global_lock);

                         if (ret)
                                 return ret;

Thank you for the change. This seems to be working. Can we expect a fix patch soon?

Sure. I have posted a fix patch here,

https://lore.kernel.org/linux-iommu/20250317035714.1041549-1-baolu.lu@xxxxxxxxxxxxxxx/

Thanks. FWIW I added this patch to our test branch in CI and the issue
is indeed not reproducing anymore.

Lucas De Marchi