Hi Lucas,
I figue out the reason of pci_bridge_secondary_bus_reset() why not work
for NVIDIA GPU T4
which bind vfio passthrough hypervisor. I used the original func
pci_bridge_secondary_bus_reset()
not your patch, your patch remove bridge_d3 flag, the real reason is
bridge_d3 flag.
However, pci_bridge_wait_for_secondary_bus() bails out if the bridge_d3
flag is not set. That flag indicates whether a bridge is allowed to
suspend to D3cold at *runtime*.
drivers/pci/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index fba95486caaf..f43f3e84f634 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4964,7 +4964,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
if (pci_dev_is_disconnected(dev))
return;
- if (!pci_is_bridge(dev) || !dev->bridge_d3)
+ if (!pci_is_bridge(dev))
return;
down_read(&pci_bus_sem);
When I test the original func pci_bridge_secondary_bus_reset() in
different machine node
which all consist of the same type NVIDIA GPU T4, I found
pci_bridge_wait_for_secondary_bus()
bails out if the bridge_d3 flag is not set, but I still confused why
same gpu some machine node not set
the bridge_d3 flag.
I find the linux kernel only two func will init bridge_d3 which is func
pci_pm_init() and pci_bridge_d3_update().
If you know, please give me some hint.
On 2023/1/15 16:20, Lukas Wunner wrote:
If a PCI bridge is suspended to D3cold upon entering system sleep,
resuming it entails a Fundamental Reset per PCIe r6.0 sec 5.8.
The delay prescribed after a Fundamental Reset in PCIe r6.0 sec 6.6.1
is sought to be observed by:
pci_pm_resume_noirq()
pci_pm_bridge_power_up_actions()
pci_bridge_wait_for_secondary_bus()
However, pci_bridge_wait_for_secondary_bus() bails out if the bridge_d3
flag is not set. That flag indicates whether a bridge is allowed to
suspend to D3cold at *runtime*.
Hence *no* delay is observed on resume from system sleep if runtime
D3cold is forbidden. That doesn't make any sense, so drop the bridge_d3
check from pci_bridge_wait_for_secondary_bus().
The purpose of the bridge_d3 check was probably to avoid delays if a
bridge remained in D0 during suspend. However the sole caller of
pci_bridge_wait_for_secondary_bus(), pci_pm_bridge_power_up_actions(),
is only invoked if the previous power state was D3cold. Hence the
additional bridge_d3 check seems superfluous.
Fixes: ad9001f2f411 ("PCI/PM: Add missing link delays required by the PCIe spec")
Tested-by: Ravi Kishore Koppuravuri<ravi.kishore.koppuravuri@xxxxxxxxx>
Signed-off-by: Lukas Wunner<lukas@xxxxxxxxx>
Reviewed-by: Mika Westerberg<mika.westerberg@xxxxxxxxxxxxxxx>
Reviewed-by: Kuppuswamy Sathyanarayanan<sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
Cc:stable@xxxxxxxxxxxxxxx # v5.5+
---
Changes v1 -> v2:
* Add Reviewed-by tags (Mika, Sathyanarayanan)
drivers/pci/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index fba95486caaf..f43f3e84f634 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4964,7 +4964,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
if (pci_dev_is_disconnected(dev))
return;
- if (!pci_is_bridge(dev) || !dev->bridge_d3)
+ if (!pci_is_bridge(dev))
return;
down_read(&pci_bus_sem);
Below is reply for your another email about the device invoke secondary
bus reset.
Yes, as your previous email describes the endpoint can not reset its
secondary bus.
Normally pci_bridge_secondary_bus_reset() should never be executed for
an endpoint device such as the Nvidia GPU T4. It should only be executed
for one of the bridges above the GPU. An endpoint cannot reset its
secondary bus, it doesn't have one.
As previous email I give the pci topology for the NVIDIA GPU T4, here I
copy as below,
you say that means:
0000:17:00.0 = Root Port
0000:18:00.0 = Switch Upstream Port of PLX 9797
0000:19:04.0 = Switch Downstream Port of PLX 9797
0000:19:08.0 = Switch Downstream Port of PLX 9797
0000:4f:00.0 = Nvidia GPU T4
I apply your debug patch, the pci_info print the device is 0000:19:08.0
which is the parent of
the endpoint device 0000:4f:00.0.
pci_parent_bus_reset() invokes pci_bridge_secondary_bus_reset() on the
parent of the device to be reset. In this case that's the Switch Downstream
Port 0000:19:08.0. It finds the parent by following dev->bus->self.
Perhaps you can apply the small debug patch below. It should print a message
when entering pci_bridge_secondary_bus_reset() and the message contains the
pci_name() of the device for which the function is executed. If this is
not the Switch Downstream Port, you've found a bug.
Usually a better way of finding the parent device is to call
pci_upstream_bridge() instead of following dev->bus->self. As you can
see in the definition of pci_upstream_bridge() in include/linux/pci.h,
this will find the parent of pci_phys_fn(dev). So in virtualization
scenarios, the result may indeed be different from dev->bus->self,
but I still don't understand how. You may want to try replacing
dev->bus->self with pci_upstream_bridge(dev) in pci_parent_bus_reset()
and see if that fixes the issue for you.
Thanks,
Lukas
-- >8 --
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 19fe0ef0e583..f383e5d29bb1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5073,6 +5073,9 @@ void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)
*/
int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
{
+ pci_info(dev, "%s\n", __func__);
+ dump_stack();
+
pcibios_reset_secondary_bus(dev);
return pci_bridge_wait_for_secondary_bus(dev, "bus reset",
Thanks,
Yang