Re: [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset

On Thu, Jan 12, 2023 at 04:35:33PM -0600, Bjorn Helgaas wrote:
> On Sat, Dec 31, 2022 at 07:33:39PM +0100, Lukas Wunner wrote:
> > We're calling pci_bridge_wait_for_secondary_bus() after performing a
> > Secondary Bus Reset, but neglect to do the same after coming out of a
> > DPC-induced Hot Reset.  As a result, we're not observing the delays
> > prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
> > secondary bus before they're ready.  Fix it.
> > 
> > Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@xxxxxxxxx>
> 
> I assume this patch is the one that makes the difference for the
> Intel Ponte Vecchio HPC GPU?

Right.


> Is there a URL to a problem report, or
> at least a sentence or two we can include here to connect the patch
> with the problem users may see?

There's no public problem report.  My understanding is that Ponte Vecchio
was formally launched this Tuesday and mass distribution starts only now:

https://www.tomshardware.com/news/intel-launches-sapphire-rapids-fourth-gen-xeon-cpus-and-ponte-vecchio-max-gpu-series

The idea is to get the issue in the kernel fixed early so that users will
never even see it.


> Most people won't know how to
> recognize accesses to devices on the secondary bus before they're
> ready.

With Ponte Vecchio, the GPU is located below a PCIe switch and the
Downstream Port Containment happens at the Root Port.  So the Root
Port needs to wait for the Switch Upstream Port to re-appear.

Because config space is currently restored too early on the Switch
Upstream Port, it remains in D0uninitialized once it comes out of
reset, so all its registers, in particular the bridge windows,
are in power-on reset state.  As a result, anything downstream of it
(including the GPU) remains inaccessible and the user-visible
error messages look like this:

i915 0000:8c:00.0: can't change power state from D3cold to D0 (config space inaccessible)
intel_vsec 0000:8e:00.1: can't change power state from D3cold to D0 (config space inaccessible)

Here intel_vsec is a sibling of the GPU which, I believe, is used for
telemetry.
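
For anyone who wants to recognize the symptom from userspace, here is a
rough sketch (not part of the patch, and no substitute for the in-kernel
wait done by pci_bridge_wait_for_secondary_bus()).  It assumes that an
inaccessible device shows up as an all-ones Vendor ID, or a failing read,
on its sysfs config file.  The helper names and the timeout values are
made up for illustration.

```python
import time
from typing import Callable

ALL_ONES_VENDOR_ID = 0xFFFF  # read back when the device does not respond


def vendor_id(config: bytes) -> int:
    """Return the 16-bit Vendor ID held in the first two config space bytes."""
    if len(config) < 2:
        raise ValueError("need at least 2 bytes of config space")
    return int.from_bytes(config[:2], "little")


def config_accessible(config: bytes) -> bool:
    """True if the Vendor ID is anything other than all-ones."""
    return vendor_id(config) != ALL_ONES_VENDOR_ID


def wait_for_device(
    read_config: Callable[[], bytes],
    timeout_s: float = 1.0,
    interval_s: float = 0.01,
) -> bool:
    """Poll read_config() until the device answers or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if config_accessible(read_config()):
                return True
        except (OSError, ValueError):
            # Device node missing or short read: treat as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def sysfs_reader(bdf: str) -> Callable[[], bytes]:
    """Build a reader for /sys/bus/pci/devices/<bdf>/config (hypothetical helper)."""
    path = f"/sys/bus/pci/devices/{bdf}/config"

    def read() -> bytes:
        with open(path, "rb") as f:
            return f.read(2)

    return read
```

Something like wait_for_device(sysfs_reader("0000:8c:00.0"), timeout_s=5.0)
would then report whether the GPU answers config reads at all.  It only
observes the symptom; the actual fix has to delay the config space
restore in the kernel.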

I'll be sure to include that additional information in the commit
message when respinning.

Thanks,

Lukas


