On Thu, Jan 12, 2023 at 04:35:33PM -0600, Bjorn Helgaas wrote:
> On Sat, Dec 31, 2022 at 07:33:39PM +0100, Lukas Wunner wrote:
> > We're calling pci_bridge_wait_for_secondary_bus() after performing a
> > Secondary Bus Reset, but neglect to do the same after coming out of a
> > DPC-induced Hot Reset.  As a result, we're not observing the delays
> > prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
> > secondary bus before they're ready.  Fix it.
> >
> > Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@xxxxxxxxx>
>
> I assume this patch is the one that makes the difference for the
> Intel Ponte Vecchio HPC GPU?

Right.

> Is there a URL to a problem report, or
> at least a sentence or two we can include here to connect the patch
> with the problem users may see?

There's no public problem report.  My understanding is that
Ponte Vecchio was formally launched this Tuesday and mass distribution
starts only now:

https://www.tomshardware.com/news/intel-launches-sapphire-rapids-fourth-gen-xeon-cpus-and-ponte-vecchio-max-gpu-series

The idea is to get the issue in the kernel fixed early so that users
will never even see it.

> Most people won't know how to
> recognize accesses to devices on the secondary bus before they're
> ready.

With Ponte Vecchio, the GPU is located below a PCIe switch and the
Downstream Port Containment happens at the Root Port.  So the Root Port
needs to wait for the Switch Upstream Port to re-appear.

Because config space is currently restored too early on the Switch
Upstream Port, it remains in D0uninitialized once it comes out of reset,
so all its registers, in particular the bridge windows, are in
power-on reset state.
As a result, anything downstream of it (including the GPU) remains
inaccessible and the user-visible error messages look like this:

  i915 0000:8c:00.0: can't change power state from D3cold to D0 (config space inaccessible)
  intel_vsec 0000:8e:00.1: can't change power state from D3cold to D0 (config space inaccessible)

Here, intel_vsec is a sibling of the GPU which, I believe, is used for
telemetry.

I'll be sure to include that additional information in the commit
message when respinning.

Thanks,

Lukas