On Fri, Sep 27, 2024 at 5:28 PM, Lukas Wunner <lukas@xxxxxxxxx> wrote:
>
> On Fri, Sep 27, 2024 at 03:33:50PM +0800, AceLan Kao wrote:
> > Lukas Wunner <lukas@xxxxxxxxx> 2024-9-26 9:23
> > > On Thu, Sep 26, 2024 at 08:59:09PM +0800, Chia-Lin Kao (AceLan) wrote:
> > > > Remove unnecessary pci_walk_bus() call in pciehp_resume_noirq(). This
> > > > fixes a system hang that occurs when resuming after a Thunderbolt dock
> > > > with attached thunderbolt storage is unplugged during system suspend.
> > > >
> > > > The PCI core already handles setting the disconnected state for devices
> > > > under a port during suspend/resume.
> > > >
> > > > The redundant bus walk was
> > > > interfering with proper hardware state detection during resume, causing
> > > > a system hang when hot-unplugging daisy-chained Thunderbolt devices.
> > I have no good answer for you now.
> > After enabling some debugging options and debugging lock options, I
> > still didn't get any message.
>
> Have you tried "no_console_suspend" on the kernel command line?
> >
> > ubuntu@localhost:~$ lspci -tv
> > -[0000:00]-+-00.0  Intel Corporation Device 6400
> >            +-02.0  Intel Corporation Lunar Lake [Intel Graphics]
> >            +-04.0  Intel Corporation Device 641d
> >            +-05.0  Intel Corporation Device 645d
> >            +-07.0-[01-38]--
> >            +-07.2-[39-70]----00.0-[3a-70]--+-00.0-[3b]--
> >            |                               +-01.0-[3c-4d]--
> >            |                               +-02.0-[4e-5f]----00.0-[4f-50]----01.0-[50]----00.0  Phison Electronics Corporation E12 NVMe Controller
> >            |                               +-03.0-[60-6f]--
> >            |                               \-04.0-[70]--
> >
> > This is the Dell WD22TB dock
> > 39:00.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] [8086:0b26] (rev 03)
> >         Subsystem: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] [8086:0000]
> >
> > This is the TBT storage that connects to the dock
> > 50:00.0 Non-Volatile memory controller [0108]: Phison Electronics
> > Corporation E12 NVMe Controller [1987:5012] (rev 01)
> >         Subsystem: Phison Electronics Corporation E12 NVMe Controller [1987:5012]
> >         Kernel driver in use: nvme
> >         Kernel modules: nvme
>
> The lspci output shows another PCIe switch in-between the WD22TB dock and
> the NVMe drive (bus 4e and 4f).  Is that another Thunderbolt device?
> Or is the NVMe drive built into the WD22TB dock and the switch at bus
> 4e and 4f is a non-Thunderbolt PCIe switch in the dock?
>
> I realize now that commit 9d573d19547b ("PCI: pciehp: Detect device
> replacement during system sleep") is a little overzealous because it
> not only reacts to *replaced* devices but also to *unplugged* devices:
> If the device was unplugged, reading the vendor and device ID returns
> 0xffff, which is different from the cached value, so the device is
> assumed to have been replaced even though it's actually been unplugged.
>
> The device replacement check runs in the ->resume_noirq phase.  Later on
> in the ->resume phase, pciehp_resume() calls pciehp_check_presence() to
> check for unplugged devices.
> Commit 9d573d19547b inadvertently reacts
> before pciehp_check_presence() gets a chance to react.  So that's something
> that we should probably change.
>
> I'm not sure though why that would cause a hang.  But there is a known issue
> that a deadlock may occur when hot-removing nested PCIe switches (which is
> what you've got here).  Keith Busch recently re-discovered the issue.
> You may want to try if the hang goes away if you apply this patch:
>
> https://lore.kernel.org/all/20240612181625.3604512-2-kbusch@xxxxxxxx/
>
> If it does go away then at least we know what the root cause is.
Yes, the 2 patches work.

>
> The patch is a bit hackish, but there's an ongoing effort to tackle the
> problem more thoroughly:
>
> https://lore.kernel.org/all/20240722151936.1452299-1-kbusch@xxxxxxxx/
> https://lore.kernel.org/all/20240827192826.710031-1-kbusch@xxxxxxxx/
v2 can't be applied cleanly, so I made some changes.
And this series doesn't work for me.

>
> Thanks,
>
> Lukas
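
The difference between a *replaced* and an *unplugged* device described above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the code from commit 9d573d19547b: the names slot_change, naive_is_replaced() and classify_slot() are hypothetical, and it assumes the vendor and device ID are read as one 32-bit value (vendor in the low half, device in the high half) that comes back as all ones when no device responds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of an ID read when nothing responds on the bus. */
#define ID_ALL_ONES 0xffffffffu

/* Hypothetical names for this sketch; these are not kernel identifiers. */
enum slot_change { SLOT_UNCHANGED, SLOT_UNPLUGGED, SLOT_REPLACED };

/* Naive check: any mismatch with the cached ID counts as "replaced". */
bool naive_is_replaced(uint32_t cached_id, uint32_t read_id)
{
	return read_id != cached_id;
}

/* Stricter check: an all-ones read means "unplugged", not "replaced". */
enum slot_change classify_slot(uint32_t cached_id, uint32_t read_id)
{
	if (read_id == ID_ALL_ONES)
		return SLOT_UNPLUGGED;
	if (read_id != cached_id)
		return SLOT_REPLACED;
	return SLOT_UNCHANGED;
}

/* Self-check using the IDs from the lspci output in this thread. */
void self_check(void)
{
	const uint32_t nvme = 0x50121987u;   /* [1987:5012] Phison E12 */
	const uint32_t bridge = 0x0b268086u; /* [8086:0b26] Goshen Ridge */

	/* The naive check misreports an unplugged device as replaced. */
	assert(naive_is_replaced(nvme, ID_ALL_ONES));
	assert(classify_slot(nvme, ID_ALL_ONES) == SLOT_UNPLUGGED);
	assert(classify_slot(nvme, bridge) == SLOT_REPLACED);
	assert(classify_slot(nvme, nvme) == SLOT_UNCHANGED);
}
```

With the stricter classification, an all-ones read could be left for the later presence check in the ->resume phase to handle, rather than being treated as a replacement in ->resume_noirq. Whether that is the right fix for the kernel is a separate question from this sketch.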