Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))

So I think the failure mode may be related in part to DP tunneling, too. I finally got another lockup (this time after a hibernate, which I guess uses some of the same facilities). What was different this time, compared with the times I couldn't reproduce the lockups, was that I had an external USB-C monitor connected when I resumed, which is also what happens when I use my CalDigit dock: when I'm home (where I sometimes forget to remove the NVMe USB4 adaptor) I always have my monitor connected to the dock.

See attached dump log. I'm using the (somewhat still experimental) Xe display driver, but I've seen this same lockup happen with i915.

In any case, I've now reverted 9d573d19, and when I get back to my CalDigit I can try instrumenting the code paths in the commit and see exactly where we're locking up.

-K

On 2/26/25 13:14, Kenneth Crudup wrote:
OK, I just did a resume after being suspended for an hour (the duration somehow seems to matter). My CalDigit dock and the ASMedia NVMe adaptor were both attached at suspend but disconnected on resume, and I am indeed locked up.

I can attach the "pstore" report if necessary.
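
For anyone collecting the same data, here is a minimal sketch of bundling pstore records into a tarball named like the attachment in this thread (pstore-YYYYMMDDhhmm.tar.bz2). It assumes pstore is mounted at /sys/fs/pstore and that tar has bzip2 support; the function names are made up for illustration, and reading the records usually needs root.

```shell
#!/bin/sh
# Sketch: bundle pstore records into a timestamped tarball.
# Assumptions: pstore is mounted at /sys/fs/pstore and tar supports -j.
# Function names are illustrative only.

# Build the archive name from a timestamp (defaults to the current time).
pstore_archive_name() {
    stamp="${1:-$(date +%Y%m%d%H%M)}"
    printf 'pstore-%s.tar.bz2\n' "$stamp"
}

# Archive every record under the pstore directory into the current directory
# and print the name of the archive that was written.
archive_pstore() {
    src="${1:-/sys/fs/pstore}"
    out="$(pstore_archive_name)"
    tar -cjf "$out" -C "$src" . && printf '%s\n' "$out"
}

# Only touch the system when explicitly asked to: sh pstore.sh run
if [ "${1:-}" = "run" ]; then
    archive_pstore
fi
```

Saved as, say, pstore.sh, it would be run as "sh pstore.sh run" after the reboot that follows the lockup.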

Unfortunately I won't be able to get back to the CalDigit until Saturday afternoon California time.

I'll be trying all the reverts/commits listed herein and at least check for regressions in other cases, though.

-Kenny

On 2/26/25 00:44, Mika Westerberg wrote:
Hi Kenneth,

On Fri, Feb 14, 2025 at 09:39:33AM -0800, Kenneth Crudup wrote:

This is excellent news that you were able to reproduce it- I'd figured this
regression would have been caught already (as I do remember this working
before) and was worried it may have been specific to a particular piece of
hardware (or software setup) on my system.

I'll see what I can dig up on my end, but as I'm not expert in these
subsystems I may not be able to diagnose anything until your return.

[Back now]

My git bisect ended up at this commit:

   9d573d19547b ("PCI: pciehp: Detect device replacement during system sleep")

Adding Lukas who is the expert.

My steps to reproduce on Intel Meteor Lake based reference system are:

1. Boot the system up, nothing connected.
2. Once up, connect Thunderbolt 4 dock and Thunderbolt 3 NVMe in a chain:

   [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]

3. Authorize PCIe tunnels (with whatever your distro provides; my buildroot
     just has the debugging tools, so I run 'tbauth -r 301')

4. Check that the PCIe topology matches what is expected (lspci)

5. Enter s2idle:

   # rtcwake -s 30 -mmem

6. Once it is suspended, unplug the cable between the host and the dock.

7. Wait for the resume to happen.

Expectation: The system wakes up fine, notices that the TB and PCIe devices
are gone, stays responsive and usable.

Actual result: Resume never completes.
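
Steps 4 to 7 above could be scripted roughly as follows. This is only a sketch: it assumes root, lspci and rtcwake being available, and the function names are hypothetical. When the bug triggers, the script never gets past the rtcwake call, which is itself the signal.

```shell
#!/bin/sh
# Sketch of steps 4-7: snapshot the PCIe topology, suspend, compare afterwards.
# Assumptions: run as root; lspci and rtcwake are installed.
# Function names are illustrative only.

# Write the PCIe tree to the given file.
snapshot_topology() {
    lspci -t > "$1"
}

# Succeeds (exit 0) when the two snapshots differ, i.e. devices came or went.
topology_changed() {
    ! cmp -s "$1" "$2"
}

run_cycle() {
    before="$(mktemp)"
    after="$(mktemp)"
    snapshot_topology "$before"
    # Same command as step 5. Which sleep state "mem" maps to depends on
    # /sys/power/mem_sleep. Unplug the host-to-dock cable while suspended.
    rtcwake -s 30 -m mem
    snapshot_topology "$after"
    if topology_changed "$before" "$after"; then
        echo "resume completed; topology changed:"
        diff "$before" "$after"
    else
        echo "resume completed; topology unchanged"
    fi
    rm -f "$before" "$after"
}

# Only suspend the machine when explicitly asked to: sh repro.sh run
if [ "${1:-}" = "run" ]; then
    run_cycle
fi
```

With the cable pulled during suspend, the expected outcome is the "topology changed" branch, with the dock and NVMe devices missing from the second snapshot.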

I added "no_console_suspend" to the command line and then did sysrq-w to
get the list of blocked tasks. I've attached it just in case it is needed.
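
A small pre-suspend check along these lines may help when repeating this. It is a sketch with made-up function names: it verifies that "no_console_suspend" is on the running kernel's command line and enables all SysRq functions (needs root), so that the 'w' key (show blocked tasks) is honoured from the keyboard or a serial console once resume hangs.

```shell
#!/bin/sh
# Sketch: pre-suspend checks so SysRq-w output is visible during a hung resume.
# Function names are illustrative only.

# Succeeds if the kernel command line in $1 contains the parameter in $2
# as a whole, space-separated word.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

prepare_debug() {
    if cmdline_has "$(cat /proc/cmdline)" no_console_suspend; then
        echo "no_console_suspend is set"
    else
        echo "no_console_suspend is missing from the kernel command line" >&2
    fi
    # Writing 1 enables all SysRq functions, including 'w' (needs root).
    echo 1 > /proc/sys/kernel/sysrq
}

# Only touch the system when explicitly asked to: sh sysrq-prep.sh run
if [ "${1:-}" = "run" ]; then
    prepare_debug
fi
```

If a shell is still reachable, "echo w > /proc/sysrq-trigger" produces the same blocked-task dump without the keyboard.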

If I revert the above commit the issue is gone. Now I'm not sure if this is
exactly the same issue that you are seeing, but nevertheless this is a fairly
normal use case, so definitely something we should get fixed.

Lukas, if you need any more information let me know. I can reproduce this
easily.

I also saw some DRM/connected fixes posted to Linus' master so maybe one of them corrects this new display-crash issue (I'm not home on my big monitor
to be able to test yet).

-Kenny

On 2/14/25 08:29, Mika Westerberg wrote:
Hi,

On Thu, Feb 13, 2025 at 11:19:35AM -0800, Kenneth Crudup wrote:

On 2/13/25 05:59, Mika Westerberg wrote:

Hi,

As Murphy's Law would have it, now my crashes are display-driver related (this
is Xe, but I've also seen it with i915).

Attached here just for the heck of it, but I'll be better testing the NVMe
enclosure-related failures this weekend. Stay tuned!

Okay, I checked quickly and there is no TB related crash there, but I was
actually able to reproduce a hang when I unplug the device chain during
suspend. I have not yet had time to look into it deeper. I'm sure this has
been working fine in the past, as we tested all kinds of topologies,
including ones similar to this.

I will be out next week for vacation but will continue after that if the
problem is not already solved ;-)


--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA



Attachment: pstore-202502262249.tar.bz2
Description: application/bzip

