OK, I may not be explaining the history properly, so more background:
(I tend to run Linus' master that I pull every few days, partially
'cause I like to see all the new fixes and features, and partially
'cause over the years I'll stumble over bugs and help the subsystems'
Maintainer(s) fix the problems.)
Anyway, late last year I'd noticed (it wasn't happening before) that
once I'd get to the office, my laptop would be hard-hung on resume,
which I eventually traced back to having my NVMe adaptor connected to my
TB Dock when I suspended/hibernated. I'd started to try to bisect it,
but couldn't find a good starting point (or picked one too far back) and
would have to give up 'cause I'd run out of time. However, I'd mentioned
the issue on the mailing lists, hoping for a solution, and that's when
you'd discovered 9d573d19.
But between your NVMe discovery (and by this time I was mostly :(
careful about disconnecting the NVMe adaptor before suspend) and
sometime around the beginning of the year I was also getting occasional
hard-hangs on resume even if I hadn't had the NVMe adaptor connected on
suspend. I'd seen where the pstore dumps were pointing to the display
driver, so I'd switched back to the i915 from the xe driver, but that
hadn't fixed it either. In the meantime, having seen one of the OOPses
be in __tb_path_deactivate_hop(), I'd dropped some printks (actually
"tb_port_info()", I think) at various points printing the line# so I
could try and tell approximately where the crash occurred (yeah, I know
I need to get my ksymoops up and running :) ). I hadn't yet made the
correlation with whether an external monitor was connected, and having
seen a number of xe/i915/dp/Thunderbolt changes come thru, was hoping
either that the bug would get reported and fixed, or that I'd find time
to work out why it was happening via my tracing.
So by late February I had two failure modes in Linus' master:
- 9d573d19 (NVMe adaptor connected on suspend causing an OOPS on resume)
- d6d458d4 (OOPS if external USB-C DP monitor connected on resume)
I couldn't/didn't recognize the 2nd issue fully until you'd discovered
the cause of the first one.
At home I have a Samsung Odyssey monitor connected via a USB-C-to-DP 2.1
cable to a TB port on a CalDigit TB4 dock.
My travel bag has a generic Chinese USB-C DP tunneling portable monitor
which is usually connected to a Plugable TB hub.
In any case, the resume failures happen with either one.
On 3/3/25 03:53, Mika Westerberg wrote:
> I thought the system resumes fine after you reverted the other commit
> (9d573d19), no? Just you don't get display tunneled so for example if you
> login over ethernet (ssh) you should still be able to get full dmesg.

Nah, it usually hard hangs if a monitor is connected when I resume; has
to be power-cycled at that point.

> We can actually take PCIe out of the equation so that you ask "boltctl" to
> forget the device temporarily (or from the GNOME settings "privacy and
> security" -> "Thunderbolt" then "forget device" for each). This means your
> docks do not work fully but display should and then we hopefully can get
> the dmesg.

Well, my topology is almost always Laptop -> Dock -> Monitor.
This workflow came about ironically enough 'cause my client has given me
a MS Surface (Windows) machine with only one TB/USB-C port, and since I
will physically switch to using my own machine, to minimize setup
changes I just use the "one cable for all" approach (i.e., never
connecting the external monitor to the other TB port on my XPS-9320).
Oh, and the failure mode for d6d458d4 is ALWAYS this, and always(?) from
line 436/7 of ".../drivers/thunderbolt/path.c", a call to tb_port_write():
----
<4>[ 236.546634][ T4600] Oops: general protection fault, probably for
non-canonical address 0xba65fbf27d6de496: 0000 [#1] PREEMPT SMP
<4>[ 236.546646][ T4600] CPU: 7 UID: 0 PID: 4600 Comm: systemd-sleep
Tainted: G S U W 6.14.0-rc4-kenny+ #10
<4>[ 236.546655][ T4600] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER, [W]=WARN
<4>[ 236.546657][ T4600] Hardware name: Dell Inc. XPS 9320/0KNXGD, BIOS
2.18.1 12/24/2024
<4>[ 236.546660][ T4600] RIP: 0010:__tb_path_deactivate_hop+0x11/0x49a
<4>[ 236.546673][ T4600] Code: f5 f5 db 00 5a 48 8d 65 e8 5b 41 5c 41
5d 5d c3 b8 ed ff ff ff c3 0f 1f 00 55 48 89 e5 41 57 41 56 41 55 41 54
53 48 83 ec 18 <4c> 8b 47 20 48 85 ff 65 4c 8b 34 25 28 00 00 00 4c 89
75 d0 49 89
<4>[ 236.546677][ T4600] RSP: 0018:ffffbe85080a77f0 EFLAGS: 00010286
<4>[ 236.546682][ T4600] RAX: ffff957ee8373a20 RBX: 0000000000000000
RCX: 0000000000000002
<4>[ 236.546686][ T4600] RDX: 000000000000007d RSI: 0000000011000010
RDI: ba65fbf27d6de476
<4>[ 236.546689][ T4600] RBP: ffffbe85080a7830 R08: 0000000000000000
R09: ffffffff84255760
<4>[ 236.546691][ T4600] R10: 0000000000000000 R11: 0000000000000000
R12: ffff957ee8373a00
<4>[ 236.546693][ T4600] R13: 0000000000000000 R14: ffffbe85080a78a0
R15: ffffbe85080a7820
<4>[ 236.546696][ T4600] FS: 00007f2fcaa4a940(0000)
GS:ffff9585af5c0000(0000) knlGS:0000000000000000
<4>[ 236.546700][ T4600] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 236.546703][ T4600] CR2: 0000000000000000 CR3: 00000001f0833002
CR4: 0000000000770ef0
<4>[ 236.546705][ T4600] PKRU: 55555554
<4>[ 236.546707][ T4600] Call Trace:
<4>[ 236.546710][ T4600] <TASK>
<4>[ 236.546713][ T4600] ? show_regs.part.0+0x1d/0x20
<4>[ 236.546722][ T4600] ? die_addr.cold+0x8/0xd
<4>[ 236.546729][ T4600] ? exc_general_protection+0x1c0/0x490
<4>[ 236.546740][ T4600] ? asm_exc_general_protection+0x27/0x30
<4>[ 236.546747][ T4600] ? __tb_path_deactivate_hop+0x11/0x49a
<4>[ 236.546754][ T4600] __tb_path_deactivate_hops.cold+0x2e/0xaa
<4>[ 236.546760][ T4600] tb_path_deactivate+0x1e/0x110
<4>[ 236.546769][ T4600] tb_tunnel_deactivate+0x65/0x120
<4>[ 236.546775][ T4600] tb_resume_noirq+0xc2/0x2a0
<4>[ 236.546779][ T4600] tb_domain_resume_noirq+0x3f/0x60
<4>[ 236.546787][ T4600] nhi_resume_noirq+0x34/0x90
<4>[ 236.546795][ T4600] pci_pm_restore_noirq+0x71/0xc0
<4>[ 236.546801][ T4600] ? new_id_store+0x1b0/0x1b0
<4>[ 236.546807][ T4600] dpm_run_callback+0x40/0xb0
<4>[ 236.546812][ T4600] device_resume_noirq+0xc4/0x2a0
<4>[ 236.546817][ T4600] dpm_noirq_resume_devices+0x11b/0x150
<4>[ 236.546822][ T4600] dpm_resume_start+0xc/0x30
<4>[ 236.546827][ T4600] hibernation_snapshot+0x26d/0x430
<4>[ 236.546835][ T4600] hibernate.cold+0x9c/0x333
<4>[ 236.546840][ T4600] state_store+0xbe/0xc0
<4>[ 236.546845][ T4600] kobj_attr_store+0xf/0x20
<4>[ 236.546854][ T4600] sysfs_kf_write+0x34/0x40
<4>[ 236.546861][ T4600] kernfs_fop_write_iter+0x134/0x1e0
<4>[ 236.546868][ T4600] vfs_write+0x244/0x410
<4>[ 236.546878][ T4600] ksys_write+0x63/0xd0
<4>[ 236.546885][ T4600] __x64_sys_write+0x14/0x20
<4>[ 236.546892][ T4600] x64_sys_call+0x9eb/0xa00
<4>[ 236.546899][ T4600] do_syscall_64+0x63/0xf0
<4>[ 236.546906][ T4600] ? do_syscall_64+0x6f/0xf0
<4>[ 236.546913][ T4600] ? do_filp_open+0xbe/0x170
<4>[ 236.546919][ T4600] ? from_kgid_munged+0xd/0x20
<4>[ 236.546924][ T4600] ? cp_new_stat+0x14a/0x180
<4>[ 236.546931][ T4600] ? do_wp_page+0x7f3/0xe80
<4>[ 236.546936][ T4600] ? ___pte_offset_map+0x17/0xe0
<4>[ 236.546944][ T4600] ? __handle_mm_fault+0xb13/0x1160
<4>[ 236.546951][ T4600] ? __count_memcg_events+0x49/0xe0
<4>[ 236.546956][ T4600] ? handle_mm_fault+0x181/0x2a0
<4>[ 236.546961][ T4600] ? irqentry_exit+0x4a/0x60
<4>[ 236.546964][ T4600] ? exc_page_fault+0x196/0x5c0
<4>[ 236.546972][ T4600] entry_SYSCALL_64_after_hwframe+0x4b/0x53
<4>[ 236.546977][ T4600] RIP: 0033:0x7f2fca926274
<4>[ 236.546984][ T4600] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66
2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d f5 2d 0f 00 00 74 13 b8 01
00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83
ec 20 48 89
<4>[ 236.546987][ T4600] RSP: 002b:00007ffec678fb58 EFLAGS: 00000202
ORIG_RAX: 0000000000000001
<4>[ 236.546992][ T4600] RAX: ffffffffffffffda RBX: 0000000000000005
RCX: 00007f2fca926274
<4>[ 236.546994][ T4600] RDX: 0000000000000005 RSI: 000055f4304eb730
RDI: 0000000000000007
<4>[ 236.546996][ T4600] RBP: 00007ffec678fb80 R08: 0000000000000000
R09: 0000000000000001
<4>[ 236.546998][ T4600] R10: 000055f4304eb720 R11: 0000000000000202
R12: 0000000000000005
<4>[ 236.547000][ T4600] R13: 000055f4304eb730 R14: 000055f4304e12a0
R15: 00007f2fcaa0fea0
<4>[ 236.547004][ T4600] </TASK>
<4>[ 236.547006][ T4600] Modules linked in: vmw_vmci btusb btintel
snd_soc_sof_sdw snd_soc_sdw_utils snd_sof_probes iwlmvm mei_hdcp mei_pxp
mac80211 snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl
snd_sof_intel_hda_generic snd_sof_pci soundwire_intel
soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_common
snd_soc_hdac_hda iwlwifi snd_sof_intel_hda_mlink snd_sof_intel_hda
mei_me cfg80211 ov01a10 xe drm_ttm_helper gpu_sched drm_suballoc_helper
drm_gpuvm drm_exec i915 drm_buddy intel_gtt drm_display_helper cec ttm
<4>[ 236.547061][ T4600] ---[ end trace 0000000000000000 ]---
----
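
For what it's worth, the faulting address in the Oops above can be
cross-checked against the register dump. The sketch below is only a
sanity check, not a diagnosis: it assumes 48-bit virtual addresses (the
LA57 bit, bit 12, is clear in the CR4 value shown), it decodes only the
single instruction form found at the <4c> marker in the Code: line, and
the helper names are made up for illustration. It says nothing about
which structure field lives at the offset.

```python
# Values copied from the Oops above.
FAULT_ADDR = 0xBA65FBF27D6DE496  # "non-canonical address" in the first line
RDI = 0xBA65FBF27D6DE476         # RDI from the register dump
INSN = bytes([0x4C, 0x8B, 0x47, 0x20])  # bytes starting at <4c> in "Code:"

REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) are all zero or all one."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def decode_mov_load_disp8(insn):
    """Decode only 'REX.W mov r64, [base + disp8]'; reject anything else."""
    rex, opcode, modrm, disp = insn
    if (rex & 0xF8) != 0x48 or opcode != 0x8B:
        raise ValueError("not a REX.W mov r64, r/m64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("not a plain [base + disp8] operand")
    dest = REGS[reg | ((rex & 0x4) << 1)]   # REX.R extends the reg field
    base = REGS[rm | ((rex & 0x1) << 3)]    # REX.B extends the r/m field
    disp8 = disp - 256 if disp >= 128 else disp  # sign-extend the byte
    return dest, base, disp8


dest, base, disp = decode_mov_load_disp8(INSN)
print(f"faulting insn: mov {dest}, [{base} + {disp:#x}]")
print(f"RDI + disp == fault address: {RDI + disp == FAULT_ADDR}")
print(f"fault address canonical: {is_canonical(FAULT_ADDR)}")
```

If that decoding is right, the very first load in
__tb_path_deactivate_hop() (at +0x11) dereferences its first argument,
and that argument is already a garbage pointer when the function is
entered.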
-Kenny
--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange
County CA