Hi,

I have new findings - I was mistaken when I put the blame on the PCI config of the device. After some more
digging (and many reboots), I can now trigger the bug at will - all I have to do is write "powersave" to
the pcie_aspm policy:

    echo powersave > /sys/module/pcie_aspm/parameters/policy

A few seconds after that I get the ol' crash:

04.11 21:54:03 root[11383]: Calling 'echo powersave | /sys/module/pcie_aspm/parameters/policy'
04.11 21:54:10 kernel: iwlwifi 0000:04:00.0: Error sending SYSTEM_STATISTICS_CMD: time out after 2000ms.
04.11 21:54:10 kernel: iwlwifi 0000:04:00.0: Current CMD queue read_ptr 309 write_ptr 310
04.11 21:54:10 kernel: iwlwifi 0000:04:00.0: Start IWL Error Log Dump:
04.11 21:54:10 kernel: iwlwifi 0000:04:00.0: Transport status: 0x0000004A, valid: -1
04.11 21:54:10 kernel: iwlwifi 0000:04:00.0: Loaded firmware version: 92.67ce4588.0 gl-c0-fm-c0-92.ucode
....

I tested it multiple times and it is definitely consistent.

I always thought that pcie_aspm/parameters/policy should be generally safe (as long as I use
default/performance/powersave and not pcie_aspm=force on the cmdline). Am I mistaken and should this
setting be avoided as unsafe? If not, can I somehow help to get to the bottom of this?

Thanks
Jan

On Sun, 2024-11-03 at 13:52 +0100, Jan Šídlo wrote:
> Hello,
>
> I'm not sure if this is the right place - if not, I'm sorry! This is the first time I'm trying to join a
> Linux mailing list, so I may have missed something or done something incorrectly. I'm not even sure if
> this is the right way to send a message, but I have to start somewhere :)
>
> I'm trying to hunt down a few issues with my new-ish HP ZBook not wanting to go to deeper C-states, which
> is kind of painful for a laptop (battery drain is ~5-10%/hour). For this I created a small Python script
> that gathers info about all the components of the system, including PCI and USB devices, and periodically
> reports their status (every 3s or so). To gather some information (specifically about ASPM) I'm reading
> the /config file for each PCI device in the /sys device tree and parsing it. /config is not the only thing
> I read, but it is the prime suspect, because when I excluded the WLAN card from this reading routine, the
> crash took much longer to occur - hours instead of minutes.
>
> When I run this script, the IWL subsystem crashes after some time (minutes to hours). There is clearly
> something going on on the PCI bus that I don't really understand. Since the error I get from IWL keeps
> changing, I suspect there is some kind of race condition that is triggered by my script. I opened a bug
> [1] and after some back and forth with Emmanuel Grumbach, he said that this kind of error is caused by IWL
> not being able to talk to the WLAN device (at all), and that I should try to get your opinion on the
> matter :)
>
> I have tried two different kernel versions (6.11.5 and 6.10.10), two different WLAN cards (BE200NGW and
> AX211NGW) and multiple versions of firmware for the cards. The error is still present, so I would say I
> need to dig deeper, but I'm not really familiar with the PCI subsystem and how to debug it efficiently
> given the amount of data going through.
>
> What can I do to debug this issue further?
>
> Thanks
> Jan
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=219457
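
For reference, the kind of /config parsing described in the quoted message can be sketched roughly as below. This is not the actual script from the report, only a minimal illustration under stated assumptions: the function names aspm_state and read_config are hypothetical, and the offsets are the standard PCI ones (capability pointer at 0x34, PCI Express capability ID 0x10, Link Control at +0x10 with ASPM control in bits 1:0). Note that without root, sysfs typically exposes only the first 64 bytes of config space, in which case the sketch returns None.

```python
import struct

PCI_STATUS = 0x06            # Status register offset
PCI_STATUS_CAP_LIST = 0x10   # Status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34   # Offset of the first capability pointer
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_LNKCTL = 0x10        # Link Control offset inside the PCIe capability
ASPM_NAMES = {0: "disabled", 1: "L0s", 2: "L1", 3: "L0s+L1"}


def aspm_state(config: bytes):
    """Decode the ASPM control field from raw config-space bytes.

    Returns a string from ASPM_NAMES, or None if the buffer is too short
    or the device has no PCI Express capability.
    """
    if len(config) < 0x40:
        return None
    (status,) = struct.unpack_from("<H", config, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    ptr = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guard against a malformed, looping capability list
    while ptr >= 0x40 and ptr + 2 <= len(config) and ptr not in seen:
        seen.add(ptr)
        cap_id, next_ptr = config[ptr], config[ptr + 1]
        if cap_id == PCI_CAP_ID_EXP:
            off = ptr + PCI_EXP_LNKCTL
            if off + 2 > len(config):
                return None
            (lnkctl,) = struct.unpack_from("<H", config, off)
            return ASPM_NAMES[lnkctl & 0x3]
        ptr = next_ptr & 0xFC
    return None


def read_config(bdf: str) -> bytes:
    """Read the first 256 bytes of config space for a device such as 0000:04:00.0."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read(256)
```

Usage would be along the lines of aspm_state(read_config("0000:04:00.0")), using the device address that appears in the log above.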