Re: IWL errors when reading PCI config through /sys

Hi,

I have new findings: I was mistaken when I put the blame on reading the PCI config of the device. After some
more digging (and many reboots), I can now trigger the bug at will. All I have to do is write "powersave" to
the pcie_aspm policy:
  echo powersave > /sys/module/pcie_aspm/parameters/policy

A few seconds after that I get the ol' crash:

04.11 21:54:03  root[11383]: Calling 'echo powersave | /sys/module/pcie_aspm/parameters/policy'
04.11 21:54:10  kernel: iwlwifi 0000:04:00.0: Error sending SYSTEM_STATISTICS_CMD: time out after 2000ms.
04.11 21:54:10  kernel: iwlwifi 0000:04:00.0: Current CMD queue read_ptr 309 write_ptr 310
04.11 21:54:10  kernel: iwlwifi 0000:04:00.0: Start IWL Error Log Dump:
04.11 21:54:10  kernel: iwlwifi 0000:04:00.0: Transport status: 0x0000004A, valid: -1
04.11 21:54:10  kernel: iwlwifi 0000:04:00.0: Loaded firmware version: 92.67ce4588.0 gl-c0-fm-c0-92.ucode
....

I tested it multiple times and it is definitely consistent.
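
For anyone who wants to reproduce this from a script rather than the shell, here is a minimal Python sketch of
reading and switching the policy. It assumes the usual sysfs format, where the file lists all policies on one
line and marks the active one with square brackets (e.g. "default performance [powersave] powersupersave").
The function names are made up for illustration, and writing the file needs root:

```python
from pathlib import Path

# Global ASPM policy knob (same path as in the echo command above).
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def parse_aspm_policy(text):
    """Return (active, available) parsed from the policy file contents.

    Assumes the active policy is the token wrapped in square brackets.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def set_aspm_policy(name, path=POLICY_PATH):
    """Write a policy name (needs root) and return the policy active afterwards."""
    path = Path(path)
    path.write_text(name + "\n")
    active, _ = parse_aspm_policy(path.read_text())
    return active
```

Calling set_aspm_policy("powersave") and then watching the kernel log for a few seconds should be equivalent
to the echo command above.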

I always thought that pcie_aspm/parameters/policy should be generally safe (as long as I use
default/performance/powersave and not pcie_aspm=force on the cmdline). Am I mistaken and should this setting
be avoided as unsafe?

If not, can I somehow help to get to the bottom of this?
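
One thing that might help narrow it down is checking which ASPM states are actually enabled on the WLAN link
before and after the policy change. Below is a minimal sketch (not the actual script) of decoding that from
the device's /config file. It assumes the standard layout: capability list pointer at offset 0x34, PCI Express
capability ID 0x10, and the ASPM control field in bits 1:0 of the Link Control register at offset 0x10 within
that capability. The function names are hypothetical, and reading past the first 64 bytes of /config generally
requires root:

```python
import struct

PCI_STATUS = 0x06            # Status register offset
PCI_STATUS_CAP_LIST = 0x10   # Status bit: capability list present
PCI_CAPABILITY_LIST = 0x34   # Offset of the first capability pointer
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_LNKCTL = 0x10        # Link Control offset inside the PCIe capability

ASPM_NAMES = {0: "disabled", 1: "L0s", 2: "L1", 3: "L0s+L1"}


def find_capability(cfg, cap_id):
    """Walk the capability list in raw config bytes; return the offset or None."""
    if len(cfg) < 64:
        return None
    (status,) = struct.unpack_from("<H", cfg, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    ptr = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against a malformed, looping list
    while ptr >= 0x40 and ptr + 2 <= len(cfg) and ptr not in seen:
        seen.add(ptr)
        if cfg[ptr] == cap_id:
            return ptr
        ptr = cfg[ptr + 1] & 0xFC
    return None


def aspm_state(cfg):
    """Return the enabled ASPM states as a string, or None if not readable."""
    cap = find_capability(cfg, PCI_CAP_ID_EXP)
    if cap is None or cap + PCI_EXP_LNKCTL + 2 > len(cfg):
        return None
    (lnkctl,) = struct.unpack_from("<H", cfg, cap + PCI_EXP_LNKCTL)
    return ASPM_NAMES[lnkctl & 0x3]


def read_aspm(bdf):
    """Read /config for a device such as '0000:04:00.0' and decode its ASPM state."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return aspm_state(f.read())
```

Running read_aspm("0000:04:00.0") (the address from the log above) once before and once after switching the
policy would show whether the link configuration really changes right before the firmware stops responding.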

Thanks
Jan


On Sun, 2024-11-03 at 13:52 +0100, Jan Šídlo wrote:
> Hello,
> 
> I'm not sure if this is the right place - if not, I'm sorry! This is the first time I'm trying to join a
> Linux mailing list, so I may have missed something or done something incorrectly. I'm not even sure if
> this is the right way to send a message, but I have to start somewhere :)
> 
> I'm trying to hunt down a few issues with my new-ish HP ZBook not wanting to go to deeper C-states, which is
> kind of painful for a laptop (battery drain is ~5-10%/hour). For this I created a little Python script that
> gathers info about all the components in the system and periodically reports their status (every 3s or so),
> including PCI and USB devices. To gather some information (specifically about ASPM) I'm reading and parsing
> the /config file for each PCI device in the /sys device tree. /config is not the only thing I read, but it
> is the prime suspect: when I excluded the WLAN card from this reading routine, the crash took much longer to
> occur (hours instead of minutes).
> 
> When I run this script, the IWL subsystem crashes after some time (minutes to hours). There is clearly
> something going on on the PCI bus that I don't really understand. Since the error I get from IWL keeps
> changing, I suspect there is some kind of race condition that is triggered by my script. I opened a bug [1]
> and after some back and forth with Emmanuel Grumbach, he said that this kind of error is caused by IWL not
> being able to talk to the WLAN device (at all), and suggested that I ask for your opinion on the matter :)
> 
> I have tried two different kernel versions (6.11.5 and 6.10.10), two different WLAN cards (BE200NGW and
> AX211NGW) and multiple versions of firmware for the cards. The error is still present, so I would say I
> need to dig deeper, but I'm not really familiar with the PCI subsystem or how to debug it efficiently given
> the amount of data going through.
> 
> What can I do to debug this issue further?
> 
> Thanks
> Jan
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=219457
> 
> 





