[+cc Emmanuel]

On Sun, Nov 03, 2024 at 01:52:24PM +0100, Jan Šídlo wrote:
> ...

Thanks for the report! Also, thanks for mentioning the bugzilla
report here on the mailing list, since most subsystems don't actively
monitor bugzilla.

> I'm trying to hunt down few issues with my new-ish HP ZBook not
> wanting to go to deeper C-states, which is kind of painful for a
> laptop (battery drain is ~5-10%/hour). For this I created a little
> python script that gathers all the info about all the components
> from the system and periodically reports the status (every 3s or so)
> including PCI and USB devices. To gather some information
> (specifically about ASPM) I'm reading /config file for each PCI
> device in /sys device tree and parsing it. I'm not reading only
> /config but it is a prime suspect, because I excluded WLAN card from
> this reading routine and the crash took much longer to occur - hours
> instead of minutes.
>
> When I run this script, the IWL subsystem crashes after some time
> (minutes to hours). There is clearly something going on the PCI bus
> that I don't really understand. Since the error I get from IWL is
> changing, I suspect there is some kind of race condition that is
> triggered by my script. I opened a bug [1] and after some back and
> forth with Emmanuel Grumbach, he said that this kind of error is
> caused by IWL not being able to talk to the WLAN device (at all) and
> to try to get your opinion on the matter :)

It *should* be safe to read "config" from sysfs at any time, and also
to write to the ASPM "policy" module parameter file at any time, but
there could be bugs there.

When you say "crash", I guess you mean all the iwlwifi error logging
and the WARN_ON() stacktraces, right? I don't see an actual oops or
panic in the logs yet.

I assume none of these happen unless you are running your script or
writing the "policy" parameter? Does the problem happen if you *only*
run your script to scrape the info from "config"?
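For the config-only case, something like the following might be enough.
This is only a sketch, assuming the WLAN device is 0000:04:00.0 (the
address in the iwlwifi messages); the helper names are made up for
illustration and are not taken from your script. It reads "config" and
nothing else, and flags reads where the Vendor ID comes back as 0xffff.

```python
import time
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci/devices")  # standard sysfs location


def read_config(bdf: str, root: Path = SYSFS_PCI) -> bytes:
    """Return the raw config space bytes sysfs exposes for one device.

    Without root, only the first 64 bytes are returned.
    """
    return (root / bdf / "config").read_bytes()


def looks_unresponsive(config: bytes) -> bool:
    """True if the Vendor ID (bytes 0-1) reads as 0xffff.

    0xffff is not a valid Vendor ID, so it usually means the read
    got no response from the device.
    """
    return len(config) >= 2 and config[:2] == b"\xff\xff"


def poll(bdf: str, interval: float = 3.0, rounds: int = 10,
         root: Path = SYSFS_PCI) -> int:
    """Read config `rounds` times; return how many reads looked bad."""
    bad = 0
    for _ in range(rounds):
        if looks_unresponsive(read_config(bdf, root)):
            bad += 1
            print(f"{bdf}: config read returned all ones")
        time.sleep(interval)
    return bad
```

Calling poll("0000:04:00.0", rounds=1000) with everything else in your
script disabled would exercise only the "config" path. The 3 second
default interval mirrors the period you mentioned.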
What about if you *only* update the "policy" parameter?

Emmanuel is right; the iwlwifi logging (e.g., "iwlwifi 0000:04:00.0:
0xFFFFFFFF | ADVANCED_SYSASSERT") sure looks like reads from the
device are failing so we get ~0 data. I'm guessing those come from a
BAR, so the BAR could be disabled or the device might not be
responding e.g., if it is in a low-power state (D1, D2, D3hot, D3cold)
or being reset.

I don't know whether iwlwifi checks for any PCIe failures like this.
I see iwl_trans_is_hw_error_value(), but that must be for some
iwlwifi-specific error thing, not for PCIe errors, because it checks
for things like 0xa5a5a5a0. For PCIe errors, we would see ~0
(0xffffffff).

My guess is that all the other WARN()s and stacktraces are just a
consequence of trying to do things to a device that isn't responding.

> I have tried two different kernel versions (6.11.5 and 6.10.10), two
> different WLAN cards (BE200NGW and AX211NGW) and multiple versions
> of firmware for the cards. The error is still present, so I would
> say I'd need to dig deeper, but I'm not really familiar with PCI
> subsystem and how to debug it efficiently given the amount of data
> going through.
>
> What can I do to debug this issue further?
>
> 1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457

Any clue if this is a regression? Seems like a common device and we
should have lots of reports. That would suggest something related to
scraping the "config" file or updating ASPM "policy" at runtime. So
I'd say the first step is to confirm that one or both of those is
implicated.

Bjorn
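To check the "policy" half on its own, a sketch along these lines
might do. It assumes the parameter lives at
/sys/module/pcie_aspm/parameters/policy and that the file lists every
policy with the active one in brackets; the function names are
hypothetical, and writing the real file needs root.

```python
from pathlib import Path

POLICY = Path("/sys/module/pcie_aspm/parameters/policy")


def parse_policy(text: str) -> tuple[str, list[str]]:
    """Split the policy file text into (active, all_names).

    The kernel lists every policy and brackets the active one,
    e.g. "[default] performance powersave powersupersave".
    """
    names, active = [], ""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            word = word[1:-1]
            active = word
        names.append(word)
    return active, names


def set_policy(name: str, path: Path = POLICY) -> None:
    """Write a policy name after checking it is one the file offers."""
    _, names = parse_policy(path.read_text())
    if name not in names:
        raise ValueError(f"unknown ASPM policy: {name!r}")
    path.write_text(name)
```

Switching between two of the listed policies in a loop, with the
"config" scraping turned off, would show whether the policy update
alone is enough to trigger the iwlwifi errors.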