Re: IWL errors when reading PCI config through /sys

Jan Šídlo <me@xxxxxxxxx> · Tue, 05 Nov 2024 01:24:59 +0100

On Mon, 2024-11-04 at 17:33 -0600, Bjorn Helgaas wrote:
> It *should* be safe to read "config" from sysfs at any time, and also
> to write to the ASPM "policy" module parameter file at any time, but
> there could be bugs there.
> 
> When you say "crash", I guess you mean all the iwlwifi error logging
> and the WARN_ON() stacktraces, right?   I don't see an actual oops or
> panic in the logs yet.

Fortunately there is no crash in the form of a kernel oops :) But the WLAN card stops responding and the
iwlwifi driver is not able to recover. Only a shutdown fixes the issue. To be honest, I did not try a plain
reboot, as I thought a full power cycle was necessary to properly reset the device, but I can try that
tomorrow if necessary.

> I assume none of these happen unless you are running your script or
> writing the "policy" parameter?  Does the problem happen if you *only*
> run your script to scrape the info from "config"?  What about if you
> *only* update the "policy" parameter?

The error does not happen if I only read the config; I tested that properly. Without touching the ASPM
policy, the script runs without any problems. I can also trigger the bug immediately by writing "powersave"
to the ASPM policy without running the script at all.

> Emmanuel is right; the iwlwifi logging (e.g., "iwlwifi 0000:04:00.0:
> 0xFFFFFFFF | ADVANCED_SYSASSERT") sure looks like reads from the
> device are failing so we get ~0 data.  I'm guessing those come from a
> BAR, so the BAR could be disabled or the device might not be
> responding e.g., if it is in a low-power state (D1, D2, D3hot, D3cold)
> or being reset.

The device is reported as being in D0 through sysfs, but I'm not sure that is really correct, because if I
do echo 1 > remove and then rescan, the kernel complains about not being able to talk to the device. I can
get the exact error tomorrow if you'd like.
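
One rough way to cross-check the reported power state could be to look at the first dword of "config": if the
device is not responding, the vendor/device ID would be expected to read back as all ones. A minimal Python
sketch, assuming the usual sysfs layout and the 0000:04:00.0 address from the logs; the function names are
hypothetical and I have not run this against the failing card:

```python
import struct
from pathlib import Path

# Assumed sysfs directory for the card; the address is taken from the iwlwifi logs.
DEVICE_DIR = Path("/sys/bus/pci/devices/0000:04:00.0")


def config_reads_all_ones(config_bytes):
    """True if the vendor/device ID dword reads back as 0xffffffff."""
    if len(config_bytes) < 4:
        raise ValueError("need at least 4 bytes of config space")
    (id_dword,) = struct.unpack_from("<I", config_bytes, 0)
    return id_dword == 0xFFFFFFFF


def device_responds(device_dir=DEVICE_DIR):
    """Read the start of 'config' and report whether the device answered."""
    with open(Path(device_dir) / "config", "rb") as f:
        header = f.read(4)
    return not config_reads_all_ones(header)
```

Calling device_responds() before and after the policy write should show whether config space itself goes
away, or only the BAR accesses made by the driver.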

> I don't know whether iwlwifi checks for any PCIe failures like this.
> I see iwl_trans_is_hw_error_value(), but that must be for some
> iwlwifi-specific error thing, not for PCIe errors, because it checks
> for things like 0xa5a5a5a0.  For PCIe errors, we would see ~0
> (0xffffffff).
> 
> My guess is that all the other WARN()s and stacktraces are just a
> consequence of trying to do things to a device that isn't responding.

You are probably right. Emmanuel mentioned a similar thing: these errors are typically thrown when the iwl
driver cannot talk to the card at all.

> > I have tried two different kernel versions (6.11.5 and 6.10.10), two
> > different WLAN cards (BE200NGW and AX211NGW) and multiple versions
> > of firmware for the cards. The error is still present, so I would
> > say I'd need to dig deeper, but I'm not really familiar with PCI
> > subsystem and how to debug it efficiently given the amount of data
> > going through.
> > 
> > What can I do to debug this issue further?
> > 
> > 1 - https://bugzilla.kernel.org/show_bug.cgi?id=219457
> 
> Any clue if this is a regression?  Seems like a common device and we
> should have lots of reports.  That would suggest something related to
> scraping the "config" file or updating ASPM "policy" at runtime.  So
> I'd say the first step is to confirm that one or both of those is
> implicated.

Updating the policy definitely triggers this reliably (at least on my ZBook). Scraping config is fine; I
wrongly thought it was the problem (mea culpa) :)
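
For anyone who wants to reproduce this without my script, here is a minimal Python sketch of the policy
write. It assumes the parameter file is /sys/module/pcie_aspm/parameters/policy and that it lists all
policies with the active one in square brackets; the helper names are hypothetical:

```python
from pathlib import Path

# Assumed location of the ASPM "policy" module parameter.
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def parse_aspm_policy(text):
    """Return (active, available) from text like '[default] performance powersave'."""
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets marking the active policy
            active = token
        available.append(token)
    return active, available


def set_aspm_policy(name, path=POLICY_PATH):
    """Write a new policy (needs root) and return the previously active one."""
    active, available = parse_aspm_policy(Path(path).read_text())
    if name not in available:
        raise ValueError(f"unknown ASPM policy: {name!r}")
    Path(path).write_text(name)
    return active  # lets the caller restore the old policy afterwards
```

On my machine the equivalent of set_aspm_policy("powersave") is the step after which the card stops
responding.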

TLP (a common power management tool for laptops) can also write this policy, but I don't know if that is
enabled by default (I believe it is not), so that might be the reason there are not many of these reports.

I also believe that I did not encounter this error when I was using TLP on this laptop earlier, and I *had*
this policy enabled. However, I stopped using TLP some time ago (unfortunately I can't remember exactly when)
and switched to a set of custom scripts when I was trying to hunt down the shallow C-states/bad S0ix
residency.

I saw "queue Y is stuck" quite often, but these were not annoying enough to make me dig deeper, as iwl would
recover from them after an outage of just a few seconds.

I'm not sure yet if this is a regression, but I can test a few older kernel versions. Gentoo still has 5.10
and up, so it should not be that hard to try if you think it will help. :)

Thanks for your time!

Jan