On Tue, Nov 05, 2024 at 01:24:59AM +0100, Jan Šídlo wrote:
> On Mon, 2024-11-04 at 17:33 -0600, Bjorn Helgaas wrote:
> > It *should* be safe to read "config" from sysfs at any time, and
> > also to write to the ASPM "policy" module parameter file at any
> > time, but there could be bugs there.
> >
> > When you say "crash", I guess you mean all the iwlwifi error
> > logging and the WARN_ON() stacktraces, right? I don't see an
> > actual oops or panic in the logs yet.
>
> There is no crash in form of an oops from the kernel fortunately :)
> But the WLAN card stops talking & IWL driver is not able to recover.
> Only shutdown fixes the issue. I did not try just reboot to be
> honest as I thought that full powercycle is necessary to properly
> reset the device - but I can try tomorrow if necessary.
>
> > I assume none of these happen unless you are running your script
> > or writing the "policy" parameter? Does the problem happen if you
> > *only* run your script to scrape the info from "config"? What
> > about if you *only* update the "policy" parameter?
>
> The error does not happen if I read the config - I tested that
> properly. Without touching the ASPM policy the script is able to run
> without any problems. And also I can trigger the bug immediately
> when I write "powersave" to the ASPM policy without the script.

Perfect, thanks for narrowing that down!

> > Emmanuel is right; the iwlwifi logging (e.g., "iwlwifi
> > 0000:04:00.0: 0xFFFFFFFF | ADVANCED_SYSASSERT") sure looks like
> > reads from the device are failing so we get ~0 data. I'm guessing
> > those come from a BAR, so the BAR could be disabled or the device
> > might not be responding e.g., if it is in a low-power state (D1,
> > D2, D3hot, D3cold) or being reset.
>
> Device is reported being in D0 through the sysfs, but I'm not sure
> if that is really correct, because if I do echo 1 > remove and
> rescan, the kernel complains about not being able to talk to the
> device. I can get the exact error tomorrow if you'd like.

It's unavoidably racy to read the current state from config space.

But since you've identified the write to "policy" in
pcie_aspm_set_policy() as the critical item, I think that's the place
to look.

We had some recent issues related to configuring ASPM while the device
was in a low-power state, e.g.,

  https://lore.kernel.org/linux-pci/20240130163519.GA521777@bhelgaas/

While pcie_config_aspm_link() does check dev->current_state, I don't
see anything that would prevent the power management framework from
changing the power state while we're configuring devices to match the
new ASPM state.

Bjorn