On Thu, 2020-07-30 at 05:52 -0700, Ben Greear wrote: > > > > + if (++count > 1000) { > > > + /* WTF, bail out so that at least we don't hang the system. */ > > > + sdata_err(sdata, "Could not move state after 1000 tries, ret: %d state: %d\n", > > > + ret, sta->sta_state); > > > + WARN_ON_ONCE(1); > > > + break; > > > + } > > > > I guess that should be > > > > if (WARN_ON_ONCE()) ... > > If we spin 1000 times, it is worth a second warning. Or do you mean > the WARN_ON_ONCE(ret) should have if in front of it? Ah. I missed the WARN_ON_ONCE(ret) entirely. I just meant that the warning could/should be around the condition. In fact though, even the message probably should: if (WARN_ONCE(++count > 1000, "...", ...)) break; That way the message would be captured inside the warning, which is better for tooling that parses warnings. > > > > etc. > > > > > int err = drv_sta_state(sta->local, sta->sdata, sta, > > > sta->sta_state, new_state); > > > - if (err) > > > - return err; > > > + if (err == -EIO) { > > > + /* Sdata-not-in-driver, we are out of sync, but probably > > > + * best to carry on instead of bailing here, at least maybe > > > + * we can clean this up. > > > + */ > > > > It _could_ be the driver itself returning -EIO, so why not check the > > sdata-in-driver flag? > > Right, but if driver is complaining here, we need to bail out regardless of > sdata-in-driver or not, Yes. But I'm not sure we should WARN on that? > unless you think a driver could return EIO and then > a small bit later start working for the same request? Hah, no. If that's a possibility due to some stupid firmware reasons, let the driver deal with it. > > Really here that mostly applies to the commit log, which should probably > > say something like > > > > mac80211: deadlock due to driver misbehaviour > > > > or so, and then go on to explain what it does in *mac80211*, and show > > the iwlwifi parts only as an *example*. > > Its not really driver mis-behaviour per se. The root cause is that the > firmware crashes too badly for the driver to recover (ok, so driver might > could be better, but I've also seen cases where ath10k NIC falls off the PCI > bus, so nothing the driver can do in that case I think). Fair enough. We actually do have some code in there that tries to unbind/rebind the driver from the device eventually, but that's obviously a very last resort. FWIW, we do have multiple NICs in a single machine, but then we run them from VMs so each VM only has a single NIC. But I don't see why that should be different from the device/firmware point of view. Perhaps your PCIe configuration is different. > Per my other patches, I've seen this sdata-in-driver crap in the past, so > I think I probably hit a similar bug in both ax200 and ath10k, but since > ax200 is so easy to crash, it is much more likely to hit this bug than any > other driver I'm aware of. > > I'll try to re-word the commit message though, I don't really care what it > says so long as the code gets in. :) Thanks, johannes