Re: [PATCH] mac80211: Fix kernel hang on ax200 firmware crash.

Johannes Berg <johannes@xxxxxxxxxxxxxxxx> · Thu, 30 Jul 2020 15:11:10 +0200

On Thu, 2020-07-30 at 05:52 -0700, Ben Greear wrote:
> 
> > > +		if (++count > 1000) {
> > > +			/* WTF, bail out so that at least we don't hang the system. */
> > > +			sdata_err(sdata, "Could not move state after 1000 tries, ret: %d  state: %d\n",
> > > +				  ret, sta->sta_state);
> > > +			WARN_ON_ONCE(1);
> > > +			break;
> > > +		}
> > 
> > I guess that should be
> > 
> > if (WARN_ON_ONCE()) ...
> 
> If we spin 1000 times, it is worth a second warning.  Or do you mean
> the WARN_ON_ONCE(ret) should have if in front of it?

Ah. I missed the WARN_ON_ONCE(ret) entirely. I just meant that the
warning could/should be around the condition.

In fact though, even the message probably should:

	if (WARN_ONCE(++count > 1000, "...", ...))
		break;

That way the message would be captured inside the warning, which is
better for tooling that parses warnings.

> > 
> > etc.
> > 
> > >   		int err = drv_sta_state(sta->local, sta->sdata, sta,
> > >   					sta->sta_state, new_state);
> > > -		if (err)
> > > -			return err;
> > > +		if (err == -EIO) {
> > > +			/* Sdata-not-in-driver, we are out of sync, but probably
> > > +			 * best to carry on instead of bailing here, at least maybe
> > > +			 * we can clean this up.
> > > +			 */
> > 
> > It _could_ be the driver itself returning -EIO, so why not check the
> > sdata-in-driver flag?
> 
> Right, but if driver is complaining here, we need to bail out regardless of
> sdata-in-driver or not,

Yes. But I'm not sure we should WARN on that?

>  unless you think a driver could return EIO and then
> a small bit later start working for the same request?

Hah, no. If that's a possibility due to some stupid firmware reasons,
let the driver deal with it.

> > Really here that mostly applies to the commit log, which should probably
> > say something like
> > 
> > 	mac80211: deadlock due to driver misbehaviour
> > 
> > or so, and then go on to explain what it does in *mac80211*, and show
> > the iwlwifi parts only as an *example*.
> 
> Its not really driver mis-behaviour per se.  The root cause is that the
> firmware crashes too badly for the driver to recover (ok, so driver might
> could be better, but I've also seen cases where ath10k NIC falls off the PCI
> bus, so nothing the driver can do in that case I think).

Fair enough. We actually do have some code in there that tries to
unbind/rebind the driver from the device eventually, but that's
obviously a very last resort.

FWIW, we do have multiple NICs in a single machine, but then we run them
from VMs so each VM only has a single NIC. But I don't see why that
should be different from the device/firmware point of view. Perhaps your
PCIe configuration is different.

> Per my other patches, I've seen this sdata-in-driver crap in the past, so
> I think I probably hit a similar bug in both ax200 and ath10k, but since
> ax200 is so easy to crash, it is much more likely to hit this bug than any
> other driver I'm aware of.
> 
> I'll try to re-word the commit message though, I don't really care what it
> says so long as the code gets in.

:)

Thanks,
johannes