Hi, On Tue, Dec 3, 2024 at 3:05 PM Brian Norris <briannorris@xxxxxxxxxxxx> wrote: > > Hi, > > On Wed, Nov 27, 2024 at 7:44 AM Doug Anderson <dianders@xxxxxxxxxxxx> wrote: > > On Wed, Nov 27, 2024 at 2:58 AM Pin-yen Lin <treapking@xxxxxxxxxxxx> wrote: > > > In commit 52250cbee7f6 ("mwifiex: use timeout variant for > > > wait_event_interruptible") it was noted that sometimes we seemed > > > to miss the signal that our host sleep settings took effect. A > > > 10 second timeout was added to the code to make sure we didn't > > > hang forever waiting. It appears that this problem still exists > > > and we hit the timeout sometimes for Chromebooks in the field. > > > > > > Recently on ChromeOS we've started setting the DPM watchdog > > > to trip if full system suspend takes over 10 seconds. Given > > > the timeout in the original patch, obviously we're hitting > > > the DPM watchdog before mwifiex gets a chance to timeout. > > The Kconfig default is 120 seconds, and it's hidden under > CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds > pretty small for triggering panics.) The smallest I can see outside of > ChromeOS is 12 seconds, although 60 seconds is much more common. I > also happen to see other WiFi drivers have hit similar problems, but > they settled on 20 seconds (assuming a 60s timeout on other distros): > https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@xxxxxxxxxx/ > https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17 > > Technically, this Kconfig lets you set a value as small as 1 second. I > don't think we should work at reducing all timeouts to fit that > target. > > I could maybe approve lowering this one, but I'd also recommend > reconsidering your timeout value. And more questions below. That's fair. I guess having a 10 second timeout for full system suspend didn't seem totally crazy to me. If a system is taking more than 10 seconds to do a full system suspend then that seems like something is pretty broken. I guess it's somewhat like the same argument that the WiFi driver had for picking 10 seconds but applied to the whole system level, and I guess that's where we get into trouble. That made me think that even 5 seconds seems a bit long for any given driver to suspend. ...but yeah, it's squishy. Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog timer and that's a better solution and leave the WiFi driver how it is? Another thought: I wonder if it's possible to be dynamic and somehow set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've checked to see if the mwifiex can actually query the DPM watchdog timeout... ;-) ...also, it sure seems like if we're going to choose a value so low that we shouldn't panic. All of our other watchdogs that panic aren't so short, so probably this one shouldn't be either. Maybe we could submit a patch to make the DPM watchdog just abort the suspend if that's not too hard (and if the power people accept it). > > > While we could increase the DPM watchdog in ChromeOS to avoid > > > this problem, it's probably better to simply decrease the > > > timeout. Any time we're waiting several seconds for the > > > firmware to respond it's likely that the firmware won't ever > > > respond. With that in mind, decrease the timeout in mwifiex > > > from 10 seconds to 5 seconds. > > > > > > Suggested-by: Doug Anderson <dianders@xxxxxxxxxxxx> > > > Signed-off-by: Pin-yen Lin <treapking@xxxxxxxxxxxx> > > > --- > > > > > > drivers/net/wireless/marvell/mwifiex/sta_ioctl.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > I believe Brian Norris is still anointed as the personally nominally > > in charge of mwifiex upstream (get_maintainer labels him as "odd" > > fixer, which seems awfully judgemental), so he should be CCed on > > fixes. Added him. > > I tend to see mwifiex patches even if I don't get CC'd, but thanks. I > wonder why get_maintainer(?) picked up Francesco properly but not me. > *shrug* > > > > diff --git a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c > > > index e06a0622973e..f79589cafe57 100644 > > > --- a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c > > > +++ b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c > > > @@ -545,7 +545,7 @@ int mwifiex_enable_hs(struct mwifiex_adapter *adapter) > > > > > > if (wait_event_interruptible_timeout(adapter->hs_activate_wait_q, > > > adapter->hs_activate_wait_q_woken, > > > - (10 * HZ)) <= 0) { > > > + (5 * HZ)) <= 0) { > > > > Given that I suggested this fix, it should be no surprise: > > > > Reviewed-by: Douglas Anderson <dianders@xxxxxxxxxxxx> > > > > Upon sleeping on the idea, the only slight concern I have here is > > whether we'll trigger this timeout if we try to suspend while WiFi > > scanning is in progress. I don't have any actual evidence supporting > > this concern, but I remember many years ago when I used to deal with > > the WiFi drivers more often there were cases where suspend could be > > delayed if it happened while a scan was in progress. Maybe/hopefully > > that's fixed now, but I figured I'd at least bring it up in case it > > tickled anything in someone's mind... > > Scans should essentially have been canceled by now, but IIUC, the > driver doesn't really force the card to stop if it's in the middle of > a scan (I'm not 100% sure if it's possible). So it's possible that > pending scans could delay this step. > > I wonder what the distribution of (successful) timing is today. I'd > assume this typically take on the order of milliseconds, but do we > commonly see outliers? What if a system is fully loaded while > suspending? I would hope this doesn't affect things from the DPM watchdog, but I haven't researched. Hopefully the DPM watchdog starts after userspace is frozen so the system being fully loaded shouldn't matter? Things I could believe mattering are things like: * If memory is full and something in the suspend patch allocates a big chunk of memory then maybe (??) that could slow things down? * If lots of USB devices are connected that could slow things down. * If there are a large number of WiFi networks or APs or Bluetooth devices I could believe that could cause suspend problems. > Can you try testing (and gather timing numbers) when > suspending soon after initiating scans? It's hard to judge what the > lower limit of this timeout should really be without any numbers, just > like it's hard to judge whether your 10 second watchdog is reasonable. Pin-yen: is this something you could gather? > Also, for the record, since we might have to field regression reports > for other systems: what hardware (chip variant, FW version) are you > seeing problems on? Pin-yen: I'm assuming you'll provide this. -Doug