Re: [PATCH] wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s

Doug Anderson <dianders@xxxxxxxxxxxx> · Fri, 6 Dec 2024 08:09:49 -0800

Hi,

On Fri, Dec 6, 2024 at 3:21 AM Pin-yen Lin <treapking@xxxxxxxxxxxx> wrote:
>
> Hi Doug,
>
> On Fri, Dec 6, 2024 at 1:13 AM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Thu, Dec 5, 2024 at 5:46 AM Pin-yen Lin <treapking@xxxxxxxxxxxx> wrote:
> > >
> > > Hi Doug,
> > >
> > > On Thu, Dec 5, 2024 at 2:11 AM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > > > >  Can you try testing (and gather timing numbers) when
> > > > > > > > suspending soon after initiating scans? It's hard to judge what the
> > > > > > > > lower limit of this timeout should really be without any numbers, just
> > > > > > > > like it's hard to judge whether your 10 second watchdog is reasonable.
> > > > > > >
> > > > > > > Pin-yen: is this something you could gather?
> > > > >
> > > > > I tried entering suspend right after wifi scans, and the time spent in
> > > > > mwifiex_enable_hs() is always around 100ms. It seems initiating
> > > > > suspend does not increase the execution time for mwifiex_enable_hs(),
> > > > > so I think the driver is capable of interrupting a scan.
> > > > > > >
> > > > > > >
> > > > > > > > Also, for the record, since we might have to field regression reports
> > > > > > > > for other systems: what hardware (chip variant, FW version) are you
> > > > > > > > seeing problems on?
> > > > > > >
> > > > > > > Pin-yen: I'm assuming you'll provide this.
> > > > >
> > > > > From the debugfs entry:
> > > > >
> > > > > driver_name = "mwifiex"
> > > > > driver_version = mwifiex 1.0 (15.68.19.p54)
> > > > > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54
> > > > >
> > > > > The compatible string of the DT is "marvell,sd8897".
> > > > > >
> > > > > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide
> > > > > > the above to provide a little more confidence, or if you want to
> > > > > > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT.
> > > > >
> > > > > It looks okay to me to decrease the timeout here given that scanning
> > > > > doesn't seem to affect the suspend time. What's your thought, Doug?
> > > >
> > > > I think Brian is right and that we should change how we're using the
> > > > DPM watchdog, but IMO that doesn't mean that we couldn't also change
> > > > this timeout.
> > > >
> > > > If nothing else, you'd want to post a v2 of your patch containing a
> > > > summary of the info you've gathered so it's recorded with the patch.
> > > >
> > > > Maybe what makes the most sense, though, is to put our money where our
> > > > mouth is and land some sort of patch in the ChromeOS tree and then
> > > > report back to upstream in a month when we have data about it. If
> > > > things look good in the field then presumably that would give some
> > > > confidence for upstream to be willing to land the patch?
> > > >
> > > > Probably about the best data we could gather would be to break the
> > > > existing timeout into two halves. You could wait half the time, then
> > > > print a warning, and then wait the other half the time. We could even
> > > > land that change _without_ changing the timeout to 5 seconds. Then if
> > > > we ever see the warning print but then the suspend succeeds we know
> > > > that there are cases where waiting longer would have helped. If we
> > > > made that a WARN_ON() our existing infrastructure would help us gather
> > > > that info...
> > >
> > > I just realized that mwifiex_wait_queue_complete() has another 12s
> > > timeout[1], though they are not directly involved in suspend/resume.
> > >
> > > Should we also add a warning to that and see if we can make it half?
> > > This starts to make me think how many timeouts we want to consider.
> > > Does it make sense to only focus on the one in mwifiex_enable_hs()? Or
> > > should we check other timeouts in mwifiex or even other drivers?
> > >
> > > [1]: https://elixir.bootlin.com/linux/v6.12.1/source/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c#L51
> >
> > IMO any of these "arbitrary and really long timeout to make sure we
> > don't hang forever" type things probably should have a warning so we
> > know if we're getting close to the timeout. It wouldn't be very hard
> > to make a wrapper around wait_event_interruptible_timeout() to handle
> > this. I suppose that wrapper could just be local to mwifiex, though if
> > other drivers found it useful it would make sense to put it somewhere
> > more common.
> >
> > That being said, if we aren't actually hitting these other timeouts I
> > don't know that we need to do an extensive audit right now to find
> > every one of them.
> >
> > Of course, Brian also said he'd be OK with just setting the timeout to
> > 5 seconds based on the research you've already done. In that case I
> > don't know if we'd want a WARNing after 2.5 seconds. Maybe? 2.5
> > seconds is still pretty long, but I don't know if it's WARN-worthy. It
> > could at least be dev_warn() worthy...
> >
>
> Sounds like we will start to binary search the timeout...

I wouldn't bother binary searching the timeout and in general I
wouldn't try to tune these timeouts unless we think there's a real
benefit to it. Mostly, though, I like the idea of having a warning
_before_ a failure occurs. Like if I have a retry loop that will retry
10 times, I like to warn after I need 5 retries so that it's obvious
we're getting close. If I expect a timeout to never be needed, I'd
also like some warning that says we got close to the maximum timeout.
All of this makes it much easier to debug problems if/when we actually
hit the failure case. It also has the benefit that if we need to tune
a timeout / number of retries then we have more confidence that we
were nowhere near the max, but IMO that wouldn't be the primary
benefit.

> Then, I would like to just send out a v2 for this with an updated
> commit message.

Sure, though maybe it's not urgent? We've changed the DPM watchdog in
ChromeOS to 15 seconds so at least we won't be hitting it there. While
I agree that the 10 second timeout is still likely overkill and that 5
should be fine, gathering more data wouldn't hurt.

I'd be really curious how the system behaves after the timeout hits,
too. Does the next suspend actually succeed? Does WiFi recover or is
it busted until the next reboot?

> It sounds like it will be endless binary searches on
> all those timeouts in the kernel if we really want to create a wrapper
> around wait_event_interruptible_timeout(), and all the efforts only
> make us a few more seconds faster on the system.

My goal in general isn't to optimize the system but to know when we're
close to failure so it's easier to debug when the timeout occurs.