Hi Brian, On Wed, Dec 4, 2024 at 10:04 AM Brian Norris <briannorris@xxxxxxxxxxxx> wrote: > > Hi Doug, > > On Tue, Dec 03, 2024 at 05:38:55PM -0800, Doug Anderson wrote: > > On Tue, Dec 3, 2024 at 3:05 PM Brian Norris <briannorris@xxxxxxxxxxxx> wrote: > > > The Kconfig default is 120 seconds, and it's hidden under > > > CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds > > > pretty small for triggering panics.) The smallest I can see outside of > > > ChromeOS is 12 seconds, although 60 seconds is much more common. I > > > also happen to see other WiFi drivers have hit similar problems, but > > > they settled on 20 seconds (assuming a 60s timeout on other distros): > > > https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@xxxxxxxxxx/ > > > https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17 > > > > > > Technically, this Kconfig lets you set a value as small as 1 second. I > > > don't think we should work at reducing all timeouts to fit that > > > target. > > > > > > I could maybe approve lowering this one, but I'd also recommend > > > reconsidering your timeout value. And more questions below. > > > > That's fair. I guess having a 10 second timeout for full system > > suspend didn't seem totally crazy to me. If a system is taking more > > than 10 seconds to do a full system suspend then that seems like > > something is pretty broken. I guess it's somewhat like the same > > argument that the WiFi driver had for picking 10 seconds but applied > > to the whole system level, and I guess that's where we get into > > trouble. That made me think that even 5 seconds seems a bit long for > > any given driver to suspend. ...but yeah, it's squishy. > > 10 seconds is likely that *something* is wrong (or at least suboptimal), > but IMO, it's not quite at unreasonable levels. But yes, my point was > mainly that it's squishy, and I personally wouldn't want to be the one > running with the lowest CONFIG_DPM_WATCHDOG_TIMEOUT out there, given the > known behavior of multiple drivers and the timeout-means-panic behavior. > > > Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog > > timer and that's a better solution and leave the WiFi driver how it > > is? > > That seems reasonable. > > To be clear, I'm OK with this patch, if we can get a little more > confidence in it (like the timing data and HW info). I *think* 5 vs 10 > isn't a big deal here, but I don't have much other than my guess at the > moment. > > > Another thought: I wonder if it's possible to be dynamic and somehow > > set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've > > You probably meant min()? > > > checked to see if the mwifiex can actually query the DPM watchdog > > timeout... ;-) > > Yeah, I wondered similarly -- or in reverse, if we could somehow "pat" > the watchdog or prime it with an expected timeout. But AFAICT, neither > such feature exists today. > > > ...also, it sure seems like if we're going to choose a value so low > > that we shouldn't panic. All of our other watchdogs that panic aren't > > so short, so probably this one shouldn't be either. Maybe we could > > submit a patch to make the DPM watchdog just abort the suspend if > > that's not too hard (and if the power people accept it). > > Yeah, if you made the watchdog just interrupt suspend and dump some > warnings, then the effect would be pretty similar to this patch. > > > > I wonder what the distribution of (successful) timing is today. I'd > > > assume this typically take on the order of milliseconds, but do we > > > commonly see outliers? What if a system is fully loaded while > > > suspending? > > > > I would hope this doesn't affect things from the DPM watchdog, but I > > haven't researched. Hopefully the DPM watchdog starts after userspace > > is frozen so the system being fully loaded shouldn't matter? > > I was just throwing out ideas, but I didn't specifically mean user > space. You provided a few more ideas. Anyway, I was just fishing for > *some* attempt at real-world (and, as-bad-as-you-can-simulate world) > numbers, since that's the point of a timeout like this. > > > > Can you try testing (and gather timing numbers) when > > > suspending soon after initiating scans? It's hard to judge what the > > > lower limit of this timeout should really be without any numbers, just > > > like it's hard to judge whether your 10 second watchdog is reasonable. > > > > Pin-yen: is this something you could gather? I tried entering suspend right after wifi scans, and the time spent in mwifiex_enable_hs() is always around 100ms. It seems initiating suspend does not increase the execution time for mwifiex_enable_hs(), so I think the driver is capable of interrupting a scan. > > > > > > > Also, for the record, since we might have to field regression reports > > > for other systems: what hardware (chip variant, FW version) are you > > > seeing problems on? > > > > Pin-yen: I'm assuming you'll provide this.