On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
>
> On Fri, May 22, 2020 at 10:46:07PM +0200, Johannes Berg wrote:
> > FWIW, I still completely disagree on that taint. You (Luis) obviously
> > have been running into a bug in that driver, I doubt the firmware
> > actually managed to wedge the hardware.
>
> This hasn't happened just once, its happed many times sporadically now,
> once a week or two weeks I'd say. And the system isn't being moved
> around.
>
> > But even if it did, that's still not really a kernel taint. The kernel
> > itself isn't in any way affected by this.
>
> Of course it is, a full reboot is required.
>
> > Yes, the system is in a weird state now. But that's *not* equivalent to
> > "kernel tainted".
>
> Requiring a full reboot is a dire situation to be in, and loosing
> connectivity to the point this is not recoverable likewise.
>
> You guys are making out a taint to be the end of the world. We have a
> taint even for a kernel warning, and as others have mentioned mac80211
> already produces these.
>

I had to go RTFM re: kernel taints because it has been a very long time
since I looked at them. It had always seemed to me that most were caused
by "kernel-unfriendly" user actions. The most famous of course is loading
proprietary modules, out-of-tree modules, forced module loads, etc...
Honestly, I had forgotten the large variety of uses of the taint flags.
For anyone who hasn't looked at taints recently, I recommend:

https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html

In light of this I don't object to setting a taint on this anymore. I'm
a little uneasy, but I've softened on it now, and now I feel it depends
on implementation. Specifically, I don't think we should set a taint
flag when a driver easily handles a routine firmware crash and is
confident that things have come up just fine again.
In other words, triggering the taint in every driver module that logs
a firmware crash and a subsequent recovery seems too much. Sure,
firmware shouldn't crash, sure it should be open source so we can fix
it, whatever... those sorts of wishful comments simply ignore reality
and our ability to effect change. A lot of WiFi firmware crashes, and
for well-known cases the drivers handle it well. In some cases they
handle it not so well, and that is something the driver should detect
and raise a red flag about.

If a WiFi firmware crash can bring down the kernel, there's either a
major driver bug or some very funky hardware crap going on. That sort of
thing we should be able to detect, mark with a taint (or something), and
fix if within our sphere of influence.

I guess what it comes down to for me is how aggressive we are about
setting the flag.

I would like there to be a single solution, or a minimized set depending
on what makes sense for the requirements. I haven't had time to look
into the alternatives mentioned here so I don't have an informed opinion
about the solution. I do think Luis is trying to solve a real problem
though.

Can we look at this from the point of view of what the requirements are?
What is it we're trying to solve?

I _think_ that the goal of Luis's original proposal is to report up to
the user, at some future point when the user is interested (because
something super drastic just occurred, but long after the fw crash),
that there was a firmware crash, without the user having to grep through
all logs on the machine. And then if the user sees that flag and
suspects it, they can bother to find the fw crash in the logs or take
more drastic debugging steps like pulling firmware crash dumps, etc.

I think the various alternate solutions are great, but perhaps they
solve a superset of features (like adding in user-space notifications,
etc.)?
Perhaps different people on these related threads are trying to solve
different problems?

- Steve