On Sun, Sep 15, 2019 at 11:59:41AM -0700, Linus Torvalds wrote:
> > In addition, since you're leaving the door open to bikeshed around
> > the timeout value, I'd say that while 30s is usually not huge in a
> > desktop system's life, it actually is a lot in network environments
> > when it delays a switchover.
>
> Oh, absolutely.
>
> But in that situation you have a MIS person on call, and somebody who
> can fix it.
>
> It's not like switchovers happen in a vacuum. What we should care
> about is that updating a kernel _works_. No regressions. But if you
> have some five-nines setup with switchover, you'd better have some
> competent MIS people there too. You don't just switch kernels without
> testing ;)

I mean maybe I didn't use the right term, but typically in networked
environments you'll have watchdogs on sensitive devices (e.g. the
default gateways and load balancers), which will trigger an instant
reboot of the system if something really bad happens. It can range from
a dirty oops, FS remounted R/O, pure freeze, OOM, missing process, panic
etc. And here the reset, which used to take roughly 10s to get all the
services back up for operations, suddenly takes 40s. My point is that I
won't have issues explaining to users that 10s or 13s is the same when
they rely on five nines, but trying to argue that 40s is identical to
10s will be a hard position to stand by.

And actually there are other dirty cases. Such systems often work in
active-backup or active-active modes. One typical issue is that the
primary system reboots, the second takes over within one second, and
once the primary system is back and *apparently* operating, some
processes which appear to be present, and which possibly have already
bound their listening ports, are waiting for 30s in getrandom() while
the monitoring systems around see them as ready. Thus the primary
machine goes back to its role and cannot reliably run the service for
the first 30 seconds, which roughly multiplies the downtime by 30.

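To make the failure mode above concrete, here is a minimal sketch of how
a service could notice that getrandom() would still block before it
advertises itself as ready. This is only an illustration of the problem
from the application side, not the kernel change being discussed here;
the helper name entropy_ready() is made up, and it assumes a libc that
provides <sys/random.h> (glibc 2.25 or later).

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/random.h>   /* getrandom(), GRND_NONBLOCK */

/*
 * Hypothetical helper (not part of any existing daemon):
 * returns 1 if the kernel CRNG is initialized and getrandom() will
 * not block, 0 if a blocking getrandom() call would currently stall,
 * and -1 on any other error (e.g. syscall unavailable).
 */
static int entropy_ready(void)
{
	unsigned char probe;
	ssize_t n;

	/* GRND_NONBLOCK makes the call fail with EAGAIN instead of waiting. */
	do {
		n = getrandom(&probe, sizeof(probe), GRND_NONBLOCK);
	} while (n < 0 && errno == EINTR);

	if (n == (ssize_t)sizeof(probe))
		return 1;
	if (n < 0 && errno == EAGAIN)
		return 0;
	return -1;
}
```

A daemon could call this before binding its listening ports or before
answering health checks, so that the monitoring side does not hand the
role back to a machine that is still stuck waiting for entropy.
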
That's why I'd like to make it possible to lower this value (either
permanently or via the cmdline, as I think that can be fine for all
those who care about downtime).

Willy