On Fri, Sep 20, 2019 at 12:22:17PM -0700, Andy Lutomirski wrote:
> Perhaps userland could register a helper that takes over and does
> something better?

If userland sees the failure, it can do whatever the developer/distro
packager thought suitable for a system facing this condition.

> But I think the kernel really should do something
> vaguely reasonable all by itself.

Definitely, that's what Linus' proposal was doing. Sleeping for some
time is what I call "vaguely reasonable".

> If nothing else, we want the ext4
> patch that provoked this whole discussion to be applied,

Oh absolutely!

> which means
> that we need to unbreak userspace somehow, and returning garbage to it
> is not a good choice.

It depends how it's used. I'd claim that we certainly use random
numbers for other things (such as ASLR/hashtables) *before* using them
to generate long-lived keys, so we have a bit more time to collect
entropy before reaching the point of producing these keys.

> Here are some possible approaches that come to mind:
>
> int count;
> while (crng isn't inited) {
>         msleep(1);
> }
>
> and modify add_timer_randomness() to at least credit a tiny bit to
> crng_init_cnt.

Without a timeout, we will certainly still face situations where it
blocks forever, which is the current problem.

> Or we do something like intentionally triggering readahead on some
> offset on the root block device.

You don't necessarily have such a device, especially when you're in an
initramfs. This is precisely where userland can be smarter. When the
caller is sfdisk, for example, it has a better chance of being able to
perform I/O than when it's a tiny http server starting up to present a
configuration page.

> We should definitely not trigger *blocking* IO.

I think I agree.

> Also, I wonder if the real problem preventing the RNG from starting up
> is that the crng_init_cnt threshold is too high.  We have a rather
> baroque accounting system, and it seems like we can accumulate and
> credit entropy for a very long time indeed without actually
> considering ourselves done.

I have no opinion on this, lacking the skills to evaluate the
situation. What I can say for sure is that I've faced the non-booting
issue quite a number of times on headless systems, and conversely, in
the 2.4 era, my front reverse proxy at the time had the same SSH key as
89 other machines on the net. So there's surely a sweet spot to find
between those two extremes. I tend to think that waiting *a little bit*
for the *first* random number is acceptable, even 10-15s: by the time
the user starts to think about pressing the reset button, the system
might have finished booting.

Hashing some RAM locations and the RTC when present can also help a
little bit. Had my machine at the time at least combined the RTC's date
and time with the hash, the chance of a key collision would have
dropped to one in many thousands.

Willy
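
For illustration only, here is a minimal userland sketch of the "wait a
little bit, but not forever" idea discussed above. It assumes glibc 2.25
or later for getrandom(); getrandom_timeout() is a made-up name, not an
existing API, and the 10 ms polling step is arbitrary. What the caller
does after a timeout (retry later, refuse to generate long-lived keys,
and so on) is left to the caller.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical helper: try to fill buf with len random bytes (len <= 256,
 * so that a successful getrandom() call is not split into a short read),
 * polling in non-blocking mode for at most timeout_ms milliseconds.
 *
 * Returns 0 on success. Returns -1 with errno set to EAGAIN if the
 * kernel's pool was still not initialised when the timeout expired, or
 * -1 with another errno value on any other getrandom() failure.
 */
int getrandom_timeout(void *buf, size_t len, int timeout_ms)
{
        /* Sleep 10 ms between attempts; the step is an arbitrary choice. */
        const struct timespec step = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
        int waited_ms = 0;

        for (;;) {
                /* GRND_NONBLOCK makes the call fail with EAGAIN instead of
                 * sleeping while the pool is not yet initialised. */
                ssize_t n = getrandom(buf, len, GRND_NONBLOCK);

                if (n == (ssize_t)len)
                        return 0;
                if (n < 0 && errno != EAGAIN && errno != EINTR)
                        return -1;      /* real error, e.g. ENOSYS */
                if (waited_ms >= timeout_ms) {
                        errno = EAGAIN; /* gave up: pool still not ready */
                        return -1;
                }
                nanosleep(&step, NULL);
                waited_ms += 10;
        }
}
```

A caller generating a long-lived key could use something like
getrandom_timeout(key, 32, 15000) and treat an EAGAIN failure as "not
enough entropy yet" instead of either hanging forever or silently using
weak randomness.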