On Fri, Sep 20, 2019 at 12:12 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote: > > The problem is that new programs will have to try the new flag value > and, if it returns -EINVAL, fall back to 0. This isn't so great. Don't be silly. Of course they will do that, but so what? With a new kernel, they'll get the behavior they expect. And with an old kernel, they'll get the behavior they expect. They'd never fall back to to "0 means something I didn't want", exactly because we'd make this new flag be the first change. > Wait, are you suggesting that 0 means invoke jitter-entropy or > whatever and GRND_SECURE_BLOCKING means not wait forever and deadlock? > That's no good -- people will want to continue using 0 because the > behavior is better. I assume that "not wait forever" was meant to be "wait forever". So the one thing we have to do is break the "0 waits forever". I guarantee that will happen. I will override Ted if he just NAk's it, because we simply _cannot_ continue with it. So we absolutely _will_ come up with some way 0 ends the wait. Whether it's _just_ a timeout, or whether it's jitter-entropy or whatever, it will happen. But we'll also make getrandom(0) do the annoying warning, because it's just ambiguous. And I suspect you'll find that a lot of security people don't really like jitter-entropy, at least not in whatever cut-down format we'll likely have to use in the kernel. And we'll also have to make getrandom(0) be really _timely_. Security people would likely rather wait for minutes before they are happy with it. But because it's a boot constraint as things are now, it will not just be jitter-entropy, it will be _accelerated_ jitter-entropy in 15 seconds or whatever, and since it can't use up all of CPU time, it's realistically more like "15 second timeout, but less of actual CPU time for jitter". We can try to be clever with a background thread and a lot of yielding(), so that if the CPU is actually idle we'll get most of that 15 seconds for whatever jitter, but end result is that it's still accelerated. Do I believe we can do a good job in that kind of timeframe? Absolutely. The whole point should be that it's still "good enough", and as has been pointed out, that same jitter entropy that people are worried about is just done in user space right now instead. But do I believe that security people would prefer a non-accelerated GRND_SECURE_BLOCKING? Yes I do. That doesn't mean that GRND_SECURE_BLOCKING shouldn't use jitter entropy too, but it doesn't need the same kind of "let's hurry this up because it might be during early boot and block things". That said, if we can all convince everybody (hah!) that jitter entropy in the kernel would be sufficient, then we can make the whole point entirely moot, and just say "we'll just change crng_wait() to do jitter entropy instead and be done with it. Then any getrandom() user will just basically wait for a (very limited) time and the system will be happy. If that is the case we wouldn't need new flags at all. But I don't think you can make everybody agree to that, which is why I suspect we'll need the new flag, and I'll just take the heat for saying "0 is now off limits, because it does this thing that a lot of people dislike". > IMO this is confusing. The GRND_RANDOM flag was IMO a mistake and > should just be retired. Let's enumerate useful cases and then give > them sane values. That's basically what I'm doing. I enumerate the new values. But the enumerations have hidden meaning, because the actual bits do matter. The GRND_EXPLICIT bit isn't supposed to be used by any user, but it has the value it has because it makes old kernels return -EINVAL. But if people hate the bit names, we can just do an enum and be done with it: enum grnd_flags { GRND_NONBLOCK = 1, GRND_RANDOM, // Don't use! GRND_RANDOM_NONBLOCK, // Don't use GRND_UNUSED, GRND_INSECURE, GRND_SECURE_BLOCKING, GRND_SECURE_NONBLOCKING, }; but the values now have a _hidden_ pattern (because we currently have that "| GRND_NONBLOCK" pattern that I want to make sure still continues to work, rather than give unexpected behavior in case somebody continues to use it). So the _only_ difference between the above and what I suggested is that I made the bit pattern explicit rather than hidden in the value. > And the only real question is how to map existing users to these > semantics. I see two sensible choices: > > 1. 0 means "secure, blocking". I think this is not what we'd do if we > could go back in time and chage the ABI from day 1, but I think it's > actually good enough. As long as this mode won't deadlock, it's not > *that* bad if programs are using it when they wanted "insecure". It's exactly that "as long as it won't deadlock" that is our current problem. It *does* deadlock. So it can't mean "blocking" in any long-term meaning. It can mean "blocks for up to 15 seconds" or something like that. I'd honestly prefer a smaller number, but I think 15 seconds is an acceptable "your user space is buggy, but we won't make you think the machine hung". > 2. 0 means "secure, blocking, but warn". Some new value means > "secure, blocking, don't warn". The problem is that new applications > will have to fall back to 0 to continue supporting old kernels. The same comment about blocking. Maybe you came in in the middle, and didn't see the whole "reduced IO patterns means that boot blocks forever" part of the original problem. THAT is why 0 will absolutely change behaviour. Linus