resending in plain text... (hope got it right)

On Mon, May 9, 2022 at 11:15 AM Yevgeniy Dodis <dodis@xxxxxxxxxx> wrote:
>
> Hi Jason and all.
>
> Thank you for starting this fascinating discussion. I generally agree with everything Jason said. In particular, I am not
> 100% convinced that the extra cost of the premature next defense is justified. (Although Windows and MacOS are adamant it is
> worth it :).)
>
> But let me give some meta points to at least convince you this is not as obvious as Jason makes it sound.
>
> 1) Attacking RNGs in any model is really hard. Heck, everybody knew for years that /dev/random is a mess
> (and we published it formally in 2013, although this was folklore knowledge), but in all these years nobody
> (even Nadya's group :)) managed to find a practical attack. So just because the attack seems far-fetched, I do not think we should
> lower our standards and do ugly stuff. Otherwise, just leave /dev/random the way it was before Jason started his awesome work.
>
> 2) As Jason says, there are two distinct attack vectors needed to mount the premature next attack.
> A) compromising the state
> B) (nearly) continuously observing RNG outputs
>
> I agree with Jason's point that finding places where
> -- A)+B) is possible, but
> -- A)+A) is not possible,
> is tricky. Although Nadya kind of indicated a place like that. VM1 and VM2 start with the same RNG state (for whatever
> reason). VM1 is insecure, so can leak the state via A). VM2 is more secure, but obviously allows for B) through the system
> interface. This does not seem so hypothetical to me, especially in light of my meta-point 1) above -- almost any real-world
> RNG attack is hard.
>
> But I want to look at it from a different angle here. Let's ask if RNGs should be secure against A) or B) individually.
>
> I think everybody agrees protection from B) is a must. This is the most basic definition of RNG! So let's just take it as
> an axiom.
>
> Protection against A) is trickier.
> But my read of Jason's email is that all his criticism comes exactly from this point.
> If your system allows for state compromise, you have bigger problems than the premature next, etc. But let's ask ourselves
> the question. Are we ready to design RNGs without recovery from state compromise? I believe nobody on this list would
> be comfortable saying "yes". Because this would mean we don't need to accumulate entropy beyond system start-up.
> Once we reach the point of good initial state, and state compromise is not an issue, just use straight ChaCha or whatever other
> stream cipher.
>
> The point is, despite all the arguments Jason puts forward, we all would feel extremely uncomfortable/uneasy to let continuous
> entropy accumulation go, right?
>
> This means we all hopefully agree that we need protection against A) and B) individually.
>
> 3) Now comes the question. If we want to design a sound RNG using tools of modern cryptography, and we allow
> the attacker the capability to carry out A) or B) individually, are we comfortable with the design where we:
> * offer protection against A)
> * offer protection against B)
> * do NOT offer protection against A)+B), because we think it's too expensive given A)+B) is so rare?
>
> I do not have a convincing answer to this question, but it is at least not obvious to me. On a good note, one worry
> we might have is how to even have a definition protecting against A), protecting against B), but not protecting against A)+B).
> Fortunately, our papers resolve this question (although there are still theoretical annoyances which I do not
> want to get into in this email). So, at least from this perspective, we are good. We have a definition with
> exactly these (suboptimal) properties.
>
> Anyway, these are my 2c.
> Thoughts?
>
> Yevgeniy
>
> On Sun, May 1, 2022 at 7:17 AM Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>>
>> Hi Ted,
>>
>> That's a useful analysis; thanks for that.
>>
>> On Sat, Apr 30, 2022 at 05:49:55PM -0700, tytso wrote:
>> > On Wed, Apr 27, 2022 at 03:58:51PM +0200, Jason A. Donenfeld wrote:
>> > >
>> > > 3) More broadly speaking, what kernel infoleak is actually acceptable to
>> > > the degree that anybody would feel okay in the first place about the
>> > > system continuing to run after it's been compromised?
>> >
>> > A one-time kernel infoleak where this might seem most likely is one
>> > where memory is read while the system is suspended/hibernated, or if
>> > you have a VM which is frozen and then replicated. A related version
>> > is one where a VM is getting migrated from one host to another, and
>> > the attacker is able to grab the system memory from the source "host"
>> > after the VM is migrated to the destination "host".
>>
>> You've identified ~two places where compromises happen, but it's not an
>> attack that can just be repeated simply by re-running `./sploit > state`.
>>
>> 1) Virtual machines:
>>
>> It seems like after a VM state compromise during migration, or during
>> snapshotting, the name of the game is getting entropy into the RNG in a
>> usable way _as soon as possible_, and not delaying that. This is
>> Nadia's point. There's some inherent tension between waiting some amount
>> of time to use all available entropy -- the premature next requirement
>> -- and using everything you can as fast as you can because your output
>> stream is compromised/duplicated and that's very bad and should be
>> mitigated ASAP at any expense.
>>
>> [I'm also CC'ing Tom Ristenpart, who's been following this thread, as he
>> did some work regarding VM snapshots and compromise, and what RNG
>> recovery in that context looks like, and arrived at pretty similar
>> points.]
>>
>> You mentioned virtio-rng as a mitigation for this. That works, but only
>> if the data read from it are actually used rather quickly. So probably
>> /waiting/ to use that is suboptimal.
>>
>> One of the things added for 5.18 is this new "vmgenid" driver, which
>> responds to fork/snapshot notifications from hypervisors, so that VMs
>> can do something _immediately_ upon resumption/migration/etc. That's
>> probably the best general solution to that problem.
>>
>> Though vmgenid is supported by QEMU, VMware, Hyper-V, and hopefully soon
>> Firecracker, there'll still be people that don't have it for one reason
>> or another (and it has to be enabled manually in QEMU with
>> `-device vmgenid,guid=auto`; perhaps I should send a patch adding that
>> to some default machine types). Maybe that's their problem, but I take
>> it as your point that we can still try to be less bad than otherwise by
>> using more entropy more often, and not delaying as the premature next
>> model requirements would have us do.
>>
>> 2) Suspend / hibernation:
>>
>> This is kind of the same situation as virtual machines, but the
>> particulars are a little bit different:
>>
>> - There's no hypervisor giving us new seed material on resumption like
>>   we have with VM snapshots and vmgenid; but
>>
>> - We also always know when it happens, because it's not transparent to
>>   the OS, so at least we can attempt to do something immediately like
>>   we do with the vmgenid driver.
>>
>> Fortunately, most systems that are doing suspend or hibernation these
>> days also have an RDRAND-like thing. It seems like it'd be a good idea
>> for me to add a PM notifier, mix into the pool both
>> ktime_get_boottime_ns() and ktime_get(), in addition to whatever type
>> info I get from the notifier block (suspend vs hibernate vs whatever
>> else) to account for the amount of time in the sleeping state, and then
>> immediately reseed the crng, which will pull in a bunch of
>> RDSEED/RDRAND/RDTSC values. This way on resumption, the system is always
>> in a good place.
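
[A rough userspace sketch of the resume-time idea in the paragraph above,
in Python rather than kernel C. It is not the eventual patch: `ToyPool`,
`on_pm_event`, and the event string are hypothetical names, and BLAKE2s
is just a convenient stand-in for the pool's mixing function. The
timestamps are passed in as made-up integers; in the kernel they would
come from ktime_get_boottime_ns(), which keeps counting across suspend,
and ktime_get(), which does not, so the pair encodes how long the system
was asleep.]

```python
import hashlib


class ToyPool:
    """Toy entropy pool plus output key; illustration only."""

    def __init__(self, seed: bytes):
        self.pool = hashlib.blake2s(seed).digest()
        self.key = self.pool

    def mix(self, data: bytes) -> None:
        # Fold new input into the pool without touching the output key yet.
        self.pool = hashlib.blake2s(self.pool + data).digest()

    def reseed(self) -> None:
        # Derive a fresh output key from the pool right away.
        self.key = hashlib.blake2s(self.pool, person=b"reseed").digest()


def on_pm_event(pool: ToyPool, event: str, boottime_ns: int, monotonic_ns: int) -> None:
    # Mix in the event type and both clocks, then reseed immediately.
    pool.mix(event.encode())
    pool.mix(boottime_ns.to_bytes(8, "little"))
    pool.mix(monotonic_ns.to_bytes(8, "little"))
    pool.reseed()


# Two machines restored from the same hibernation image, resumed after
# different amounts of sleep (all values are made up).
image = b"state saved in a hibernation image"
a, b = ToyPool(image), ToyPool(image)
same_before = a.key == b.key

on_pm_event(a, "post_hibernation", boottime_ns=9_000_000_000, monotonic_ns=5_000_000_000)
on_pm_event(b, "post_hibernation", boottime_ns=7_500_000_000, monotonic_ns=5_000_000_000)
same_after = a.key == b.key

print(same_before, same_after)
```

[Timestamps alone only separate the two streams; anyone who knows the
saved state and can guess the timing can recompute the new key, which is
why the paragraph above also leans on RDSEED/RDRAND values at reseed
time.]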
>>
>> I did this years ago in WireGuard -- clearing key material before
>> suspend -- and there are some details around autosuspend (see
>> wg_pm_notification() in drivers/net/wireguard/device.c), but it's not
>> that hard to get right, so I'll give it a stab and send a patch.
>>
>> > But if the attacker can actually obtain internal state from one
>> > reconstituted VM, and use that to attack another reconstituted VM, and
>> > the attacker also knows the nonce or time seed that was used so
>> > that different reconstituted VMs will have unique CRNG streams, this
>> > might be a place where the "premature next" attack might come into
>> > play.
>>
>> This is the place where it matters, I guess. It's also where the
>> tradeoffs from Nadia's argument come into play. System state gets
>> compromised during VM migration / hibernation. It comes back online and
>> starts doling out compromised random numbers. Worst case scenario is
>> there's no RDRAND or vmgenid or virtio-rng, and we've just got the good
>> old interrupt handler mangling cycle counters. Choices: A) recover from
>> the compromise /slowly/ in order to mitigate premature next, or B)
>> recover from the compromise /quickly/ in order to prevent things like
>> nonce reuse.
>>
>> What is more likely? That an attacker who compromised this state at one
>> point in time doesn't have the means to do it again elsewhere in the
>> pipeline, will use a high bandwidth /dev/urandom output stream to mount
>> a premature next attack, and is going after a high value target that
>> inexplicably doesn't have RDRAND/vmgenid/virtio-rng enabled? Or that
>> Nadia's group (or that large building in Utah) will get an Internet tap
>> and simply start looking for repeated nonces to break?
>>
>> Jason
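
[To make the nonce-reuse side of that last question concrete, here is a
toy model, assuming a simple hash-chain generator that is NOT the
kernel's crng. `nonce_stream`, `reseeded`, and `repeated_nonces` are
hypothetical names, and the per-clone values stand in for whatever
vmgenid, RDRAND, or virtio-rng would supply. It only shows what a passive
observer sees when two clones of one snapshot emit output with and
without fresh input.]

```python
import hashlib
from collections import Counter


def nonce_stream(state: bytes, count: int) -> list[bytes]:
    """Toy deterministic generator: 12-byte nonces from a hash chain."""
    nonces = []
    for _ in range(count):
        nonces.append(hashlib.blake2s(state, person=b"out").digest()[:12])
        state = hashlib.blake2s(state, person=b"next").digest()
    return nonces


def reseeded(state: bytes, fresh: bytes) -> bytes:
    # "Recover quickly": fold fresh input into the state before any output.
    return hashlib.blake2s(state + fresh).digest()


def repeated_nonces(observed: list[bytes]) -> int:
    # What an observer on the wire can do: count values seen more than once.
    return sum(1 for n in Counter(observed).values() if n > 1)


snapshot = b"state captured in a VM snapshot"

# Two clones resume from the same snapshot and use their output as-is.
without_fresh_input = nonce_stream(snapshot, 4) + nonce_stream(snapshot, 4)

# Two clones each mix in a distinct value on resume before any output.
with_fresh_input = (nonce_stream(reseeded(snapshot, b"clone 1"), 4)
                    + nonce_stream(reseeded(snapshot, b"clone 2"), 4))

print(repeated_nonces(without_fresh_input), repeated_nonces(with_fresh_input))
```

[Nothing here models the premature next attack itself; the sketch covers
only the duplicated-output risk that choice B) is meant to address.]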