On Tue, Mar 01, 2022 at 04:42:47PM +0100, Jason A. Donenfeld wrote:
> Hey folks,
>
> Having finally wrapped up development of the initial vmgenid driver, I thought I'd pull together some thoughts on vmgenid, notification, and propagation, from disjointed conversations I've had with a few of you over the last several weeks.
>
> The basic problem is: VMs can be cloned, forked, rewound, or snapshotted, and when this happens, a) the RNG needs to reseed itself, and b) cryptographic algorithms that are not reuse resistant need to reinitialize in one way or another. For 5.18, we're handling (a) via the new vmgenid driver, which implements a spec from Microsoft, whereby the driver receives ACPI notifications when a 16-byte unique value changes.
>
> The vmgenid driver basically works, though it is racy, because that ACPI notification can arrive after the system is already running again. This race is even worse on Windows, where they kick the notification into a worker thread, which then publishes it upward elsewhere to another async mechanism, and eventually it hits the RNG and various userspace apps. On Linux it's not that bad -- we reseed immediately upon receiving the notification -- but it still inherits this same "push"-model deficiency, which a "pull"-model would not have.
>
> If we had a "pull" model, rather than just expose a 16-byte unique identifier, the vmgenid virtual hardware would _also_ expose a word-sized generation counter, which would be incremented every time the unique ID changed. Then, every time we would touch the RNG, we'd simply do an inexpensive check of this memremap()'d integer, and reinitialize with the unique ID if the integer changed. In this way, the race would be entirely eliminated. We would then be able to propagate this outwards to other drivers, by just exporting an extern symbol, in the manner of `jiffies`, and propagate it upwards to userspace, by putting it in the vDSO, in the manner of gettimeofday. And like that, there'd be no terrible async thing and things would work pretty easily.

I am not sure what the difference is, though. We have a 16-byte unique value, and you would prefer a dword counter; how is the former not a superset of the latter? I'm not sure how safe it is to expose it to userspace specifically, but the rest of the text talks about exposing it to a kernel driver, so maybe that's not an issue? So what makes interrupt-driven notification required, and why not just remap and read the existing vmgenid in a pull manner? What did I miss?

> But that's not what we have, because Microsoft didn't collaborate with anybody on this, and now it's implemented in several hypervisors. Given that I'm already spending considerable time working on the RNG, entirely without funding, somehow I'm not super motivated to lead a cross-industry political effort to change Microsoft's vmgenid spec. Maybe somebody else has an appetite for this, but either way, those changes would be several years off at best.
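
To make my question above concrete, here is a rough, untested sketch of the kind of pull-style check I have in mind, reusing the 16-byte mapping the driver already sets up at probe time. The names vmgenid_map, vmgenid_cached and vmgenid_pull_check() are invented for illustration, and locking/ordering is ignored:

/*
 * Untested, illustrative only.  vmgenid_map / vmgenid_cached /
 * vmgenid_pull_check() are invented names; the real driver would map
 * the hypervisor's 16-byte ID from its ACPI-provided address, as
 * drivers/virt/vmgenid.c already does at probe time.
 */
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/random.h>
#include <linux/string.h>
#include <linux/types.h>

#define VMGENID_SIZE 16

static void *vmgenid_map;                /* memremap()'d generation ID */
static u8 vmgenid_cached[VMGENID_SIZE];  /* last value we acted on */

/* One-time setup, e.g. from the ACPI probe path. */
static int vmgenid_pull_setup(phys_addr_t phys)
{
	vmgenid_map = memremap(phys, VMGENID_SIZE, MEMREMAP_WB);
	if (!vmgenid_map)
		return -ENOMEM;
	memcpy(vmgenid_cached, vmgenid_map, VMGENID_SIZE);
	return 0;
}

/*
 * Cheap check callers could do on the hot path, instead of (or in
 * addition to) waiting for the ACPI notification to arrive.
 */
static void vmgenid_pull_check(void)
{
	u8 now[VMGENID_SIZE];

	memcpy(now, vmgenid_map, VMGENID_SIZE);
	if (!memcmp(now, vmgenid_cached, VMGENID_SIZE))
		return;

	memcpy(vmgenid_cached, now, VMGENID_SIZE);
	add_vmfork_randomness(now, VMGENID_SIZE);
}

A 16-byte memcmp on a mapped page is hardly more expensive than reading a 4-byte counter, which is why I don't see what the separate generation counter buys us here.
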
> So given we have a "push"-model mechanism, there are two problems to tackle, perhaps in the same way, perhaps in a different way:
>
> A) Outwards propagation toward other kernel drivers: in this case, I have in mind WireGuard, naturally, which very much needs to clear its existing sessions when VMs are forked.
>
> B) Upwards propagation to userspace: in this case, we handle the concerns of the Amazon engineers on this thread who broached this topic a few years ago, in which s2n, their TLS library, wants to reinitialize its userspace RNG (a silly thing, but I digress) and probably clear session keys too, for the same good reason as WireGuard.
>
> For (A), at least wearing my WireGuard-maintainer hat, there is an easy way and there is a "race-free" way. I use scare quotes there because we're still in a "push"-model, which means it's still racy no matter what.
>
> The faux "race-free" way involves having `extern u32 rng_vm_generation;` or similar in random.h, and then everything that generates a session key would snapshot this value, and every time a session key is used, a comparison would be made. This works, but given that we're going to be racy no matter what, I think I'd prefer avoiding the extra code in the hot path and extra per-session storage. It seems like that'd involve a lot of fiddly engineering for no real world benefit.
>
> The easy way, and the way that I think I prefer, would be to just have a sync notifier_block for this, just like we have with register_pm_notifier(). From my perspective, it'd be simplest to just piggyback on the already existing PM notifier with an extra event, PM_POST_VMFORK, which would join the existing set of 7, following PM_POST_RESTORE. I think that'd be coherent. However, if the PM people don't want to play ball, we could always come up with our own notifier_block. But I don't see the need. Plus, WireGuard *already* uses the PM notifier for clearing keys, so code-wise for my use case, that'd amount to adding another case for PM_POST_VMFORK, in addition to the currently existing PM_HIBERNATION_PREPARE and PM_SUSPEND_PREPARE cases, which all would be treated the same way. Ezpz. So if that sounds like an interesting thing to the PM people, I think I'd like to propose a patch for that, possibly even for 5.18, given that it'd be very straightforward.
>
> For (B), it's a little bit trickier. But I think our options follow the same rubric. We can expose a generation counter in the vDSO, with semantics akin to the extern integer I described above. Or we could expose that counter in a file that userspace could poll() on and receive notifications that way. Or perhaps a third way. I'm all ears here. Alex's team from Amazon last year proposed something similar to the vDSO idea, except using mmap on a sysfs file, though from what I can tell, that wound up being kind of complicated. Due to the fact that we're _already_ racy, I think I'm most inclined at this point toward the poll() approach for the same reasons as I prefer a notifier_block. But on userspace I could be convinced otherwise, and I'd be interested in totally different ideas here too.
>
> Another thing I should note is that, while I'm not currently leaning toward it, the vDSO approach also ties into interesting discussions about userspace RNGs (generally a silly idea), and their need for things like fork detection and also learning when the kernel RNG was last reseeded. So cracking open the vDSO book might invite all sorts of other interesting questions and discussions, which may be productive or may be a humongous distraction. (Also, again, I'm not super enthusiastic about userspace RNGs.)
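
For (A) above, the notifier shape seems easy enough to picture. Below is a minimal sketch under the assumption that a PM_POST_VMFORK event gets added -- it does not exist today, and the value used here is invented -- with clear_session_keys() standing in for whatever a driver like WireGuard already does for the suspend cases:

/*
 * Hypothetical sketch of the "easy way" for (A).  PM_POST_VMFORK is the
 * proposed new event, with an invented value here, and
 * clear_session_keys() is a stand-in for driver-specific teardown.
 */
#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/suspend.h>

#ifndef PM_POST_VMFORK
#define PM_POST_VMFORK 0x0007	/* hypothetical: would follow PM_POST_RESTORE */
#endif

static void clear_session_keys(void)
{
	/* Driver-specific: tear down handshakes, zero session keys, etc. */
}

static int vmfork_pm_notification(struct notifier_block *nb,
				  unsigned long action, void *data)
{
	switch (action) {
	case PM_HIBERNATION_PREPARE:
	case PM_SUSPEND_PREPARE:
	case PM_POST_VMFORK:
		/* All three get the same treatment: drop existing sessions. */
		clear_session_keys();
		break;
	}
	return 0;
}

static struct notifier_block vmfork_pm_notifier = {
	.notifier_call = vmfork_pm_notification,
};

static int __init vmfork_pm_example_init(void)
{
	return register_pm_notifier(&vmfork_pm_notifier);
}
late_initcall(vmfork_pm_example_init);

If the PM people don't want the extra event, the same callback could just as well hang off a dedicated notifier chain.
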
> Also, there is an interesting question to decide with regards to userspace, which is whether the vmgenid driver should expose its unique ID to userspace, as Alex requested on an earlier thread. I am actually sort of opposed to this. That unique ID may or may not be secret and entropic; if it isn't, the crypto is designed to not be impacted negatively, but if it is, we should keep it secret. So, rather, I think the correct flow is that userspace simply calls getrandom() upon learning that the VM forked, which is guaranteed to have been reinitialized already by add_vmfork_randomness(), and that will guarantee a value that is unique to the VM, without having to actually expose that value.
>
> So, anyway, this is more or less where my thinking on this matter is. Would be happy to hear some fresh ideas here too.
>
> Regards,
> Jason
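
And for completeness, here is roughly what the poll() flavour of (B) could look like from the application side. This is purely illustrative: the sysfs path and rekey() are invented, and the only interface used here that actually exists is getrandom():

/*
 * Userspace sketch only.  The notification file is hypothetical; the
 * point is the flow described above: wait for "the VM forked", then ask
 * getrandom() for fresh material rather than reading the unique ID.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/random.h>
#include <unistd.h>

static void rekey(const unsigned char *seed, size_t len)
{
	/* Application-specific: reset session keys, userspace RNG state, ... */
	(void)seed;
	(void)len;
}

int main(void)
{
	/* Hypothetical sysfs attribute the vmgenid driver might notify on. */
	const char *path = "/sys/devices/virtual/misc/vmgenid/notify";
	unsigned char seed[32];
	char buf[16];
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Prime the attribute so the first notification wakes us up. */
	read(fd, buf, sizeof(buf));

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLPRI };

		if (poll(&pfd, 1, -1) < 0)
			break;

		/* Rearm for the next notification. */
		lseek(fd, 0, SEEK_SET);
		read(fd, buf, sizeof(buf));

		/*
		 * The VM forked.  Per the mail above, the kernel RNG has
		 * already been reseeded via add_vmfork_randomness(), so
		 * this output is unique to the clone.
		 */
		if (getrandom(seed, sizeof(seed), 0) == (ssize_t)sizeof(seed))
			rekey(seed, sizeof(seed));
	}
	close(fd);
	return 0;
}

The only part of this that exists today is getrandom(); the waiting side would be whatever interface ends up being chosen for (B).
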