Re: propagating vmgenid outward and upward

Alexander Graf <graf@xxxxxxxxxx> · Wed, 9 Mar 2022 11:10:15 +0100

On 01.03.22 16:42, Jason A. Donenfeld wrote:

Hey folks,

Having finally wrapped up development of the initial vmgenid driver, I
thought I'd pull together some thoughts on vmgenid, notification, and
propagating, from disjointed conversations I've had with a few of you
over the last several weeks.

The basic problem is: VMs can be cloned, forked, rewound, or
snapshotted, and when this happens, a) the RNG needs to reseed itself,
and b) cryptographic algorithms that are not reuse resistant need to
reinitialize in one way or another. For 5.18, we're handling (a) via the
new vmgenid driver, which implements a spec from Microsoft, whereby the
driver receives ACPI notifications when a 16 byte unique value changes.

The vmgenid driver basically works, though it is racy, because that ACPI
notification can arrive after the system is already running again. This

I believe enough people already pointed out that this assumption is 
incorrect. The thing that is racy about VMGenID is the interrupt based 
notification. The actual identifier is updated before the VM resumes 
from its clone operation, so if you match on that you will know whether 
you are in a new or old world. And that is enough to create 
transactions: Save the identifier before a "crypto transaction", 
validate before you finish, if they don't match, abort, reseed and replay.
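
A rough sketch of that save/validate/replay pattern, as a
self-contained userspace model. The `vm_genid` buffer stands in for the
hypervisor-updated 16-byte identifier, and `run_transaction()`,
`demo_op()` and `demo_reseed()` are hypothetical names used only for
illustration, not existing kernel or library APIs.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GENID_LEN 16

/* Stand-in for the hypervisor-updated 16-byte identifier. A real consumer
 * would read a mapping of the vmgenid memory instead of a plain buffer. */
static volatile uint8_t vm_genid[GENID_LEN];

static void genid_read(uint8_t out[GENID_LEN])
{
	for (int i = 0; i < GENID_LEN; i++)
		out[i] = vm_genid[i];
}

/* Hypothetical callback type for the protected operation and the reseed step. */
typedef void (*txn_fn)(void *ctx);

/* Snapshot the ID, run op(), re-check the ID; on mismatch reseed and replay.
 * Returns the number of replays needed, or -1 if every attempt raced. */
static int run_transaction(txn_fn op, txn_fn reseed, void *ctx, int max_tries)
{
	uint8_t before[GENID_LEN], after[GENID_LEN];

	for (int attempt = 0; attempt < max_tries; attempt++) {
		genid_read(before);
		op(ctx);
		genid_read(after);
		if (memcmp(before, after, GENID_LEN) == 0)
			return attempt;	/* same world: safe to commit */
		reseed(ctx);		/* new world: discard, reseed, replay */
	}
	return -1;
}

/* Demo helpers: the first operation pretends the VM was cloned mid-flight. */
struct demo_ctx {
	int ops;
	int reseeds;
	bool clone_during_first_op;
};

static void demo_op(void *p)
{
	struct demo_ctx *c = p;

	c->ops++;
	if (c->clone_during_first_op && c->ops == 1)
		vm_genid[0] ^= 0xff;	/* simulated ID change */
}

static void demo_reseed(void *p)
{
	((struct demo_ctx *)p)->reseeds++;
}
```

With `clone_during_first_op` set, the first attempt is discarded and the
second one commits, so the caller sees exactly one replay.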

race is even worse on Windows, where they kick the notification into a
worker thread, which then publishes it upward elsewhere to another async
mechanism, and eventually it hits the RNG and various userspace apps.
On Linux it's not that bad -- we reseed immediately upon receiving the
notification -- but it still inherits this same "push"-model deficiency,
which a "pull"-model would not have.

If we had a "pull" model, rather than just expose a 16-byte unique
identifier, the vmgenid virtual hardware would _also_ expose a
word-sized generation counter, which would be incremented every time the
unique ID changed. Then, every time we would touch the RNG, we'd simply
do an inexpensive check of this memremap()'d integer, and reinitialize
with the unique ID if the integer changed. In this way, the race would
be entirely eliminated. We would then be able to propagate this outwards
to other drivers, by just exporting an extern symbol, in the manner of
`jiffies`, and propagate it upwards to userspace, by putting it in the
vDSO, in the manner of gettimeofday. And like that, there'd be no
terrible async thing and things would work pretty easily.
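
To make the "pull" idea concrete, here is a minimal userspace sketch.
It assumes a device page layout with a word-sized `generation` field
next to the unique ID; no such counter exists in the current spec, and
`struct vmgenid_page`, `struct rng_state` and `rng_check_generation()`
are made-up names. The XOR mixing is only a placeholder for whatever a
real RNG would do with the ID.

```c
#include <stdint.h>

/* Hypothetical device page: the existing 16-byte ID plus the wished-for
 * generation counter. */
struct vmgenid_page {
	uint8_t unique_id[16];
	volatile uint32_t generation;
};

struct rng_state {
	uint32_t seen_generation;
	uint8_t key[16];
	unsigned int reseeds;
};

/* Called at the top of every RNG operation: one integer compare in the
 * common case, a reseed from the unique ID only when the counter moved. */
static void rng_check_generation(struct rng_state *rng,
				 const struct vmgenid_page *hw)
{
	uint32_t gen = hw->generation;

	if (gen == rng->seen_generation)
		return;

	/* Placeholder mixing step, not a real reseed construction. */
	for (int i = 0; i < 16; i++)
		rng->key[i] ^= hw->unique_id[i];

	rng->seen_generation = gen;
	rng->reseeds++;
}
```

Because the check happens synchronously on every use, there is no
window in which stale state can be consumed after the counter changes.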

But that's not what we have, because Microsoft didn't collaborate with
anybody on this, and now it's implemented in several hypervisors. Given
that I'm already spending considerable time working on the RNG, entirely
without funding, somehow I'm not super motivated to lead a
cross-industry political effort to change Microsoft's vmgenid spec.
Maybe somebody else has an appetite for this, but either way, those
changes would be several years off at best.

So given we have a "push"-model mechanism, there are two problems to
tackle, perhaps in the same way, perhaps in a different way:

A) Outwards propagation toward other kernel drivers: in this case, I
    have in mind WireGuard, naturally, which very much needs to clear its
    existing sessions when VMs are forked.

B) Upwards propagation to userspace: in this case, we handle the
    concerns of the Amazon engineers on this thread who broached this
    topic a few years ago, in which s2n, their TLS library, wants to
    reinitialize its userspace RNG (a silly thing, but I digress) and
    probably clear session keys too, for the same good reason as
    WireGuard.

For (A), at least wearing my WireGuard-maintainer hat, there is an easy
way and there is a "race-free" way. I use scare quotes there because
we're still in a "push"-model, which means it's still racy no matter
what.

The faux "race-free" way involves having `extern u32 rng_vm_generation;`
or similar in random.h, and then everything that generates a session key
would snapshot this value, and every time a session key is used, a
comparison would be made. This works, but given that we're going to be
racy no matter what, I think I'd prefer avoiding the extra code in the
hot path and extra per-session storage. It seems like that'd involve a
lot of fiddly engineering for no real world benefit.
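
For reference, the snapshot-and-compare scheme would look roughly like
this. It is a userspace model: `rng_vm_generation` here is a local
stand-in for the proposed extern, and `struct session`,
`session_init()` and `session_key_usable()` are illustrative names
rather than WireGuard code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for the proposed `extern u32 rng_vm_generation;`. */
static uint32_t rng_vm_generation;

struct session {
	uint8_t key[32];
	uint32_t vm_generation;	/* the extra per-session storage */
};

static void session_init(struct session *s)
{
	/* Key derivation is elided; only the snapshot matters here. */
	memset(s->key, 0, sizeof(s->key));
	s->vm_generation = rng_vm_generation;
}

/* The extra hot-path comparison: a key created before a fork is stale. */
static bool session_key_usable(const struct session *s)
{
	return s->vm_generation == rng_vm_generation;
}
```

Every encrypt or decrypt path would have to call
`session_key_usable()` first, which is the hot-path cost referred to
above.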

The easy way, and the way that I think I prefer, would be to just have a
sync notifier_block for this, just like we have with
register_pm_notifier(). From my perspective, it'd be simplest to just
piggy back on the already existing PM notifier with an extra event,
PM_POST_VMFORK, which would join the existing set of 7, following
PM_POST_RESTORE. I think that'd be coherent. However, if the PM people
don't want to play ball, we could always come up with our own
notifier_block. But I don't see the need. Plus, WireGuard *already*
uses the PM notifier for clearing keys, so code-wise for my use case,
that'd amount to adding another case for PM_POST_VMFORK, in addition to the
currently existing PM_HIBERNATION_PREPARE and PM_SUSPEND_PREPARE cases,
which all would be treated the same way. Ezpz. So if that sounds like an
interesting thing to the PM people, I think I'd like to propose a patch
for that, possibly even for 5.18, given that it'd be very
straightforward.
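
The shape of that handler can be sketched with a small userspace model
of a notifier chain. Nothing below is kernel code: the event values are
illustrative, PM_POST_VMFORK is the proposed event and does not exist,
and `register_pm_notifier()` / `pm_notifier_call_chain()` are simplified
stand-ins that only borrow the kernel's names so the mapping is easy to
follow.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative event codes; PM_POST_VMFORK is the proposed addition. */
enum pm_event {
	PM_HIBERNATION_PREPARE = 1,
	PM_SUSPEND_PREPARE,
	PM_POST_RESTORE,
	PM_POST_VMFORK,
};

#define NOTIFY_DONE 0
#define NOTIFY_OK   1

/* Minimal model of a notifier chain, not the kernel's implementation. */
struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long event, void *data);
	struct notifier_block *next;
};

static struct notifier_block *pm_chain;

static int register_pm_notifier(struct notifier_block *nb)
{
	nb->next = pm_chain;
	pm_chain = nb;
	return 0;
}

/* Synchronously invoke every registered callback for this event. */
static void pm_notifier_call_chain(unsigned long event)
{
	for (struct notifier_block *nb = pm_chain; nb; nb = nb->next)
		nb->notifier_call(nb, event, NULL);
}

static bool keys_present = true;

static int demo_pm_notification(struct notifier_block *nb,
				unsigned long event, void *data)
{
	(void)nb;
	(void)data;

	switch (event) {
	case PM_HIBERNATION_PREPARE:
	case PM_SUSPEND_PREPARE:
	case PM_POST_VMFORK:		/* the one new case */
		keys_present = false;	/* stand-in for clearing session keys */
		return NOTIFY_OK;
	default:
		return NOTIFY_DONE;
	}
}

static struct notifier_block demo_nb = {
	.notifier_call = demo_pm_notification,
};
```

The point of the sketch is that the consumer side changes by a single
`case` label; all three events share the same key-clearing path.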

For (B), it's a little bit trickier. But I think our options follow the
same rubric. We can expose a generation counter in the vDSO, with
semantics akin to the extern integer I described above. Or we could
expose that counter in a file that userspace could poll() on and receive
notifications that way. Or perhaps a third way. I'm all ears here.
Alex's team from Amazon last year proposed something similar to the vDSO
idea, except using mmap on a sysfs file, though from what I can tell,
that wound up being kind of complicated. Due to the fact that we're
_already_ racy, I think I'm most inclined at this point toward the
poll() approach for the same reasons as I prefer a notifier_block. But
on userspace I could be convinced otherwise, and I'd be interested in
totally different ideas here too.
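
As a sketch of what the poll() variant might look like from the
userspace side: the file, the event bits it would raise, and the
assumption that a read yields a native-endian 32-bit counter are all
hypothetical, as are the helper names `wait_for_vm_fork()` and
`read_generation()`. Only poll() and read() themselves are real.

```c
#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Block for up to timeout_ms until the (hypothetical) generation file
 * behind fd signals a change. Returns false on timeout or error. */
static bool wait_for_vm_fork(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLPRI };
	int ret = poll(&pfd, 1, timeout_ms);

	return ret > 0 && (pfd.revents & (POLLIN | POLLPRI)) != 0;
}

/* After a wakeup, fetch the new counter value. The on-disk format is an
 * assumption made purely for this example. */
static bool read_generation(int fd, uint32_t *gen)
{
	return read(fd, gen, sizeof(*gen)) == (ssize_t)sizeof(*gen);
}
```

A daemon would typically add this fd to its existing event loop and
rekey when it fires, rather than dedicating a thread to it.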

Another thing I should note is that, while I'm not currently leaning
toward it, the vDSO approach also ties into interesting discussions
about userspace RNGs (generally a silly idea), and their need for things
like fork detection and also learning when the kernel RNG was last
reseeded. So cracking open the vDSO book might invite all sorts of other
interesting questions and discussions, which may be productive or may be
a humongous distraction. (Also, again, I'm not super enthusiastic about
userspace RNGs.)

Also, there is an interesting question to decide with regards to
userspace, which is whether the vmgenid driver should expose its unique
ID to userspace, as Alex requested on an earlier thread. I am actually
sort of opposed to this. That unique ID may or may not be secret and
entropic; if it isn't, the crypto is designed to not be impacted
negatively, but if it is, we should keep it secret. So, rather, I think
the correct flow is that userspace simply calls getrandom() upon
learning that the VM forked, which is guaranteed to have been
reinitialized already by add_vmfork_randomness(), and that will
guarantee a value that is unique to the VM, without having to actually
expose that value.
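
In code, that flow is just a getrandom() call made after whatever
notification userspace ends up receiving. A minimal sketch, where
`rekey_after_vm_fork()` is a hypothetical helper name and the delivery
of the notification itself is left out:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/* Refill a userspace secret from the kernel RNG once a VM fork has been
 * reported. Returns 0 on success, -1 on failure. */
static int rekey_after_vm_fork(unsigned char *key, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = getrandom(key + done, len - done, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			return -1;
		}
		done += (size_t)n;
	}
	return 0;
}
```

The unique ID never has to leave the kernel; userspace only sees output
that already depends on it.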

If you follow the logic at the beginning of the mail, you can create 
something race free if you consume the hardware VMGenID counter. You 
cannot make it race free if you rely on the interrupt mechanism.

So following that train of thought, if you expose the hardware VMGenID 
to user space, you could allow user space to act race free based on 
VMGenID. That means consumers of user space RNGs could validate whether 
the ID is identical between the beginning of the crypto operation and 
the end.

That said, there are 2 pieces to the puzzle of user space notification: 
Polling and event based. The part above solves the polling use cases - 
user space libraries that just want to know whether they are now in a 
new world.

However, there are more complicated cases as well. What do you do with 
Samba for example? It needs to generate a new SID after the clone. 
That's a super heavy operation. Do you want to have smbd constantly poll 
on the VMGenID just to see whether it needs to kick off some 
administrative actions?

For the event based approach, we're in the same boat as "S3 resume" - we 
need a global notification mechanism that the state of the system 
changed and act accordingly. That's where the systemd proposal[1] comes 
in: Create inhibitors and scriptlets that get spawned when we want to 
suspend and then resume-cloned later. I'm personally even ok if we just 
limit that whole use case to cloning while you're in S3 only.

In that case, all we would need from the kernel is an easily readable 
GenID that changes before systemd runs again after suspend: Systemd 
wakes up after resume, checks if the GenID changed and if so, invokes 
the unquiescing target after the resume one.

For this particular use case we're not in the fast path, so we could 
make GenID reading a syscall which checks against VMGenID. But that 
won't cut it for the polling use case.

I'm also not a super big fan of putting all that logic into systemd. It 
means applications need to create their own notification mechanisms to 
pass that cloning notification into actual processes. Don't we have any 
mechanism that applications and libraries could use to natively get an 
event when the GenID changes?

Alex

[1] https://github.com/systemd/systemd/issues/20222

So, anyway, this is more or less where my thinking on this matter is.
Would be happy to hear some fresh ideas here too.

Regards,
Jason
