Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

Alexander Graf <graf@xxxxxxxxx> · Sat, 17 Oct 2020 20:09:06 +0200

Hi Jason,

On 17.10.20 15:24, Jason A. Donenfeld wrote:

After discussing this offline with Jann a bit, I have a few general
comments on the design of this.

First, the UUID communicated by the hypervisor should be consumed by
the kernel -- added as another input to the rng -- and then userspace

We definitely want a kernel internal notifier as well, yes :).

should be notified that it should reseed any userspace RNGs that it
may have, without actually communicating that UUID to userspace. IOW,

I also tend to agree that it makes sense to disconnect the actual UUID 
we receive from the notification to user space. This would allow us to 
create a generic mechanism for VM save/restore cycles across different 
hypervisors. Let me add PPC and s390x people to the CC list to see 
whether they have anything remotely similar to the VmGenID mechanism. 
For x86 and aarch64, the ACPI and memory based VmGenID implemented here 
is the most obvious option to implement IMHO. It's also already 
implemented in all major hypervisors.

I agree with Jann there. Then, it's the functioning of this
notification mechanism to userspace that is interesting to me.

Absolutely! Please have a look at the previous discussion here:

https://lore.kernel.org/linux-pm/B7793B7A-3660-4769-9B9A-FFCF250728BB@xxxxxxxxxx/

The user space interface is absolutely what this is about.

There are a few design goals of notifying userspace: it should be
fast, because people who are using userspace RNGs are usually doing so
in the first place to completely avoid syscall overhead for whatever
high performance application they have - e.g. I recall conversations
with Colm about his TLS implementation needing to make random IVs
_really_ fast. It should also happen as early as possible, with no
race or as minimal as possible race window, so that userspace doesn't
begin using old randomness and then switch over after the damage is
already done.

There are multiple facets and different types of consumers here. For a 
user space RNG, I agree that fast and as race free as possible is key. 
That's what the mmap interface is there for.
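
To make the fast path concrete, here is a minimal sketch of how a user space RNG might use such a mapping. It assumes the mapped page exposes a 32-bit generation counter; the struct, the function names and the counter layout are all hypothetical and not taken from the patch. A real RNG would take the fresh seed from getrandom() instead of a caller-supplied value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user space RNG state; none of these names come from the patch. */
struct urng {
    _Atomic uint32_t *live_gen; /* counter in the shared mapping (assumed layout) */
    uint32_t seen_gen;          /* generation the current seed belongs to */
    uint64_t state;             /* stand-in for real RNG state */
    unsigned int reseeds;       /* how often the state was thrown away */
};

/* Bind the RNG to the live counter and remember the generation it was seeded in. */
static void urng_init(struct urng *r, _Atomic uint32_t *live_gen, uint64_t seed)
{
    assert(r && live_gen);
    r->live_gen = live_gen;
    r->seen_gen = atomic_load_explicit(live_gen, memory_order_acquire);
    r->state = seed;
    r->reseeds = 0;
}

/*
 * Call before every use of r->state. One atomic load, no syscall.
 * Returns true if the generation moved and the state was replaced.
 */
static bool urng_check_generation(struct urng *r, uint64_t fresh_seed)
{
    uint32_t now = atomic_load_explicit(r->live_gen, memory_order_acquire);

    if (now == r->seen_gen)
        return false;

    r->state = fresh_seed;
    r->seen_gen = now;
    r->reseeds++;
    return true;
}
```

A check like this narrows the race window but does not close it: a snapshot can still land between the check and the use of the state.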

There are applications way beyond that though. What do you do with 
applications that already consumed randomness? For example a cached pool 
of SSL keys. Or a higher level language primitive that consumes 
randomness and caches its seed somewhere in an internal data structure. 
Or even worse: your system's host ssh key.

For those types of events, an mmap (or vDSO) interface does not work. We 
need to actively allow user space applications to readjust to the new 
environment - either internally (the language primitive case) or through 
a system event, maybe even as systemd trigger (the ssh host key case).

To give everyone enough time before we consider a system as "updated to 
the new environment", we have the callback logic with the "Orchestrator" 
that can check whether all listeners to system wide updates confirm 
they adjusted themselves.

That's what the rest of the logic is there for: A read+poll interface 
and all of the orchestration logic. It's not for the user space RNG 
case, it's for all of its downstream users.
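
As a rough sketch of the slow path, a listener could block in poll() until the file descriptor becomes readable and then read the new generation. The helper name and the assumption that a read returns one 32-bit value are hypothetical and only for illustration; the actual format and the acknowledgement step towards the Orchestrator are defined by the patch, not by this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable, then read one 32-bit
 * generation value (assumed wire format, not taken from the patch).
 * Returns 1 if a value was read, 0 on timeout, -1 on error.
 */
static int wait_for_generation(int fd, int timeout_ms, uint32_t *gen_out)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    ssize_t n;
    int ret;

    assert(gen_out != NULL);

    /* Restart the wait if a signal interrupts it. */
    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret <= 0)
        return ret;             /* 0: timeout, -1: poll() failed */
    if (!(pfd.revents & POLLIN))
        return -1;              /* hangup or error without data */

    n = read(fd, gen_out, sizeof(*gen_out));
    return n == (ssize_t)sizeof(*gen_out) ? 1 : -1;
}
```

A service such as the ssh host key case would call this from its event loop and regenerate its derived data whenever it returns 1.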

I'm also not wedded to using Microsoft's proprietary hypervisor design
for this. If we come up with a better interface, I don't think it's
asking too much to implement that and reasonably expect for Microsoft
to catch up. Maybe someone here will find that controversial, but
whatever -- discussing ideal designs does not seem out of place or
inappropriate for how we usually approach things in the kernel, and a
closed source hypervisor coming along shouldn't disrupt that.

The main bonus point on this interface is that Hyper-V, VMware and QEMU 
implement it already. It would be a very natural fit into the ecosystem. 
I agree though that we shouldn't have our user space interface 
necessarily dictated by it: Other hypervisors may implement different 
ways such as a simple edge IRQ that gets triggered whenever the VM gets 
resumed.

So, anyway, here are a few options with some pros and cons for the
kernel notifying userspace that its RNG should reseed.

I can only stress again that we should not be laser focused on the RNG 
case. In a lot of cases, data has already been generated by the RNG 
before the snapshot and needs to be reinitialized after the snapshot. In 
other cases such as system UUIDs, it's completely orthogonal to the RNG.

1. SIGRND - a new signal. Lol.

Doable, but a lot of plumbing in user space. It's also not necessarily a 
good fit for event notification in most user space applications.

2. Userspace opens a file descriptor that it can epoll on. Pros are
that many notification mechanisms already use this. Cons is that this
requires syscall and might be more racy than we want. Another con is
that this is a new thing for userspace programs to do.

That's part of what this patch does, right? This patch implements 
read+poll as well as mmap() for high speed reads.

3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are
that this is extremely fast, and also simple to use and implement.
There are enough sequence points in typical crypto programs that
checking to see whether this counter has changed before doing whatever
operation seems easy enough. Cons are that typically we've been
conservative about adding things to the vDSO, and this is also a new
thing for userspace programs to do.

The big con is that its use is going to be super limited to applications 
that can be adapted to check their "vm generation" through a vDSO call / 
read every time they consume data that may potentially need to be 
regenerated.

This probably works for the pure RNG case. It falls apart for more 
sophisticated things such as "redo my ssh host keys and restart the 
service" or "regenerate my samba machine uuid".

4. We already have a mechanism for this kind of thing, because the
same issue comes up when fork()ing. The solution was MADV_WIPEONFORK,
where userspace marks a page to be zeroed when forking, for the
purposes of the RNG being notified when its world gets split in two.
This is basically the same thing as we're discussing here with guest
snapshots, except it's on the system level rather than the process
level, and a system has many processes. But the problem space is still
almost the same, and we could simply reuse that same mechanism. There
are a few implementation strategies for that:

Yup, that's where we started from :). And then we ran into resistance by 
the mm people (on CC here). And then we looked at the problem more in 
depth and checked what it would take to for example implement this for 
user space RNGs in Java. It's ... more complicated than one may think at 
first.

4a. We mess with the PTEs of all processes' pages that are
MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us
to do so. Then we wind up reusing the already existing logic for
userspace RNGs. Cons might be that this usually requires semaphores,
and we're in irq context, so we'd have to hoist to a workqueue, which
means either more wake up latency, or a larger race window.

4b. We just memzero all processes' pages that are MADV_WIPEONFORK,
when the hypervisor notifies us to do so. Then we wind up reusing the
already existing logic for userspace RNGs.

4c. The guest kernel maintains an array of physical addresses that are
MADV_WIPEONFORK. The hypervisor knows about this array and its
location through whatever protocol, and before resuming a
moved/snapshotted/duplicated VM, it takes the responsibility for
memzeroing this memory. The huge pro here would be that this
eliminates all races, and reduces complexity quite a bit, because the
hypervisor can perfectly synchronize its bringup (and SMP bringup)
with this, and it can even optimize things like on-disk memory
snapshots to simply not write out those pages to disk.

A 4c-like approach seems like it'd be a lot of bang for the buck -- we
reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
userspace API to deal with, and it'd be race free, and eliminate a lot
of kernel complexity.

But 4b and 3 don't seem too bad either.

Any thoughts on 4c? Is that utterly insane, or does that actually get
us somewhere close to what we want?

All of the options for "4" are possible and have an RFC out. Please 
check out the discussion linked above :).
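
For readers who have not used it, this is roughly what the existing fork-time behaviour of MADV_WIPEONFORK looks like from user space. It is a sketch of the mechanism as it exists today (Linux 4.14 and later), not of any of the 4a/4b/4c proposals; the helper name and the marker value are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18 /* asm-generic value, for older libc headers */
#endif

/*
 * Returns 1 if the child saw the page zeroed, 0 if it did not,
 * -1 if the setup failed (for example on a kernel without MADV_WIPEONFORK).
 */
static int wipeonfork_child_sees_zero(void)
{
    const uint64_t marker = 0xfeedfacecafebeefULL; /* "RNG is seeded" */
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    uint64_t *page;
    int status;
    pid_t pid;

    page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    if (madvise(page, len, MADV_WIPEONFORK) != 0) {
        munmap(page, len);
        return -1;
    }

    page[0] = marker;

    pid = fork();
    if (pid < 0) {
        munmap(page, len);
        return -1;
    }
    if (pid == 0)
        _exit(page[0] == 0 ? 0 : 1); /* child: the page should read as zero */

    if (waitpid(pid, &status, 0) < 0) {
        munmap(page, len);
        return -1;
    }
    assert(page[0] == marker); /* the parent's copy is untouched */
    munmap(page, len);

    if (!WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

A user space RNG that keeps its seeded flag in such a page finds it cleared in the child and reseeds. The options under "4" would trigger the same wipe on a VM snapshot/restore instead of on fork().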

The problem is that anything that relies on closed loop reads (options 
3+4) is not going to work well with the more sophisticated use case of 
derived data.

IMHO it will boil down to "both". We will need a high-speed interface 
that with close-to-0 overhead tells you either the generation ID or 
clears pages (options 3+4) as well as something that is bigger for 
applications that can either intrinsically (sshd) or by system design 
(Java) not adopt the mechanisms above easily.

That said, we need to start somewhere. I don't mind which angle we start 
from. But this is a real world problem and one that will only become 
more prevalent over time as VMs are used for more than only your 
traditional enterprise hardware consolidation.

Alex
