Re: [PATCH RFC v1 0/2] VM fork detection for RNG

Alexander Graf <graf@xxxxxxxxxx> · Thu, 24 Feb 2022 12:35:59 +0100

On 24.02.22 11:43, Daniel P. Berrangé wrote:
On Thu, Feb 24, 2022 at 09:53:59AM +0100, Alexander Graf wrote:
Hey Jason,

On 23.02.22 14:12, Jason A. Donenfeld wrote:
This small series picks up work from Amazon that seems to have stalled
out last year around this time: listening for the vmgenid ACPI
notification, and using it to "do something." Last year, that something
involved a complicated userspace mmap chardev, which seems fraught with
difficulty. This year, I have something much simpler in mind: simply
using those ACPI notifications to tell the RNG to reinitialize safely,
so we don't repeat random numbers in cloned, forked, or rolled-back VM
instances.

This series consists of two patches. The first is a rather
straightforward addition to random.c, which I feel fine about. The
second patch is the reason this is just an RFC: it's a cleanup of the
ACPI driver from last year, and I don't really have much experience
writing, testing, debugging, or maintaining these types of drivers.
Ideally this thread would yield somebody saying, "I see the intent of
this; I'm happy to take over ownership of this part." That way, I can
focus on the RNG part, and whoever steps up for the paravirt ACPI part
can focus on that.

As a final note, this series intentionally does _not_ focus on
notification of these events to userspace or to other kernel consumers.
Since these VM fork detection events first need to hit the RNG, we can
later talk about what sorts of notifications or mmap'd counters the RNG
should be making accessible to elsewhere. But that's a different sort of
project and ties into a lot of more complicated concerns beyond this
more basic patchset. So hopefully we can keep the discussion rather
focused here to this ACPI business.

The main problem with VMGenID is that it is inherently racy. There will
always be a (short) amount of time where the ACPI notification is not
processed, but the VM could use its RNG to, for example, establish TLS
connections.
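
To make the race concrete, here is a toy Python model. It is not the kernel's
RNG: `ToyRng`, `reseed_on_fork`, and the hash-chain construction are purely
illustrative assumptions. Two clones restored from the same snapshot emit
identical output until each one has processed its generation-change
notification.

```python
import copy
import hashlib


class ToyRng:
    """Deterministic hash-chain generator standing in for a guest RNG (illustrative only)."""

    def __init__(self, seed: bytes) -> None:
        self.state = hashlib.sha256(seed).digest()

    def next_bytes(self) -> bytes:
        # Derive an output block from the current state, then advance the state.
        out = hashlib.sha256(b"out" + self.state).digest()
        self.state = hashlib.sha256(b"next" + self.state).digest()
        return out

    def reseed_on_fork(self, generation_id: bytes) -> None:
        # Hypothetical notification handler: mix the new generation ID into the state.
        self.state = hashlib.sha256(self.state + generation_id).digest()


original = ToyRng(b"boot-time entropy")
original.next_bytes()

# Snapshot: both clones start from a byte-identical RNG state.
clone_a = copy.deepcopy(original)
clone_b = copy.deepcopy(original)

# Race window: the notification has not been processed yet, so outputs collide.
collides_before = clone_a.next_bytes() == clone_b.next_bytes()

# Each clone then processes its notification with a distinct generation ID.
clone_a.reseed_on_fork(b"generation-id-A")
clone_b.reseed_on_fork(b"generation-id-B")
collides_after = clone_a.next_bytes() == clone_b.next_bytes()

print(collides_before, collides_after)
```

Anything a clone derives from the RNG inside that window, such as a TLS
nonce, is shared with its sibling; reseeding afterwards only protects later
output.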

Hence we as the next step proposed a multi-stage quiesce/resume mechanism
where the system is aware that it is going into suspend - can block network
connections for example - and only returns to a fully functional state after
an unquiesce phase:

   https://github.com/systemd/systemd/issues/20222

The downside of course is precisely that the guest now needs to be aware
and involved every single time a snapshot is taken.

Currently with virt the act of taking a snapshot can often remain invisible
to the VM with no functional effect on the guest OS or its workload, and
the host OS knows it can complete a snapshot in a specific timeframe. That
said, this transparency to the VM is precisely the cause of the race
condition described.

With guest involvement to quiesce the bulk of activity for a time period,
there is more likely to be a negative impact on the guest workload. The
guest admin likely needs to be more explicit about exactly when in time
it is reasonable to take a snapshot to mitigate the impact.

The host OS snapshot operations are also now dependent on co-operation
of a guest OS that has to be considered to be potentially malicious, or
at least crashed/non-responsive. The guest OS also needs a way to receive
the triggers for snapshot capture and restore, most likely via an extension
to something like the QEMU guest agent or an equivalent for other
hypervisors.

What you describe sounds almost exactly like pressing a power button on 
modern systems. You don't just kill the power line, you press a button 
and wait for the guest to acknowledge that it's ready.

Maybe the real answer to all of this is S3: Suspend to RAM. You press 
the suspend button, the guest can prepare for sleep (quiesce!) and the 
next time you run, it can check whether VMGenID changed and act accordingly.

Despite the above, I'm not against the idea of co-operative involvement
of the guest OS in the acts of taking & restoring snapshots. I can't
see any other proposals so far that can reliably eliminate the races
in the general case, from the kernel right up to user applications.
So I think it is necessary to have guest cooperative snapshotting.

What exact use case do you have in mind for the RNG/VMGenID update? Can you
think of situations where the race is not an actual concern?

Let's assume we do take the approach described in that systemd bug and
have a co-operative snapshot process. If the hypervisor does the right
thing and guest owners install the right things, they'll have a race
free solution that works well in normal operation. That's good.

Realistically though, it is never going to be universally and reliably
put into practice. So what is our attitude to cases where the preferred
solution isn't available and/or operative ?

There are going to be users who continue to build their guest disk images
without the QEMU guest agent (or equivalent for whatever hypervisor they
run on) installed because they don't know any better. Or where the guest
agent is mis-configured or fails to start or some other scenario that
prevents the quiesce working as desired. The host mgmt could refuse to
take a snapshot in these cases. More likely is that they are just
going to go ahead and do a snapshot anyway because lack of guest agent
is a very common scenario today and users want their snapshots.

There are going to be virt management apps / hypervisors that don't
support talking to any guest agent across their snapshot operation
in the first place, so systemd gets no way to trigger the required
quiesce dance on snapshot, but they likely have VMGenID support
implemented already.

IOW, I could view VMGenID triggered fork detection integrated with
the kernel RNG as providing a backup line of defence that is going
to "just work", albeit with the known race. It isn't as good as the
guest co-operative snapshot approach, because it only tries to solve
the one specific targeted problem of updating the kernel RNG.

Is it still better than doing nothing at all though, for the scenario
where guest co-operative snapshot is unavailable ?

If it is better than nothing, is it then compelling enough to justify
the maint cost of the code added to the kernel ?

I'm tempted to say "If it also exposes the VMGenID via sysfs so that you 
can actually check whether you were cloned, probably yes."
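
As a rough sketch of what such a userspace check could look like: no sysfs
attribute for VMGenID is defined in this thread, so the file read below is a
stand-in for a hypothetical one, and `was_cloned` and the `last_seen` state
file are likewise illustrative names.

```python
import tempfile
from pathlib import Path


def was_cloned(id_path: Path, last_seen_path: Path) -> bool:
    """Compare the current generation ID with the one recorded on the previous check."""
    current = id_path.read_text().strip()
    previous = last_seen_path.read_text().strip() if last_seen_path.exists() else None
    last_seen_path.write_text(current + "\n")
    # The first run has nothing to compare against, so it reports no clone.
    return previous is not None and previous != current


# Demo against a temporary directory instead of a real sysfs.
with tempfile.TemporaryDirectory() as tmp:
    fake_sysfs = Path(tmp) / "vmgenid"  # stand-in for a hypothetical sysfs attribute
    last_seen = Path(tmp) / "last_seen"
    fake_sysfs.write_text("aaaa\n")
    first = was_cloned(fake_sysfs, last_seen)
    unchanged = was_cloned(fake_sysfs, last_seen)
    fake_sysfs.write_text("bbbb\n")  # simulate the hypervisor assigning a new ID
    after_clone = was_cloned(fake_sysfs, last_seen)

print(first, unchanged, after_clone)
```

Because the recorded ID lives on the guest's own disk, a clone inherits the
old value and sees the mismatch on its next check. A check like this is still
subject to the same race: it only tells you after the fact.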

Alex
