Re: propagating vmgenid outward and upward

"Jason A. Donenfeld" <Jason@xxxxxxxxx> · Wed, 2 Mar 2022 12:26:27 +0100

Hey Michael,

Thanks for the benchmark.

On Wed, Mar 2, 2022 at 9:30 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> So yes, the overhead is higher by 50% which seems a lot but it's from a
> very small number, so I don't see why it's a show stopper, it's not by a
> factor of 10 such that we should sacrifice safety by default. Maybe a
> kernel flag that removes the read replacing it with an interrupt will
> do.
>
> In other words, premature optimization is the root of all evil.

Unfortunately I don't think it's as simple as that for several reasons.

First, I'm pretty confident a beefy Intel machine can mostly hide
non-dependent comparisons in the memory access and have the problem
mostly go away. But this is much less the case on, say, an in-order
MIPS32r2, which isn't just "some crappy ISA I'm using for the sake of
argument," but actually the platform on which a lot of networking and
WireGuard stuff runs, so I do care about it. There, we have 4
reads/comparisons which can't pipeline nearly as well.

There's also the atomicity aspect, which I think makes your benchmark
not quite accurate. Those 16 bytes could change between the first and
second word (or between the Nth and N+1th word for N<=3 on 32-bit).
What if in that case the word you read second doesn't change, but the
word you read first did? So then you find yourself having to do a
hi-lo-hi dance. And then consider the 32-bit case, where that's even
more annoying. This is just one of those things that comes up when you
compare the semantics of a "large unique ID" and "word-sized counter",
as general topics. (My suggestion is that vmgenid provide both.)

Finally, there's a slightly storage aspect, where adding 16 bytes to a
per-key struct is a little bit heavier than adding 4 bytes and might
bust a cache line without sufficient care, care which always has some
cost in one way or another.

So I just don't know if it's realistic to impose a 16-byte per-packet
comparison all the time like that. I'm familiar with WireGuard
obviously, but there's also cifs and maybe even wifi and bluetooth,
and who knows what else, to care about too. Then there's the userspace
discussion. I can't imagine a 16-byte hotpath comparison being
accepted as implementable.

> And I feel if linux
> DTRT and reads the 16 bytes then hypervisor vendors will be motivated to
> improve and add a 4 byte unique one. As long as linux is interrupt
> driven there's no motivation for change.

I reeeeeally don't want to get pulled into the politics of this on the
hypervisor side. I assume an improved thing would begin with QEMU and
Firecracker or something collaborating because they're both open
source and Amazon people seem interested. And then pressure builds for
Microsoft and VMware to do it on their side. And then we get this all
nicely implemented in the kernel. In the meantime, though, I'm not
going to refuse to address the problem entirely just because the
virtual hardware is less than perfect; I'd rather make the most with
what we've got while still being somewhat reasonable from an
implementation perspective.

Jason