Hey Michael,

Thanks for the benchmark.

On Wed, Mar 2, 2022 at 9:30 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> So yes, the overhead is higher by 50% which seems a lot but it's from a
> very small number, so I don't see why it's a show stopper, it's not by a
> factor of 10 such that we should sacrifice safety by default. Maybe a
> kernel flag that removes the read replacing it with an interrupt will
> do.
>
> In other words, premature optimization is the root of all evil.

Unfortunately I don't think it's as simple as that, for several reasons.

First, I'm pretty confident a beefy Intel machine can mostly hide
non-dependent comparisons in the memory access, so the problem mostly
goes away there. But this is much less the case on, say, an in-order
MIPS32r2, which isn't just "some crappy ISA I'm using for the sake of
argument," but actually the platform on which a lot of networking and
WireGuard stuff runs, so I do care about it. There, we have 4
reads/comparisons which can't pipeline nearly as well.

There's also the atomicity aspect, which I think makes your benchmark
not quite accurate. Those 16 bytes could change between the first and
second word (or between the Nth and (N+1)th word for N<=3 on 32-bit).
What if in that case the word you read second doesn't change, but the
word you read first did? So then you find yourself having to do a
hi-lo-hi dance. And then consider the 32-bit case, where that's even
more annoying. This is just one of those things that comes up when you
compare the semantics of a "large unique ID" and "word-sized counter",
as general topics. (My suggestion is that vmgenid provide both.)

Finally, there's a slight storage aspect, where adding 16 bytes to a
per-key struct is a little bit heavier than adding 4 bytes and might
bust a cache line without sufficient care, care which always has some
cost in one way or another.

So I just don't know if it's realistic to impose a 16-byte per-packet
comparison all the time like that.
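
To make the atomicity point a bit more concrete, here is a rough
userspace sketch in C of the two checks being compared. Everything in
it is made up for illustration: struct genid_page, genid_snapshot(),
genid_changed_wide() and genid_changed_counter() are hypothetical
names, not the vmgenid driver's API, and the "counter" field assumes
the word-sized counter suggested above exists. It also leans on
volatile instead of real barriers, which kernel code would need
(READ_ONCE() and friends), and the re-read loop only guards against a
single update landing mid-read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* 16 bytes seen as four 32-bit words, as on a 32-bit machine. */
#define GENID_WORDS 4

/* Hypothetical shared page: the 16-byte ID plus an assumed word-sized counter. */
struct genid_page {
	volatile uint32_t id[GENID_WORDS];
	volatile uint32_t counter;
};

/*
 * Read the ID twice and retry until both passes agree. A single update
 * landing between two word reads then cannot leave us holding a mix of
 * old and new words. Cost: at least 8 loads instead of 4.
 */
static void genid_snapshot(const struct genid_page *p, uint32_t out[GENID_WORDS])
{
	uint32_t again[GENID_WORDS];

	assert(p && out);
	do {
		for (int i = 0; i < GENID_WORDS; i++)
			out[i] = p->id[i];
		for (int i = 0; i < GENID_WORDS; i++)
			again[i] = p->id[i];
	} while (memcmp(out, again, sizeof(again)) != 0);
}

/* Wide check: stable snapshot, then a 16-byte comparison against the cache. */
static bool genid_changed_wide(const struct genid_page *p,
			       const uint32_t cached[GENID_WORDS])
{
	uint32_t now[GENID_WORDS];

	genid_snapshot(p, now);
	return memcmp(now, cached, sizeof(now)) != 0;
}

/* Counter check: one naturally atomic word load and one comparison. */
static bool genid_changed_counter(const struct genid_page *p, uint32_t cached)
{
	return p->counter != cached;
}
```

The per-key cache is 16 bytes in the first case and 4 in the second,
which is the storage difference mentioned above.
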
I'm familiar with WireGuard obviously, but there's also cifs and maybe
even wifi and bluetooth, and who knows what else, to care about too.

Then there's the userspace discussion. I can't imagine a 16-byte
hotpath comparison being accepted as implementable.

> And I feel if linux
> DTRT and reads the 16 bytes then hypervisor vendors will be motivated to
> improve and add a 4 byte unique one. As long as linux is interrupt
> driven there's no motivation for change.

I reeeeeally don't want to get pulled into the politics of this on the
hypervisor side. I assume an improved thing would begin with QEMU and
Firecracker or something collaborating because they're both open source
and Amazon people seem interested. And then pressure builds for
Microsoft and VMware to do it on their side. And then we get this all
nicely implemented in the kernel. In the meantime, though, I'm not
going to refuse to address the problem entirely just because the
virtual hardware is less than perfect; I'd rather make the most of
what we've got while still being somewhat reasonable from an
implementation perspective.

Jason