On Wed, Mar 02, 2022 at 05:32:07PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
>
> On Wed, Mar 02, 2022 at 11:22:46AM -0500, Michael S. Tsirkin wrote:
> > > Because that 16 byte read of vmgenid is not atomic. Let's say you read
> > > the first 8 bytes, and then the VM is forked.
> >
> > But at this point when VM was forked plaintext key and nonce are all in
> > buffer, and you previously indicated a fork at this point is harmless.
> > You wrote "If it changes _after_ that point of check ... it doesn't
> > matter:"
>
> Ahhh, fair point. I think you're right.
>
> Alright, so all we're talking about here is an ordinary 16-byte read,
> and 16 bytes of storage per keypair, and a 16-byte comparison.
>
> Still seems much worse than just having a single word...
>
> Jason

Oh I forgot about __int128.


#include <stdio.h>
#include <assert.h>
#include <limits.h>
#include <string.h>

struct lng {
	__int128 l;
};

struct shrt {
	unsigned long s;
};


struct lng l = { 1 };
struct shrt s = { 3 };

static void test1(volatile struct shrt *sp)
{
	if (sp->s != s.s) {
		printf("short mismatch!\n");
		s.s = sp->s;
	}
}

static void test2(volatile struct lng *lp)
{
	if (lp->l != l.l) {
		printf("long mismatch!\n");
		l.l = lp->l;
	}
}

int main(int argc, char **argv)
{
	volatile struct shrt sv = { 4 };
	volatile struct lng lv = { 5 };

	if (argc > 1) {
		printf("test 1\n");
		for (int i = 0; i < 100000000; ++i)
			test1(&sv);
	} else {
		printf("test 2\n");
		for (int i = 0; i < 100000000; ++i)
			test2(&lv);
	}
	return 0;
}


with that the compiler has an easier time to produce optimal
code, so the difference is smaller.
Note: compiled with
gcc -O2 -mno-sse -mno-sse2 -ggdb bench3.c

since with sse there's no difference at all.

[mst@tuck ~]$ perf stat -r 100 ./a.out 1 > /dev/null

 Performance counter stats for './a.out 1' (100 runs):

             94.55 msec task-clock:u              #    0.996 CPUs utilized            ( +-  0.09% )
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                52      page-faults:u             #  548.914 /sec                     ( +-  0.21% )
       400,459,851      cycles:u                  #    4.227 GHz                      ( +-  0.03% )
       500,147,935      instructions:u            #    1.25  insn per cycle           ( +-  0.00% )
       200,032,462      branches:u                #    2.112 G/sec                    ( +-  0.00% )
             1,810      branch-misses:u           #    0.00% of all branches          ( +-  0.73% )

         0.0949732 +- 0.0000875 seconds time elapsed  ( +-  0.09% )

[mst@tuck ~]$
[mst@tuck ~]$ perf stat -r 100 ./a.out > /dev/null

 Performance counter stats for './a.out' (100 runs):

            110.19 msec task-clock:u              #    1.136 CPUs utilized            ( +-  0.18% )
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                52      page-faults:u             #  537.743 /sec                     ( +-  0.22% )
       428,518,442      cycles:u                  #    4.431 GHz                      ( +-  0.07% )
       900,147,986      instructions:u            #    2.24  insn per cycle           ( +-  0.00% )
       200,032,505      branches:u                #    2.069 G/sec                    ( +-  0.00% )
             2,139      branch-misses:u           #    0.00% of all branches          ( +-  0.77% )

          0.096956 +- 0.000203 seconds time elapsed  ( +-  0.21% )

-- 
MST