From: Kent Overstreet
> Sent: 14 May 2023 06:45
...
> dynamically generated unpack:
> rand_insert: 20.0 MiB with 1 threads in 33 sec, 1609 nsec per iter, 607 KiB per sec
>
> old C unpack:
> rand_insert: 20.0 MiB with 1 threads in 35 sec, 1672 nsec per iter, 584 KiB per sec
>
> the Eric Biggers special:
> rand_insert: 20.0 MiB with 1 threads in 35 sec, 1676 nsec per iter, 583 KiB per sec
>
> Tested two versions of your approach, one without a shift value, one
> where we use a shift value to try to avoid unaligned access - second was
> perhaps 1% faster

You won't notice any effect of avoiding unaligned accesses on x86.
I think they then get split into 64-bit accesses, and split again on
64-byte boundaries (that is what I see for uncached accesses to PCIe).
The kernel won't be doing anything wider than 64 bits, and the
out-of-order pipeline will tend to hide the rest (especially since you
get two reads per clock).

> so it's not looking good. This benchmark doesn't even hit on
> unpack_key() quite as much as I thought, so the difference is
> significant.

Beware: unless you manage to lock the CPU frequency (which is
~impossible on some CPUs), timings in nanoseconds are pretty useless.
You can use the performance counters to get accurate cycle counts
(provided there isn't a CPU switch in the middle of a micro-benchmark).

	David
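
For illustration of the unaligned-access point, here is a small sketch
(not from the thread; the helper names are made up) showing that an
"unaligned" 64-bit load written via memcpy() compiles to the same single
mov as an aligned dereference on x86-64, with any penalty confined to
loads that happen to cross a cache-line boundary:

#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, for illustration only. */
static inline uint64_t load_u64_unaligned(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));	/* gcc/clang at -O2 emit a plain 8-byte mov */
	return v;
}

static inline uint64_t load_u64_aligned(const uint64_t *p)
{
	return *p;			/* identical code on x86-64 */
}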
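
And a minimal user-space sketch of counting actual CPU cycles with the
PMU via perf_event_open(), rather than timing in nanoseconds; the
do_unpack() stand-in and the iteration count are illustrative, not part
of the benchmark in the thread:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* No glibc wrapper exists for this syscall. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Stand-in for the code under test. */
static void do_unpack(void)
{
	asm volatile("" ::: "memory");
}

int main(void)
{
	struct perf_event_attr attr;
	const unsigned long iters = 1000000;
	uint64_t cycles;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.disabled = 1;
	attr.exclude_kernel = 1;
	attr.exclude_hv = 1;

	/* Count cycles for this thread only; the counter follows it across CPUs. */
	fd = perf_event_open(&attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	for (unsigned long i = 0; i < iters; i++)
		do_unpack();

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles))
		return 1;

	printf("%.1f cycles/iter\n", (double)cycles / iters);
	close(fd);
	return 0;
}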