On 9/11/21 15:16, Jens Axboe wrote:
> Looking at profile:
>
>  43.34 │ rep stos %rax,%es:(%rdi)
>
> I do wonder if rep stos is just not very well suited for small regions,
> either in general or particularly on AMD. What do your profiles look
> like for before and after?

Since I do not know which tool was used to obtain the above information,
I ran "perf record -ags sleep 10" while the test was running. I could not
find bio_init() in the output, which I think means that the function got
inlined. bio_alloc_bioset() did show up in the output, however. The time
spent in that function is lower when IOPS are higher.

The performance numbers in the patch description come from an Intel Xeon
Gold 6154 CPU. I reran the test today on an old Intel Core i7-4790 CPU and
obtained the opposite result: higher IOPS without this patch than with it,
although the assembler code looks to be the same. So it seems that how
fast "rep stos" runs depends on the CPU type?

Bart.
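
One way to check the CPU dependence outside the kernel might be a small
user-space microbenchmark. The sketch below is not taken from the patch:
struct small_obj, clear_rep_stos(), clear_stores() and ns_per_call() are
hypothetical names, the 128-byte size is an arbitrary stand-in for a small
structure, and the inline assembly assumes GCC or Clang on x86-64 (other
targets fall back to memset()).

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for a small structure; NOT the kernel's struct bio. */
struct small_obj {
	uint64_t words[16];	/* 128 bytes, arbitrary choice */
};

/* Zero the object with "rep stosq" on x86-64; use memset() elsewhere. */
static void clear_rep_stos(struct small_obj *o)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	void *dst = o;
	size_t n = sizeof(*o) / sizeof(uint64_t);

	/* RDI = destination, RCX = count of 8-byte stores, RAX = fill value */
	__asm__ volatile("rep stosq"
			 : "+D"(dst), "+c"(n)
			 : "a"((uint64_t)0)
			 : "memory");
#else
	memset(o, 0, sizeof(*o));
#endif
}

/*
 * Zero the object with a plain store loop. The compiler is free to
 * unroll or vectorize this, or even turn it into a memset() call, so
 * the generated code should be inspected before trusting the numbers.
 */
static void clear_stores(struct small_obj *o)
{
	for (size_t i = 0; i < sizeof(o->words) / sizeof(o->words[0]); i++)
		o->words[i] = 0;
}

/* Average wall-clock nanoseconds per call of fn over iters calls. */
static double ns_per_call(void (*fn)(struct small_obj *), long iters)
{
	struct small_obj o;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++)
		fn(&o);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
		(double)(t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```

Calling ns_per_call(clear_rep_stos, N) and ns_per_call(clear_stores, N)
for a large N on both CPUs would give a rough comparison. Such a loop only
measures the hot-cache case, so it cannot replace the IOPS measurements.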