On 9/11/21 9:19 PM, Bart Van Assche wrote:
> On 9/11/21 15:16, Jens Axboe wrote:
>> Looking at profile:
>>
>> 43.34 │ rep stos %rax,%es:(%rdi)
>> I do wonder if rep stos is just not very well suited for small regions,
>> either in general or particularly on AMD.
>>
>> What do your profiles look like for before and after?
>
> Since I do not know which tool was used to obtain the above
> information, I ran perf record -ags sleep 10 while the test
> was running. I could not find bio_init in the output. I think
> that means that that function got inlined. But
> bio_alloc_bioset() showed up in the output. The time spent in
> that function is lower if IOPS are higher.

The above is from perf report, diving into the functions. Yours show up
in bio_alloc_bioset(), and mine in bio_alloc_kiocb() as I'm doing polled
IO.

> The performance numbers in the patch description come from a
> Intel Xeon Gold 6154 CPU. I reran the test today on an old Intel
> Core i7-4790 CPU and obtained the opposite result: higher IOPS
> without this patch than with this patch although the assembler
> code looks to be the same. It seems like how fast "rep stos"
> runs depends on the CPU type?

It does appear so. Which is a bit frustrating...

-- 
Jens Axboe
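
For anyone who wants to compare CPUs outside the kernel, here is a rough
userspace sketch that times "rep stos" against plain stores on a small
region. It is not the kernel code path: the 128-byte region size, the
iteration count, and all function names are assumptions made up for
illustration, and the numbers it prints say nothing about IOPS.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Assumption: 16 x 8 bytes = 128 bytes stands in for a "small region". */
#define REGION_WORDS 16
#define ITERATIONS   10000000UL

/* Zero n 8-byte words with "rep stosq" on x86-64; plain memset elsewhere. */
static void zero_rep_stos(uint64_t *dst, size_t n)
{
#if defined(__x86_64__)
	__asm__ volatile("rep stosq"
			 : "+D"(dst), "+c"(n)
			 : "a"((uint64_t)0)
			 : "memory");
#else
	memset(dst, 0, n * sizeof(*dst));
#endif
}

/*
 * Zero n 8-byte words with one store per word. The volatile pointer keeps
 * the compiler from turning the loop back into a memset call.
 */
static void zero_stores(uint64_t *dst, size_t n)
{
	volatile uint64_t *p = dst;

	for (size_t i = 0; i < n; i++)
		p[i] = 0;
}

/* Average CPU time per call in nanoseconds, measured with clock(). */
static double time_ns_per_call(void (*fn)(uint64_t *, size_t),
			       uint64_t *buf, unsigned long iters)
{
	clock_t t0 = clock();

	for (unsigned long i = 0; i < iters; i++)
		fn(buf, REGION_WORDS);

	clock_t t1 = clock();

	return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}

/* Print both timings; call this from main(). */
static void report(void)
{
	static uint64_t buf[REGION_WORDS];

	printf("rep stos: %.2f ns/call\n",
	       time_ns_per_call(zero_rep_stos, buf, ITERATIONS));
	printf("stores:   %.2f ns/call\n",
	       time_ns_per_call(zero_stores, buf, ITERATIONS));
}
```

Calling report() from main() on each machine gives two numbers to compare.
Since the buffer stays hot in L1, this only hints at the per-call cost of
the instruction on that CPU, not at how the allocation path behaves under
real IO load.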