I’ve not followed the whole discussion, but I did some tests a couple of months ago replacing the calls to memcpy in the memmap engine with a new fio_memcpy function, rebuilding fio with the -ftree-vectorize flag along with -msse2, -mavx etc. I saw an improvement of a few hundred MB/sec on my openSUSE system. — Rebecca Sent from my iPhone > On Nov 28, 2017, at 9:13 PM, Elliott, Robert (Persistent Memory) <elliott@xxxxxxx> wrote: > > >> -----Original Message----- >> From: Gavriliuk, Anton (HPS Ukraine) >> Sent: Tuesday, November 28, 2017 9:35 PM >> To: Elliott, Robert (Persistent Memory) <elliott@xxxxxxx>; Rebecca Cran >> <rebecca@xxxxxxxxxxxx>; Sitsofe Wheeler <sitsofe@xxxxxxxxx> >> Cc: fio@xxxxxxxxxxxxxxx; Kani, Toshimitsu <toshi.kani@xxxxxxx> >> Subject: RE: fio 3.2 > > >> 4. The CPU instructions used by fio depend on the glibc library version. >> As mentioned in an earlier fio thread, that changes a lot. With >> libc2.24.so, random reads seem to be done with rep movsb. >> >> The test box SLES12 SP3 has glibc2.22, so we have to update. But I can't >> understand what does it mean - " random reads seem to be done with rep >> movsb " > > While the test is running, run > perf top > > and select one of the busiest functions (hopefully a memcpy function): > > 90.14% libc-2.24.so [.] __memmove_avx_unaligned_erms > 3.85% [unknown] [k] 0x00007f7414f5c4ff > 0.53% [kernel] [k] wait_consider_task > 0.33% fio [.] get_io_u > 0.29% fio [.] td_io_queue > 0.26% fio [.] io_u_sync_complete > > Hit enter twice to get to the assembly language code. This shows you > the amount of time the CPUs are spending in each instruction (based on > sampling): > > 0.03 │ 2b: cmp __x86_shared_non_temporal_threshold,%rdx > 0.01 │ ↓ jae 15d > │ cmp %rsi,%rdi > 0.00 │ ↓ jb 4c > │ ↓ je 51 > │ lea (%rsi,%rdx,1),%r9 > │ cmp %r9,%rdi > │ ↓ jb 211 > │ 4c: mov %rdx,%rcx > 99.88 │ rep movsb %ds:(%rsi),%es:(%rdi) > 0.08 │ 51: ↠retq > │ 52: cmp $0x10,%dl > > That means 99.88% of 90.14% of the time is spend on rep movsb, presumably > reading from persistent memory (and writing to regular memory, but to > a buffer that's entirely in the CPU caches). > > Some more examples of how to analyze the results: > > If I remove the norandommap option, fio spends about 1% of the CPU time > in a function maintaining the list of which LBAs it has read: > 89.63% libc-2.24.so [.] __memmove_avx_unaligned_erms > 3.28% [unknown] [k] 0x00007f34da4824ff > 0.87% fio [.] axmap_isset > 0.58% [kernel] [k] wait_consider_task > 0.38% fio [.] get_io_u > 0.34% fio [.] io_u_sync_complete > > > If I switch to random writes and remove the zerobuffers option, fio spends > about 7% of the CPU time creating non-zero write data (to then write > into persistent memory): > 79.04% libc-2.24.so [.] __memmove_avx_unaligned_erms > 6.98% fio [.] get_io_u > 4.24% [unknown] [k] 0x00007f8ee86d34ff > 0.85% fio [.] io_queue_event > 0.82% fio [.] axmap_isset > 0.62% fio [.] io_u_sync_complete > 0.62% [kernel] [k] wait_consider_task > 0.58% fio [.] td_io_queue > 0.45% fio [.] thread_main > > get_io_u calls small_content_scramble which calls memcpy for 8 > bytes at a time. On my system, gcc chose to inline memcpy > rather than call the glibc library version, and it ends up just > using regular mov instructions: > > │ memcpy(): > │ > │ __fortify_function void * > │ __NTH (memcpy (void *__restrict __dest, const void *__restrict __src, > │ size_t __len)) > │ { > │ return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest)); > 32.56 │ mov %rax,-0x200(%rcx,%rsi,1) > 5.18 │ mov 0x0(%rbp),%rsi > │ small_content_scramble(): > │ get_verify = 1; > 2.49 │ add $0x200,%rax > 2.81 │ mov 0x8(%rbp),%rdi > │ memcpy(): > 32.91 │ mov %rsi,-0x10(%rcx) > 3.45 │ mov %rdi,-0x8(%rcx) > │ small_content_scramble(): > │ td->trim_batch = td->o.trim_backlog; > │ get_trim = 1; > │ } > │ > │ if (get_trim && get_next_trim(td, io_u)) > │ return true; > > small_content_scramble has hardly been touched since 2011, so it probably > hasn't had much performance analysis. > > One of the few changes made was to add an integer divide by 1000, which > is always slow (painfully slow on some CPUs): > > offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511; > > perf top doesn't show that taking time - I think the compiler realized > it could pull that calculation out of the loop and just do it once. Different > compilers and compiler options might not realize that. > > > èº{.nÇ+‰·Ÿ®‰†+%ŠËlzwm…éb맲æìr¸›yø¨Š{ayºÊ‡Ú™ë,j¢f£¢·hš‹àz¹®w¥¢¸¢·¦j:+v‰¨ŠwèjØm¶Ÿÿ¾«‘êçzZ+ƒùšŽŠÝ¢j"ú!¶i -- To unsubscribe from this list: send the line "unsubscribe fio" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html