> -----Original Message----- > From: Gavriliuk, Anton (HPS Ukraine) > Sent: Tuesday, November 28, 2017 9:35 PM > To: Elliott, Robert (Persistent Memory) <elliott@xxxxxxx>; Rebecca Cran > <rebecca@xxxxxxxxxxxx>; Sitsofe Wheeler <sitsofe@xxxxxxxxx> > Cc: fio@xxxxxxxxxxxxxxx; Kani, Toshimitsu <toshi.kani@xxxxxxx> > Subject: RE: fio 3.2 > 4. The CPU instructions used by fio depend on the glibc library version. > As mentioned in an earlier fio thread, that changes a lot. With > libc2.24.so, random reads seem to be done with rep movsb. > > The test box SLES12 SP3 has glibc2.22, so we have to update. But I can't > understand what does it mean - " random reads seem to be done with rep > movsb " While the test is running, run perf top and select one of the busiest functions (hopefully a memcpy function): 90.14% libc-2.24.so [.] __memmove_avx_unaligned_erms 3.85% [unknown] [k] 0x00007f7414f5c4ff 0.53% [kernel] [k] wait_consider_task 0.33% fio [.] get_io_u 0.29% fio [.] td_io_queue 0.26% fio [.] io_u_sync_complete Hit enter twice to get to the assembly language code. This shows you the amount of time the CPUs are spending in each instruction (based on sampling): 0.03 │ 2b: cmp __x86_shared_non_temporal_threshold,%rdx 0.01 │ ↓ jae 15d │ cmp %rsi,%rdi 0.00 │ ↓ jb 4c │ ↓ je 51 │ lea (%rsi,%rdx,1),%r9 │ cmp %r9,%rdi │ ↓ jb 211 │ 4c: mov %rdx,%rcx 99.88 │ rep movsb %ds:(%rsi),%es:(%rdi) 0.08 │ 51: ← retq │ 52: cmp $0x10,%dl That means 99.88% of 90.14% of the time is spend on rep movsb, presumably reading from persistent memory (and writing to regular memory, but to a buffer that's entirely in the CPU caches). Some more examples of how to analyze the results: If I remove the norandommap option, fio spends about 1% of the CPU time in a function maintaining the list of which LBAs it has read: 89.63% libc-2.24.so [.] __memmove_avx_unaligned_erms 3.28% [unknown] [k] 0x00007f34da4824ff 0.87% fio [.] axmap_isset 0.58% [kernel] [k] wait_consider_task 0.38% fio [.] get_io_u 0.34% fio [.] io_u_sync_complete If I switch to random writes and remove the zerobuffers option, fio spends about 7% of the CPU time creating non-zero write data (to then write into persistent memory): 79.04% libc-2.24.so [.] __memmove_avx_unaligned_erms 6.98% fio [.] get_io_u 4.24% [unknown] [k] 0x00007f8ee86d34ff 0.85% fio [.] io_queue_event 0.82% fio [.] axmap_isset 0.62% fio [.] io_u_sync_complete 0.62% [kernel] [k] wait_consider_task 0.58% fio [.] td_io_queue 0.45% fio [.] thread_main get_io_u calls small_content_scramble which calls memcpy for 8 bytes at a time. On my system, gcc chose to inline memcpy rather than call the glibc library version, and it ends up just using regular mov instructions: │ memcpy(): │ │ __fortify_function void * │ __NTH (memcpy (void *__restrict __dest, const void *__restrict __src, │ size_t __len)) │ { │ return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest)); 32.56 │ mov %rax,-0x200(%rcx,%rsi,1) 5.18 │ mov 0x0(%rbp),%rsi │ small_content_scramble(): │ get_verify = 1; 2.49 │ add $0x200,%rax 2.81 │ mov 0x8(%rbp),%rdi │ memcpy(): 32.91 │ mov %rsi,-0x10(%rcx) 3.45 │ mov %rdi,-0x8(%rcx) │ small_content_scramble(): │ td->trim_batch = td->o.trim_backlog; │ get_trim = 1; │ } │ │ if (get_trim && get_next_trim(td, io_u)) │ return true; small_content_scramble has hardly been touched since 2011, so it probably hasn't had much performance analysis. One of the few changes made was to add an integer divide by 1000, which is always slow (painfully slow on some CPUs): offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511; perf top doesn't show that taking time - I think the compiler realized it could pull that calculation out of the loop and just do it once. Different compilers and compiler options might not realize that. ��.n��������+%������w��{.n�������^n�r������&��z�ޗ�zf���h���~����������_��+v���)ߣ�