RE: fio 3.2

"Elliott, Robert (Persistent Memory)" <elliott@xxxxxxx> · Wed, 29 Nov 2017 04:13:53 +0000

> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Tuesday, November 28, 2017 9:35 PM
> To: Elliott, Robert (Persistent Memory) <elliott@xxxxxxx>; Rebecca Cran
> <rebecca@xxxxxxxxxxxx>; Sitsofe Wheeler <sitsofe@xxxxxxxxx>
> Cc: fio@xxxxxxxxxxxxxxx; Kani, Toshimitsu <toshi.kani@xxxxxxx>
> Subject: RE: fio 3.2

> 4. The CPU instructions used by fio depend on the glibc library version.
> As mentioned in an earlier fio thread, that changes a lot. With
> libc2.24.so, random reads seem to be done with rep movsb.
> 
> The test box SLES12 SP3 has glibc2.22, so we have to update. But I can't
> understand what does it mean - " random reads seem to be done with rep
> movsb "

While the test is running, run
	perf top

and select one of the busiest functions (hopefully a memcpy function):

  90.14%  libc-2.24.so         [.] __memmove_avx_unaligned_erms
   3.85%  [unknown]            [k] 0x00007f7414f5c4ff
   0.53%  [kernel]             [k] wait_consider_task
   0.33%  fio                  [.] get_io_u
   0.29%  fio                  [.] td_io_queue
   0.26%  fio                  [.] io_u_sync_complete

Hit enter twice to get to the assembly language code.  This shows you
the amount of time the CPUs are spending in each instruction (based on
sampling):

  0.03 │ 2b:   cmp    __x86_shared_non_temporal_threshold,%rdx
  0.01 │     ↓ jae    15d
       │       cmp    %rsi,%rdi
  0.00 │     ↓ jb     4c
       │     ↓ je     51
       │       lea    (%rsi,%rdx,1),%r9
       │       cmp    %r9,%rdi
       │     ↓ jb     211
       │ 4c:   mov    %rdx,%rcx
 99.88 │       rep    movsb %ds:(%rsi),%es:(%rdi)
  0.08 │ 51: ← retq
       │ 52:   cmp    $0x10,%dl

That means 99.88% of 90.14% of the time is spend on rep movsb, presumably
reading from persistent memory (and writing to regular memory, but to
a buffer that's entirely in the CPU caches).

Some more examples of how to analyze the results:

If I remove the norandommap option, fio spends about 1% of the CPU time 
in a function maintaining the list of which LBAs it has read:
  89.63%  libc-2.24.so                [.] __memmove_avx_unaligned_erms
   3.28%  [unknown]                   [k] 0x00007f34da4824ff
   0.87%  fio                         [.] axmap_isset 
   0.58%  [kernel]                    [k] wait_consider_task
   0.38%  fio                         [.] get_io_u
   0.34%  fio                         [.] io_u_sync_complete

If I switch to random writes and remove the zerobuffers option, fio spends
about 7% of the CPU time creating non-zero write data (to then write
into persistent memory):
  79.04%  libc-2.24.so        [.] __memmove_avx_unaligned_erms
   6.98%  fio                 [.] get_io_u
   4.24%  [unknown]           [k] 0x00007f8ee86d34ff
   0.85%  fio                 [.] io_queue_event
   0.82%  fio                 [.] axmap_isset
   0.62%  fio                 [.] io_u_sync_complete
   0.62%  [kernel]            [k] wait_consider_task
   0.58%  fio                 [.] td_io_queue
   0.45%  fio                 [.] thread_main

get_io_u calls small_content_scramble which calls memcpy for 8
bytes at a time.  On my system, gcc chose to inline memcpy
rather than call the glibc library version, and it ends up just
using regular mov instructions:

       │      memcpy():
       │
       │      __fortify_function void *
       │      __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
       │                     size_t __len))
       │      {
       │        return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
 32.56 │        mov    %rax,-0x200(%rcx,%rsi,1)
  5.18 │        mov    0x0(%rbp),%rsi
       │      small_content_scramble():
       │                              get_verify = 1;
  2.49 │        add    $0x200,%rax
  2.81 │        mov    0x8(%rbp),%rdi
       │      memcpy():
 32.91 │        mov    %rsi,-0x10(%rcx)
  3.45 │        mov    %rdi,-0x8(%rcx)
       │      small_content_scramble():
       │                                      td->trim_batch = td->o.trim_backlog;
       │                              get_trim = 1;
       │                      }
       │
       │                      if (get_trim && get_next_trim(td, io_u))
       │                              return true;

small_content_scramble has hardly been touched since 2011, so it probably
hasn't had much performance analysis.  

One of the few changes made was to add an integer divide by 1000, which 
is always slow (painfully slow on some CPUs):

	offset = ((io_u->start_time.tv_nsec/1000) ^ boffset) & 511;

perf top doesn't show that taking time - I think the compiler realized
it could pull that calculation out of the loop and just do it once.  Different
compilers and compiler options might not realize that.

��.n��������+%������w��{.n�������^n�r������&��z�ޗ�zf���h���~����������_��+v���)ߣ�