RE: fio 3.2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



It's important, because fio shows 6.9us clat for reads and 13us for writes, long time..., access to the PMEM just a few hundred ns.  So I need to understand where we spend time.

   read: IOPS=36.7k, BW=1865MiB/s (1956MB/s)(232GiB/127291msec)
    clat (nsec): min=163, max=16033k, avg=6827.63, stdev=9340.91
     lat (nsec): min=188, max=16033k, avg=6860.92, stdev=9341.21
  write: IOPS=697k, BW=34.6GiB/s (37.2GB/s)(4408GiB/127291msec)
    clat (nsec): min=82, max=48028k, avg=12479.56, stdev=13459.80
     lat (nsec): min=106, max=48028k, avg=12524.94, stdev=13464.29

-----Original Message-----
From: Gavriliuk, Anton (HPS Ukraine) 
Sent: Monday, December 04, 2017 7:12 PM
To: 'Jens Axboe' <axboe@xxxxxxxxx>; Robert Elliott (Persistent Memory) <elliott@xxxxxxx>; 'Rebecca Cran' <rebecca@xxxxxxxxxxxx>; 'Sitsofe Wheeler' <sitsofe@xxxxxxxxx>
Cc: 'fio@xxxxxxxxxxxxxxx' <fio@xxxxxxxxxxxxxxx>; Kani, Toshimitsu <toshi.kani@xxxxxxx>
Subject: RE: fio 3.2

I have run next command several times,

perf record /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0

each perf report shows me toppest clock_thread_fn(),

Samples: 8M of event 'cycles:ppp', Event count (approx.): 23391179916235
Overhead  Command  Shared Object       Symbol
  75.27%  fio      fio                 [.] clock_thread_fn
  13.68%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
  10.05%  fio      fio                 [.] fill_random_buf

Samples: 8M of event 'cycles:ppp', Event count (approx.): 28889015047191
Overhead  Command  Shared Object       Symbol
  79.97%  fio      fio                 [.] clock_thread_fn
  11.08%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
   8.14%  fio      fio                 [.] fill_random_buf
   0.13%  fio      fio                 [.] get_io_u

Samples: 8M of event 'cycles:ppp', Event count (approx.): 57473976919058
Overhead  Command  Shared Object       Symbol
  89.94%  fio      fio                 [.] clock_thread_fn
   5.57%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
   4.09%  fio      fio                 [.] fill_random_buf
   0.07%  fio      fio                 [.] get_io_u

Samples: 8M of event 'cycles:ppp', Event count (approx.): 160828042961056
Overhead  Command  Shared Object       Symbol                                                                                                               ◆
  92.30%  fio      fio                 [.] clock_thread_fn                                                                                                  ▒
   4.10%  fio      [kernel.kallsyms]   [k] filemap_map_pages                                                                                                ▒
   1.99%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned                                                                                           ▒
   1.46%  fio      fio                 [.] fill_random_buf                                                                                                  ▒
   0.02%  fio      fio                 [.] get_io_u                                                                                                         ▒
   0.02%  fio      fio                 [.] io_u_sync_complete      

Samples: 8M of event 'cycles:ppp', Event count (approx.): 163028870650810
Overhead  Command  Shared Object       Symbol
  87.68%  fio      fio                 [.] clock_thread_fn
   8.77%  fio      [kernel.kallsyms]   [k] run_timer_softirq
   1.96%  fio      libc-2.22.so        [.] __memcpy_avx_unaligned
   1.44%  fio      fio                 [.] fill_random_buf
   0.02%  fio      fio                 [.] get_io_u
   0.02%  fio      fio                 [.] io_u_sync_complete

 89.62 │       lock   cmpxchg %edi,(%rsi)                                                                                                                   ▒
       │     clock_thread_fn():                                                                                                                             ▒
       │                             seq = *t->seq;                                                                                                         ▒
       │                             if (seq == UINT_MAX)                                                                                                   ▒
       │                                     break;                                                                                                         ▒
       │                             __sync_synchronize();                                                                                                  ▒
       │                             tsc = get_cpu_clock();                                                                                                 ▒
       │                     } while (seq != atomic32_compare_and_swap(t->seq, seq, seq + 1));                                                              ▒
       │       cmp    %eax,%ecx                                                                                                                             ▒
       │     ↓ je     118                                                                                                                                   ◆
       │                     uint32_t seq;                                                                                                                  ▒
       │                     uint64_t tsc;                                                                                                                  ▒
       │                                                                                                                                                    ▒
       │                     c->cpu = t->cpu;                                                                                                               ▒
       │                     do {                                                                                                                           ▒
       │                             seq = *t->seq;                                                                                                         ▒
       │ c9:   mov    0x68(%rbx),%rcx                                                                                                                       ▒
 10.01 │       mov    (%rcx),%ecx        

May you please clarify, is it ok ?, what is the lock contention...

Anton                                                                                                      				  ▒                                         

-----Original Message-----
From: Gavriliuk, Anton (HPS Ukraine)
Sent: Sunday, December 03, 2017 5:24 AM
To: 'Jens Axboe' <axboe@xxxxxxxxx>; Robert Elliott (Persistent Memory) <elliott@xxxxxxx>; Rebecca Cran <rebecca@xxxxxxxxxxxx>; Sitsofe Wheeler <sitsofe@xxxxxxxxx>
Cc: fio@xxxxxxxxxxxxxxx; Kani, Toshimitsu <toshi.kani@xxxxxxx>
Subject: RE: fio 3.2

Ok, thanks.  Now I built latest fio available :-)

dl560g10spmem01:~ # /usr/local/bin/fio --filename=/dev/dax0.0 --rw=randrw --norandommap --randrepeat=0 --ioengine=mmap --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx --size=290g --numa_cpu_nodes=0
4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ...
fio-3.2-64-gaf7fb
Starting 16 processes
Jobs: 16 (f=16): [m(16)][8.0%][r=2138MiB/s,w=39.7GiB/s][r=42.2k,w=799k IOPS][eta 01m:44s]
       
       From perf top,
       
Samples: 1M of event 'cycles:ppp', Event count (approx.): 503978848613
Overhead  Shared Object       Symbol                                                                                                                        ◆
  93.34%  libc-2.22.so        [.] __memcpy_avx_unaligned      

Just can't drill it down,

__memcpy_avx_unaligned  /lib64/libc-2.22.so Huh? No selection. Report to linux-kernel@xxxxxxxxxxxxxxx 

While other functions can,

get_io_u  /usr/local/bin/fio
  0.08 │        xor    %eax,%esi
  0.01 │        shl    $0x6,%rsi
  0.02 │        and    $0x1c0,%esi
  0.00 │        sub    %rax,%rsi
  0.12 │        add    %rdx,%rsi
  0.00 │        sub    %rax,%rdx
       │        nop
       │                      /*
       │                       * Fill offset into start of cacheline, time into end
       │                       * of cacheline
       │                       */
       │                      iptr = (void *) p + (offset << 6);
       │                      *iptr = boffset;
 16.01 │ 230:   mov    %rax,(%rsi,%rax,1)
       │
       │                      iptr = (void *) p + 64 - 2 * sizeof(uint64_t);
       │                      iptr[0] = io_u->start_time.tv_sec;
  7.03 │        mov    0x0(%rbp),%rcx
 13.52 │        mov    %rcx,0x30(%rdx,%rax,1)                   

-----Original Message-----
From: Jens Axboe [mailto:axboe@xxxxxxxxx]
Sent: Thursday, November 30, 2017 11:24 PM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@xxxxxxx>; Robert Elliott (Persistent Memory) <elliott@xxxxxxx>; Rebecca Cran <rebecca@xxxxxxxxxxxx>; Sitsofe Wheeler <sitsofe@xxxxxxxxx>
Cc: fio@xxxxxxxxxxxxxxx; Kani, Toshimitsu <toshi.kani@xxxxxxx>
Subject: Re: fio 3.2

On 11/30/2017 07:17 AM, Gavriliuk, Anton (HPS Ukraine) wrote:
> It's there any chance to fix it ?
> 
> dl560g10spmem01:/var/work # /usr/local/bin/fio --filename=/dev/dax0.0 
> --rw=randrw --refill_buffers --randrepeat=0 --ioengine=mmap
> --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5
> --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting 
> --name=4-rand-rw-3xx --size=290g
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=(R) 4096B-256KiB, (W) 
> 4096B-256KiB, (T) 4096B-256KiB, ioengine=mmap, iodepth=1 ...
> fio-2.99
> Starting 16 processes
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0
> 
> 4-rand-rw-3xx: failed to get file size of /dev/dax0.0

Should already be fixed as of yesterday morning.

--
Jens Axboe

��.n��������+%������w��{.n�������^n�r������&��z�ޗ�zf���h���~����������_��+v���)ߣ�

[Index of Archives]     [Linux Kernel]     [Linux SCSI]     [Linux IDE]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux SCSI]

  Powered by Linux