Sorry Elliott, it seems I missed your mail.

1. I recommend using allow_file_create=0 to ensure fio doesn't just create a plain file called "/dev/dax0.0" on your regular storage device and do its I/Os to that.

Done, no visible improvement detected:

dl560g10spmem01:~ # /usr/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=290g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0 --allow_file_create=0
4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=mmap, iodepth=1
...
fio-2.12
Starting 16 processes
Jobs: 14 (f=14): [m(10),_(1),m(4),_(1)] [78.1% done] [1569MB/29903MB/0KB /s] [30.1K/588K/0 iops] [eta 00m:34s]
4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=18988: Tue Nov 28 21:30:12 2017
  read : io=204940MB, bw=1707.9MB/s, iops=33575, runt=120001msec
    clat (usec): min=0, max=1008, avg= 8.85, stdev= 5.25
     lat (usec): min=0, max=1008, avg= 8.89, stdev= 5.25
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    2], 20.00th=[    4],
     | 30.00th=[    6], 40.00th=[    8], 50.00th=[    9], 60.00th=[   11],
     | 70.00th=[   11], 80.00th=[   12], 90.00th=[   13], 95.00th=[   14],
     | 99.00th=[   29], 99.50th=[   43], 99.90th=[   48], 99.95th=[   49],
     | 99.99th=[   53]
    bw (KB  /s): min=68034, max=158528, per=6.25%, avg=109348.46, stdev=19939.95
  write: io=3798.1GB, bw=32417MB/s, iops=637449, runt=120001msec
    clat (usec): min=0, max=616, avg=15.13, stdev=10.08
     lat (usec): min=0, max=717, avg=15.17, stdev=10.08
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    2], 10.00th=[    3], 20.00th=[    6],
     | 30.00th=[   11], 40.00th=[   12], 50.00th=[   12], 60.00th=[   20],
     | 70.00th=[   21], 80.00th=[   22], 90.00th=[   23], 95.00th=[   24],
     | 99.00th=[   46], 99.50th=[   86], 99.90th=[   91], 99.95th=[   92],
     | 99.99th=[   96]
    bw (MB  /s): min= 1359, max= 2700, per=6.25%, avg=2026.98, stdev=363.84
    lat (usec) : 2=4.70%, 4=8.45%, 10=11.33%, 20=32.66%, 50=42.21%
    lat (usec) : 100=0.65%, 250=0.01%, 500=0.01%, 750=0.01%
    lat (msec) : 2=0.01%
  cpu          : usr=99.74%, sys=0.21%, ctx=2538, majf=0, minf=2378756
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=4029100/w=76494589/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=204940MB, aggrb=1707.9MB/s, minb=1707.9MB/s, maxb=1707.9MB/s, mint=120001msec, maxt=120001msec
  WRITE: io=3798.1GB, aggrb=32417MB/s, minb=32417MB/s, maxb=32417MB/s, mint=120001msec, maxt=120001msec

HARDCLOCK entries
   Count     Pct  State  Function
    1955  64.97%   USER  __memcpy_avx_unaligned  [/lib64/libc-2.22.so]
    1037  34.46%   USER  UNKNOWN
       6   0.20%    SYS  do_page_fault
       5   0.17%    SYS  find_next_iomem_res
       3   0.10%    SYS  pagerange_is_ram_callback
       1   0.03%    SYS  page_add_new_anon_rmap
       1   0.03%    SYS  lookup_memtype
       1   0.03%    SYS  vmf_insert_pfn_pmd

   Count     Pct  HARDCLOCK Stack trace
   ============================================================
       4   0.13%  do_page_fault  page_fault  unknown  | __memcpy_avx_unaligned
       3   0.10%  find_next_iomem_res  pagerange_is_ram_callback  walk_system_ram_range  pat_pagerange_is_ram  lookup_memtype  track_pfn_insert  vmf_insert_pfn_pmd  dax_dev_pmd_fault  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  unknown  | __memcpy_avx_unaligned
       2   0.07%  pagerange_is_ram_callback  walk_system_ram_range  pat_pagerange_is_ram  lookup_memtype  track_pfn_insert  vmf_insert_pfn_pmd  dax_dev_pmd_fault  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  unknown  | __memcpy_avx_unaligned
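For what it's worth, a similar picture can be grabbed without the HARDCLOCK tooling - a minimal sketch with perf, assuming perf is installed and fio is the only fio process on the box:

    perf top -p "$(pgrep -d, -x fio)"
    # glibc 2.22 shows up here as __memcpy_avx_unaligned; a newer glibc built
    # with ERMS support typically dispatches to an *_erms variant instead
    # (the "rep movsb" path). Exact symbol names vary by glibc version.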
3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits us to 1GiB of mapped files in total". I expect real software using device DAX will want to map the entire device with one mmap() call, then perform loads, stores, cache flushes, etc. - things that the NVML libraries help it do correctly. As is, I think fio keeps mapping and unmapping, which exercises the kernel more than the hardware.

Yes, fio maps the entire device with a single mmap():

******** SYSTEM CALL REPORT ********
System Call Name     Count     Rate     ElpTime        Avg        Max   Errs   AvSz     KB/s
futex                    7      0.2    0.101176   0.014454   0.101147      1
   SLEEP                 2      0.0    0.101148   0.050574
   Sleep Func            2             0.101148   0.050574   0.101140   futex_wait_queue_me
   RUNQ                                0.000004
   CPU                                 0.000024
open                     2      0.0    0.000132   0.000066   0.000130      0
read                     1      0.0    0.000009   0.000009   0.000009      0      3      0.0
shmat                    1      0.0    0.000006   0.000006   0.000006      0
mmap                     1      0.0    0.000003   0.000003   0.000003      0

4. The CPU instructions used by fio depend on the glibc library version. As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so, random reads seem to be done with rep movsb.

The test box (SLES12 SP3) has glibc 2.22, so we have to update. But I can't understand what "random reads seem to be done with rep movsb" means.

-----Original Message-----
From: Elliott, Robert (Persistent Memory)
Sent: Tuesday, November 28, 2017 6:52 AM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@xxxxxxx>; Rebecca Cran <rebecca@xxxxxxxxxxxx>; Sitsofe Wheeler <sitsofe@xxxxxxxxx>
Cc: fio@xxxxxxxxxxxxxxx; Kani, Toshimitsu <toshi.kani@xxxxxxx>
Subject: RE: fio 3.2

> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Monday, November 27, 2017 7:13 PM
> Subject: RE: fio 3.2
>
> No, I have true 4 pmem devices right now, 300Gb each,

I think that suggestion was for Sitsofe.

> On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@xxxxxxxxx
> <mailto:sitsofe@xxxxxxxxx>> wrote:
>
> > Unfortunately I don't have access to a pmem device but let's see how
> > far we get:
> >
> > On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine)
> > <anton.gavriliuk@xxxxxxx <mailto:anton.gavriliuk@xxxxxxx>> wrote:
> >>
> >> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32
> >> --rw=randwrite --bs=64k --size=1024m --numjobs=8
> >> --group_reporting=1 --eta=never --time_based --runtime=60
> >> --filename=/dev/dax0.0 | grep
> >> WRITE)
...
> > Here you've switched ioengine introducing another place to look.
> > Instead how about this:
> > ...
> > (apparently a size has to be specified when you try to use a
> > character device - see https://nvdimm.wiki.kernel.org/ )
> > ...
> > (Perhaps the documentation for these ioengines and pmem devices
> > needs to be improved?)

There are several oddities with device DAX (/dev/dax0.0 character devices, providing a window to persistent memory representing the whole device) compared to filesystem DAX (mounting ext4 or xfs with -o dax and using mmap to access files).

1. I recommend using allow_file_create=0 to ensure fio doesn't just create a plain file called "/dev/dax0.0" on your regular storage device and do its I/Os to that.
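A quick way to confirm nothing like that has happened is to check that the target is still a character device - a minimal shell sketch, assuming standard coreutils:

    stat -c '%F %t:%T' /dev/dax0.0   # expect "character special file" plus major:minor
    ls -l /dev/dax0.0                # a stray regular file would show a byte size here instead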
2. This script will convert 16 pmem devices into device DAX mode (trashing all data in the process):

#!/bin/bash
for i in {0..15}
do
        echo working on $i
        # -M mem puts struct page data in regular memory, not nvdimm memory;
        # that's fine for NVDIMM-Ns (-M dev is intended for larger capacities)
        ndctl create-namespace -m dax -M mem -e namespace$i.0 -f
done

ndctl carves out some space for device DAX metadata. For 8 GiB of persistent memory, the resulting character device sizes are:
   -M mem: 8587837440 bytes = 8190 MiB (8 GiB - 2 MiB)
   -M dev: 8453619712 bytes = 8062 MiB (8 GiB - 2 MiB - 128 MiB)

Since /dev/dax is a character device, it has no "size", so you must tell fio the size manually. We should patch fio to automatically detect that from /sys/class/dax/dax0.0/size in Linux. I don't know how this would work in Windows.

3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits us to 1GiB of mapped files in total". I expect real software using device DAX will want to map the entire device with one mmap() call, then perform loads, stores, cache flushes, etc. - things that the NVML libraries help it do correctly. As is, I think fio keeps mapping and unmapping, which exercises the kernel more than the hardware.

4. The CPU instructions used by fio depend on the glibc library version. As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so, random reads seem to be done with rep movsb.

norandommap, randrepeat=0, zero_buffers, and gtod_reduce=1 help reduce fio overhead at these rates. I don't think iodepth is used by the mmap ioengine.

5. If the blocksize * number of threads exceeds the CPU cache size, you'll start generating lots of traffic to regular memory in addition to persistent memory.

Example: 36 threads generating 64 KiB reads from /dev/dax0.0 end up reading 11 GB/s from persistent memory; the writes stay in the CPU cache (no memory write traffic).

|---------------------------------------||---------------------------------------|
|--              Socket 0             --||--              Socket 1             --|
|---------------------------------------||---------------------------------------|
|--      Memory Channel Monitoring    --||--      Memory Channel Monitoring    --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch 0: Reads (MB/s): 11022.33  --||-- Mem Ch 0: Reads (MB/s):    15.60  --|
|--           Writes(MB/s):    22.29  --||--           Writes(MB/s):     7.09  --|
|-- Mem Ch 1: Reads (MB/s):    56.72  --||-- Mem Ch 1: Reads (MB/s):     9.34  --|
|--           Writes(MB/s):    17.03  --||--           Writes(MB/s):     0.81  --|
|-- Mem Ch 4: Reads (MB/s):    60.12  --||-- Mem Ch 4: Reads (MB/s):    15.79  --|
|--           Writes(MB/s):    23.66  --||--           Writes(MB/s):     7.10  --|
|-- Mem Ch 5: Reads (MB/s):    54.94  --||-- Mem Ch 5: Reads (MB/s):     9.65  --|
|--           Writes(MB/s):    17.30  --||--           Writes(MB/s):     1.02  --|
|-- NODE 0 Mem Read (MB/s) : 11194.11 --||-- NODE 1 Mem Read (MB/s) :    50.37 --|
|-- NODE 0 Mem Write(MB/s) :    80.28 --||-- NODE 1 Mem Write(MB/s) :    16.02 --|
|-- NODE 0 P. Write (T/s)  :   192832 --||-- NODE 1 P. Write (T/s)  :   190615 --|
|-- NODE 0 Memory (MB/s)   : 11274.38 --||-- NODE 1 Memory (MB/s)   :    66.39 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--            System Read Throughput(MB/s):  11332.59                          --|
|--           System Write Throughput(MB/s):     96.23                          --|
|--          System Memory Throughput(MB/s):  11428.82                          --|
|---------------------------------------||---------------------------------------|
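A back-of-the-envelope check on those working sets (my own rough numbers for the 64 KiB case above and the 2 MiB case below; all 36 jobs sit on socket 0, so compare against whatever L3 size lscpu reports):

    lscpu | grep 'L3 cache'                   # per-socket L3 on this box
    echo "$(( 64 * 36 )) KiB working set"     # 64 KiB x 36 jobs = 2304 KiB, fits in L3
    echo "$(( 2048 * 36 )) KiB working set"   # 2 MiB x 36 jobs = 73728 KiB, far exceeds L3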
Increasing to 2 MiB reads, the bandwidth from persistent memory drops to 9 GB/s, and most of that traffic turns into writes to regular memory (with some reads as well, as the caches thrash).

|---------------------------------------||---------------------------------------|
|--              Socket 0             --||--              Socket 1             --|
|---------------------------------------||---------------------------------------|
|--      Memory Channel Monitoring    --||--      Memory Channel Monitoring    --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch 0: Reads (MB/s):  9069.96  --||-- Mem Ch 0: Reads (MB/s):    18.33  --|
|--           Writes(MB/s):  2026.71  --||--           Writes(MB/s):     8.44  --|
|-- Mem Ch 1: Reads (MB/s):  1070.84  --||-- Mem Ch 1: Reads (MB/s):    11.08  --|
|--           Writes(MB/s):  2021.61  --||--           Writes(MB/s):     2.11  --|
|-- Mem Ch 4: Reads (MB/s):  1069.20  --||-- Mem Ch 4: Reads (MB/s):    17.85  --|
|--           Writes(MB/s):  2028.27  --||--           Writes(MB/s):     8.51  --|
|-- Mem Ch 5: Reads (MB/s):  1062.81  --||-- Mem Ch 5: Reads (MB/s):    11.59  --|
|--           Writes(MB/s):  2021.74  --||--           Writes(MB/s):     2.31  --|
|-- NODE 0 Mem Read (MB/s) : 12272.82 --||-- NODE 1 Mem Read (MB/s) :    58.85 --|
|-- NODE 0 Mem Write(MB/s) :  8098.33 --||-- NODE 1 Mem Write(MB/s) :    21.36 --|
|-- NODE 0 P. Write (T/s)  :   220528 --||-- NODE 1 P. Write (T/s)  :   190643 --|
|-- NODE 0 Memory (MB/s)   : 20371.14 --||-- NODE 1 Memory (MB/s)   :    80.21 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--            System Read Throughput(MB/s):  12331.67                          --|
|--           System Write Throughput(MB/s):   8119.69                          --|
|--          System Memory Throughput(MB/s):  20451.36                          --|
|---------------------------------------||---------------------------------------|

Example simple script:

[random-test]
kb_base=1000
ioengine=mmap
iodepth=1
rw=randread
bs=64KiB
size=7GiB
numjobs=36
cpus_allowed_policy=split
cpus_allowed=0-17,36-53
group_reporting=1
time_based
runtime=60
allow_file_create=0
filename=/dev/dax0.0

Example more aggressive script:

[global]
kb_base=1000
ioengine=mmap
iodepth=1
rw=randread
bs=64KiB
ba=64KiB
numjobs=36
cpus_allowed_policy=split
cpus_allowed=0-17,36-53
group_reporting=1
norandommap
randrepeat=0
zero_buffers
gtod_reduce=1
time_based
runtime=60999
allow_file_create=0

[d0]
size=7GiB
filename=/dev/dax0.0

[d1]
size=7GiB
filename=/dev/dax1.0

[d2]
size=7GiB
filename=/dev/dax2.0

[d3]
size=7GiB
filename=/dev/dax3.0

[d4]
size=7GiB
filename=/dev/dax4.0

[d5]
size=7GiB
filename=/dev/dax5.0

[d6]
size=7GiB
filename=/dev/dax6.0

[d7]
size=7GiB
filename=/dev/dax7.0
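The eight [dN] sections can also be generated rather than typed out - a minimal sketch, assuming the /sys/class/dax/daxN.0/size layout mentioned in point 2 and an output file name of aggressive.fio (this uses each device's full byte size instead of the 7GiB above; fio's size= accepts a raw byte count):

#!/bin/bash
# append one [dN] job section per device DAX device, pulling each size from sysfs
for i in {0..7}
do
        sz=$(cat /sys/class/dax/dax$i.0/size)
        printf '[d%d]\nsize=%s\nfilename=/dev/dax%d.0\n\n' "$i" "$sz" "$i"
done >> aggressive.fio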