RE: fio 3.2

> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Monday, November 27, 2017 7:13 PM
> Subject: RE: fio 3.2
> 
> No, I have true 4 pmem devices right now, 300Gb each,

I think that suggestion was for Sitsofe.


> On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@xxxxxxxxx
> <mailto:sitsofe@xxxxxxxxx>> wrote:
> 
> > Unfortunately I don't have access to a pmem device but let's see how
> > far we get:
> >
> > On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine)
> > <anton.gavriliuk@xxxxxxx <mailto:anton.gavriliuk@xxxxxxx>> wrote:
> >>
> >> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32
> >> --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1
> >> --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep
> >> WRITE)
...
> > Here you've switched ioengine introducing another place to look.
> > Instead how about this:
> >
...
> > (apparently a size has to be specified when you try to use a character
> > device - see https://nvdimm.wiki.kernel.org/ )
> >
...
> > (Perhaps the documentation for these ioengines and pmem devices needs
> > to be improved?)

There are several oddities with device DAX (/dev/dax0.0 character devices,
providing a window to persistent memory representing the whole device) compared
to filesystem DAX (mounting ext4 or xfs with -o dax and using mmap to access
files).

1. I recommend using allow_file_create=0 to ensure fio doesn't just create
a plain file called "/dev/dax0.0" on your regular storage device and do its
I/Os to that.
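
A pre-flight check along these lines can catch that case before fio even
starts. This is only a sketch: the helper name is_char_device is made up for
illustration, and it just wraps the shell's -c test.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the path exists and is a
# character device (a plain file created by mistake will fail this).
is_char_device() {
    [ -c "$1" ]
}

# Example use (device path taken from this thread):
# is_char_device /dev/dax0.0 || echo "/dev/dax0.0 is not a character device" >&2
```

Running it before each fio invocation is cheap, and it complements
allow_file_create=0 rather than replacing it.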


2. This script will convert 16 pmem devices into device DAX mode (trashing
all data in the process):

#!/bin/bash
for i in {0..15}
do
        echo "working on $i"
        # -M mem puts struct page data in regular memory, not NVDIMM memory;
        # that's fine for NVDIMM-Ns. -M dev is intended for larger capacities.
        ndctl create-namespace -m dax -M mem -e "namespace$i.0" -f
done

ndctl carves out some space for device DAX metadata. For 8 GiB of
persistent memory, the resulting character device sizes are:
    -M mem: 8587837440 bytes = 8190 MiB (8 GiB - 2 MiB)
    -M dev: 8453619712 bytes = 8062 MiB (8 GiB - 2 MiB - 128 MiB)
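
Those two figures can be reproduced with shell arithmetic. The sketch below
assumes only the overheads quoted above (2 MiB, plus 128 MiB for -M dev on an
8 GiB device); other capacities may reserve different amounts.

```shell
#!/bin/bash
# Reproduce the device DAX sizes quoted above for 8 GiB of persistent memory.
MiB=$((1024 * 1024))
GiB=$((1024 * MiB))

mem_bytes=$((8 * GiB - 2 * MiB))              # -M mem
dev_bytes=$((8 * GiB - 2 * MiB - 128 * MiB))  # -M dev

echo "-M mem: $mem_bytes bytes = $((mem_bytes / MiB)) MiB"
echo "-M dev: $dev_bytes bytes = $((dev_bytes / MiB)) MiB"
```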

Since /dev/dax0.0 is a character device, it has no "size" that fio can
query, so you must tell fio the size manually. We should patch fio to detect
it automatically from /sys/class/dax/dax0.0/size on Linux.  I don't know how
this would work on Windows.
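
Until fio does that itself, a wrapper can read the value and pass it in. This
is a rough sketch, assuming the size attribute is exposed at the sysfs path
named above; the helper name dax_size_bytes and its second argument are
hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the size in bytes of a device DAX instance.
# $1 = instance name (e.g. dax0.0)
# $2 = sysfs directory, defaulting to the /sys/class/dax path named above
dax_size_bytes() {
    local name="$1"
    local sysfs="${2:-/sys/class/dax}"
    cat "$sysfs/$name/size"
}

# Example use:
# fio --name=t --ioengine=mmap --rw=randread --bs=64k \
#     --allow_file_create=0 --filename=/dev/dax0.0 \
#     --size="$(dax_size_bytes dax0.0)"
```

The sysfs directory is a parameter only so the helper can be pointed at a
different location if a given kernel exposes the attribute elsewhere.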


3. One possible issue is that the mmap ioengine's MMAP_TOTAL_SZ define
"limits us to 1GiB of mapped files in total".

I expect real software using device DAX will want to map the entire device
with one mmap() call, then perform loads, stores, cache flushes, etc. -
things that the NVML libraries help it do correctly.  

As is, I think fio keeps mapping and unmapping, which exercises the kernel
more than the hardware.


4. The CPU instructions used by fio depend on the glibc version. As
mentioned in an earlier fio thread, the implementation changes a lot between
versions. With libc-2.24.so, random reads seem to be done with rep movsb.

norandommap, randrepeat=0, zero_buffers, and gtod_reduce=1 help reduce
fio overhead at these rates.

I don't think iodepth is used by the mmap ioengine.


5. If the blocksize * number of threads exceeds the CPU cache size, you'll start
generating lots of traffic to regular memory in addition to persistent memory.

Example: 36 threads generating 64 KiB reads from /dev/dax0.0 end up reading
11 GB/s from persistent memory; the resulting buffer writes stay in the CPU
cache (no memory write traffic).

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s): 11022.33 --||-- Mem Ch  0: Reads (MB/s):    15.60 --|
|--            Writes(MB/s):    22.29 --||--            Writes(MB/s):     7.09 --|
|-- Mem Ch  1: Reads (MB/s):    56.72 --||-- Mem Ch  1: Reads (MB/s):     9.34 --|
|--            Writes(MB/s):    17.03 --||--            Writes(MB/s):     0.81 --|
|-- Mem Ch  4: Reads (MB/s):    60.12 --||-- Mem Ch  4: Reads (MB/s):    15.79 --|
|--            Writes(MB/s):    23.66 --||--            Writes(MB/s):     7.10 --|
|-- Mem Ch  5: Reads (MB/s):    54.94 --||-- Mem Ch  5: Reads (MB/s):     9.65 --|
|--            Writes(MB/s):    17.30 --||--            Writes(MB/s):     1.02 --|
|-- NODE 0 Mem Read (MB/s) : 11194.11 --||-- NODE 1 Mem Read (MB/s) :    50.37 --|
|-- NODE 0 Mem Write(MB/s) :    80.28 --||-- NODE 1 Mem Write(MB/s) :    16.02 --|
|-- NODE 0 P. Write (T/s):     192832 --||-- NODE 1 P. Write (T/s):     190615 --|
|-- NODE 0 Memory (MB/s):    11274.38 --||-- NODE 1 Memory (MB/s):       66.39 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):  11332.59                  --|
|--                  System Write Throughput(MB/s):     96.23                  --|
|--                 System Memory Throughput(MB/s):  11428.82                  --|
|---------------------------------------||---------------------------------------|


Increasing to 2 MiB reads, the bandwidth from persistent memory drops to
9 GB/s, and most of those reads turn into writes to regular memory (with some
reads as well, as the caches thrash).

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):  9069.96 --||-- Mem Ch  0: Reads (MB/s):    18.33 --|
|--            Writes(MB/s):  2026.71 --||--            Writes(MB/s):     8.44 --|
|-- Mem Ch  1: Reads (MB/s):  1070.84 --||-- Mem Ch  1: Reads (MB/s):    11.08 --|
|--            Writes(MB/s):  2021.61 --||--            Writes(MB/s):     2.11 --|
|-- Mem Ch  4: Reads (MB/s):  1069.20 --||-- Mem Ch  4: Reads (MB/s):    17.85 --|
|--            Writes(MB/s):  2028.27 --||--            Writes(MB/s):     8.51 --|
|-- Mem Ch  5: Reads (MB/s):  1062.81 --||-- Mem Ch  5: Reads (MB/s):    11.59 --|
|--            Writes(MB/s):  2021.74 --||--            Writes(MB/s):     2.31 --|
|-- NODE 0 Mem Read (MB/s) : 12272.82 --||-- NODE 1 Mem Read (MB/s) :    58.85 --|
|-- NODE 0 Mem Write(MB/s) :  8098.33 --||-- NODE 1 Mem Write(MB/s) :    21.36 --|
|-- NODE 0 P. Write (T/s):     220528 --||-- NODE 1 P. Write (T/s):     190643 --|
|-- NODE 0 Memory (MB/s):    20371.14 --||-- NODE 1 Memory (MB/s):       80.21 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):  12331.67                  --|
|--                  System Write Throughput(MB/s):   8119.69                  --|
|--                 System Memory Throughput(MB/s):  20451.36                  --|
|---------------------------------------||---------------------------------------|
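
The two cases above can be compared against a cache size with simple
arithmetic. This is a sketch only: both helper names are hypothetical, and the
32 MiB cache size is an assumed placeholder, not a value measured on this
system (on glibc systems, getconf LEVEL3_CACHE_SIZE may report the real one).

```shell
#!/bin/bash
# Hypothetical helper: total bytes of I/O buffers across all jobs.
# $1 = block size in bytes, $2 = number of jobs
buffer_footprint() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: say whether that footprint fits in a cache of $3 bytes.
fits_in_cache() {
    if [ "$(buffer_footprint "$1" "$2")" -le "$3" ]; then
        echo fits
    else
        echo exceeds
    fi
}

KiB=1024
MiB=$((1024 * KiB))
# Assumed placeholder cache size, not taken from this thread.
cache=$((32 * MiB))

echo "64 KiB x 36 jobs: $(fits_in_cache $((64 * KiB)) 36 "$cache")"
echo "2 MiB x 36 jobs:  $(fits_in_cache $((2 * MiB)) 36 "$cache")"
```

With that placeholder, 36 jobs at 64 KiB need 2.25 MiB of buffers while 36
jobs at 2 MiB need 72 MiB, which is the difference the two tables illustrate.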


Example simple script:

[random-test]
kb_base=1000
ioengine=mmap
iodepth=1
rw=randread
bs=64KiB
size=7GiB
numjobs=36
cpus_allowed_policy=split
cpus_allowed=0-17,36-53
group_reporting=1
time_based
runtime=60
allow_file_create=0
filename=/dev/dax0.0


Example more aggressive script:
[global]
kb_base=1000
ioengine=mmap
iodepth=1
rw=randread
bs=64KiB
ba=64KiB
numjobs=36
cpus_allowed_policy=split
cpus_allowed=0-17,36-53
group_reporting=1
norandommap
randrepeat=0
zero_buffers
gtod_reduce=1
time_based
runtime=60999
allow_file_create=0

[d0]
size=7GiB
filename=/dev/dax0.0

[d1]
size=7GiB
filename=/dev/dax1.0

[d2]
size=7GiB
filename=/dev/dax2.0

[d3]
size=7GiB
filename=/dev/dax3.0

[d4]
size=7GiB
filename=/dev/dax4.0

[d5]
size=7GiB
filename=/dev/dax5.0

[d6]
size=7GiB
filename=/dev/dax6.0

[d7]
size=7GiB
filename=/dev/dax7.0
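
The per-device sections are repetitive enough to generate. A minimal sketch,
assuming the devices are numbered consecutively from dax0.0 as above; the
helper name emit_dax_jobs and the output file name are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print one fio job section per device DAX instance.
# $1 = number of devices, $2 = size value to use for each job
emit_dax_jobs() {
    local count="$1"
    local size="$2"
    local i=0
    while [ "$i" -lt "$count" ]; do
        printf '[d%d]\nsize=%s\nfilename=/dev/dax%d.0\n\n' "$i" "$size" "$i"
        i=$((i + 1))
    done
}

# Example use: append eight sections to a file that already holds [global].
# emit_dax_jobs 8 7GiB >> dax.fio
```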



