Sorry Elliott, it seems I missed your mail.

1. I recommend using allow_file_create=0 to ensure fio doesn't just create a plain file called "/dev/dax0.0" on your regular storage device and do its I/Os to that.

Done, no visible improvement detected:

dl560g10spmem01:~ # /usr/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=290g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0 --allow_file_create=0
4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=mmap, iodepth=1
...
fio-2.12
Starting 16 processes
Jobs: 14 (f=14): [m(10),_(1),m(4),_(1)] [78.1% done] [1569MB/29903MB/0KB /s] [30.1K/588K/0 iops] [eta 00m:34s]
4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=18988: Tue Nov 28 21:30:12 2017
  read : io=204940MB, bw=1707.9MB/s, iops=33575, runt=120001msec
    clat (usec): min=0, max=1008, avg= 8.85, stdev= 5.25
     lat (usec): min=0, max=1008, avg= 8.89, stdev= 5.25
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    2], 20.00th=[    4],
     | 30.00th=[    6], 40.00th=[    8], 50.00th=[    9], 60.00th=[   11],
     | 70.00th=[   11], 80.00th=[   12], 90.00th=[   13], 95.00th=[   14],
     | 99.00th=[   29], 99.50th=[   43], 99.90th=[   48], 99.95th=[   49],
     | 99.99th=[   53]
    bw (KB  /s): min=68034, max=158528, per=6.25%, avg=109348.46, stdev=19939.95
  write: io=3798.1GB, bw=32417MB/s, iops=637449, runt=120001msec
    clat (usec): min=0, max=616, avg=15.13, stdev=10.08
     lat (usec): min=0, max=717, avg=15.17, stdev=10.08
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    2], 10.00th=[    3], 20.00th=[    6],
     | 30.00th=[   11], 40.00th=[   12], 50.00th=[   12], 60.00th=[   20],
     | 70.00th=[   21], 80.00th=[   22], 90.00th=[   23], 95.00th=[   24],
     | 99.00th=[   46], 99.50th=[   86], 99.90th=[   91], 99.95th=[   92],
     | 99.99th=[   96]
    bw (MB  /s): min= 1359, max= 2700, per=6.25%, avg=2026.98, stdev=363.84
    lat (usec) : 2=4.70%, 4=8.45%, 10=11.33%, 20=32.66%, 50=42.21%
    lat (usec) : 100=0.65%, 250=0.01%, 500=0.01%, 750=0.01%
    lat (msec) : 2=0.01%
  cpu          : usr=99.74%, sys=0.21%, ctx=2538, majf=0, minf=2378756
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=4029100/w=76494589/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=204940MB, aggrb=1707.9MB/s, minb=1707.9MB/s, maxb=1707.9MB/s, mint=120001msec, maxt=120001msec
  WRITE: io=3798.1GB, aggrb=32417MB/s, minb=32417MB/s, maxb=32417MB/s, mint=120001msec, maxt=120001msec

HARDCLOCK entries
   Count     Pct  State  Function
    1955  64.97%   USER  __memcpy_avx_unaligned  [/lib64/libc-2.22.so]
    1037  34.46%   USER  UNKNOWN
       6   0.20%    SYS  do_page_fault
       5   0.17%    SYS  find_next_iomem_res
       3   0.10%    SYS  pagerange_is_ram_callback
       1   0.03%    SYS  page_add_new_anon_rmap
       1   0.03%    SYS  lookup_memtype
       1   0.03%    SYS  vmf_insert_pfn_pmd

   Count     Pct  HARDCLOCK Stack trace
   ============================================================
       4   0.13%  do_page_fault  page_fault  unknown  | __memcpy_avx_unaligned
       3   0.10%  find_next_iomem_res  pagerange_is_ram_callback  walk_system_ram_range  pat_pagerange_is_ram  lookup_memtype  track_pfn_insert  vmf_insert_pfn_pmd  dax_dev_pmd_fault  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  unknown  | __memcpy_avx_unaligned
       2   0.07%  pagerange_is_ram_callback  walk_system_ram_range  pat_pagerange_is_ram  lookup_memtype  track_pfn_insert  vmf_insert_pfn_pmd  dax_dev_pmd_fault  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  unknown  | __memcpy_avx_unaligned
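For what it's worth, a similar picture can be grabbed without the HARDCLOCK tooling - a minimal sketch with perf, assuming perf is installed and fio is the only fio process on the box:

    perf top -p "$(pgrep -d, -x fio)"
    # glibc 2.22 shows up here as __memcpy_avx_unaligned; a newer glibc built
    # with ERMS support typically dispatches to an *_erms variant instead
    # (the "rep movsb" path). Exact symbol names vary by glibc version.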
3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits us to 1GiB of mapped files in total". I expect real software using device DAX will want to map the entire device with one mmap() call, then perform loads, stores, cache flushes, etc. - things that the NVML libraries help it do correctly. As is, I think fio keeps mapping and unmapping, which exercises the kernel more than the hardware.

Yes, fio maps the entire device with a single mmap():

******** SYSTEM CALL REPORT ********
System Call Name     Count     Rate     ElpTime        Avg        Max   Errs   AvSz     KB/s
futex                    7      0.2    0.101176   0.014454   0.101147      1
   SLEEP                 2      0.0    0.101148   0.050574
   Sleep Func            2             0.101148   0.050574   0.101140   futex_wait_queue_me
   RUNQ                                0.000004
   CPU                                 0.000024
open                     2      0.0    0.000132   0.000066   0.000130      0
read                     1      0.0    0.000009   0.000009   0.000009      0      3      0.0
shmat                    1      0.0    0.000006   0.000006   0.000006      0
mmap                     1      0.0    0.000003   0.000003   0.000003      0

4. The CPU instructions used by fio depend on the glibc library version. As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so, random reads seem to be done with rep movsb.

The test box (SLES12 SP3) has glibc 2.22, so we have to update. But I can't understand what "random reads seem to be done with rep movsb" means.

-----Original Message-----
From: Elliott, Robert (Persistent Memory)
Sent: Tuesday, November 28, 2017 6:52 AM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@xxxxxxx>; Rebecca Cran <rebecca@xxxxxxxxxxxx>; Sitsofe Wheeler <sitsofe@xxxxxxxxx>
Cc: fio@xxxxxxxxxxxxxxx; Kani, Toshimitsu <toshi.kani@xxxxxxx>
Subject: RE: fio 3.2

> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Monday, November 27, 2017 7:13 PM
> Subject: RE: fio 3.2
>
> No, I have true 4 pmem devices right now, 300Gb each,

I think that suggestion was for Sitsofe.

> On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@xxxxxxxxx
> <mailto:sitsofe@xxxxxxxxx>> wrote:
>
> > Unfortunately I don't have access to a pmem device but let's see how
> > far we get:
> >
> > On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine)
> > <anton.gavriliuk@xxxxxxx <mailto:anton.gavriliuk@xxxxxxx>> wrote:
> >>
> >> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32
> >> --rw=randwrite --bs=64k --size=1024m --numjobs=8
> >> --group_reporting=1 --eta=never --time_based --runtime=60
> >> --filename=/dev/dax0.0 | grep
> >> WRITE)
...
> > Here you've switched ioengine introducing another place to look.
> > Instead how about this:
> > ...
> > (apparently a size has to be specified when you try to use a
> > character device - see https://nvdimm.wiki.kernel.org/ )
> > ...
> > (Perhaps the documentation for these ioengines and pmem devices
> > needs to be improved?)

There are several oddities with device DAX (/dev/dax0.0 character devices, providing a window to persistent memory representing the whole device) compared to filesystem DAX (mounting ext4 or xfs with -o dax and using mmap to access files).

1. I recommend using allow_file_create=0 to ensure fio doesn't just create a plain file called "/dev/dax0.0" on your regular storage device and do its I/Os to that.
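A quick way to confirm nothing like that has happened is to check that the target is still a character device - a minimal shell sketch, assuming standard coreutils:

    stat -c '%F %t:%T' /dev/dax0.0   # expect "character special file" plus major:minor
    ls -l /dev/dax0.0                # a stray regular file would show a byte size here instead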
2. This script will convert 16 pmem devices into device DAX mode (trashing all data in the process):

#!/bin/bash
for i in {0..15}
do
        echo working on $i
        # -M mem puts struct page data in regular memory, not nvdimm memory;
        # that's fine for NVDIMM-Ns (-M dev is intended for larger capacities)
        ndctl create-namespace -m dax -M mem -e namespace$i.0 -f
done

ndctl carves out some space for device DAX metadata. For 8 GiB of persistent memory, the resulting character device sizes are:
   -M mem: 8587837440 bytes = 8190 MiB (8 GiB - 2 MiB)
   -M dev: 8453619712 bytes = 8062 MiB (8 GiB - 2 MiB - 128 MiB)

Since /dev/dax is a character device, it has no "size", so you must tell fio the size manually. We should patch fio to automatically detect that from /sys/class/dax/dax0.0/size in Linux. I don't know how this would work in Windows.

3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits us to 1GiB of mapped files in total". I expect real software using device DAX will want to map the entire device with one mmap() call, then perform loads, stores, cache flushes, etc. - things that the NVML libraries help it do correctly. As is, I think fio keeps mapping and unmapping, which exercises the kernel more than the hardware.

4. The CPU instructions used by fio depend on the glibc library version. As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so, random reads seem to be done with rep movsb.

norandommap, randrepeat=0, zero_buffers, and gtod_reduce=1 help reduce fio overhead at these rates. I don't think iodepth is used by the mmap ioengine.

5. If the blocksize * number of threads exceeds the CPU cache size, you'll start generating lots of traffic to regular memory in addition to persistent memory.

Example: 36 threads generating 64 KiB reads from /dev/dax0.0 end up reading 11 GB/s from persistent memory; the writes stay in the CPU cache (no memory write traffic).

|---------------------------------------||---------------------------------------|
|--              Socket 0             --||--              Socket 1             --|
|---------------------------------------||---------------------------------------|
|--      Memory Channel Monitoring    --||--      Memory Channel Monitoring    --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch 0: Reads (MB/s): 11022.33  --||-- Mem Ch 0: Reads (MB/s):    15.60  --|
|--           Writes(MB/s):    22.29  --||--           Writes(MB/s):     7.09  --|
|-- Mem Ch 1: Reads (MB/s):    56.72  --||-- Mem Ch 1: Reads (MB/s):     9.34  --|
|--           Writes(MB/s):    17.03  --||--           Writes(MB/s):     0.81  --|
|-- Mem Ch 4: Reads (MB/s):    60.12  --||-- Mem Ch 4: Reads (MB/s):    15.79  --|
|--           Writes(MB/s):    23.66  --||--           Writes(MB/s):     7.10  --|
|-- Mem Ch 5: Reads (MB/s):    54.94  --||-- Mem Ch 5: Reads (MB/s):     9.65  --|
|--           Writes(MB/s):    17.30  --||--           Writes(MB/s):     1.02  --|
|-- NODE 0 Mem Read (MB/s) : 11194.11 --||-- NODE 1 Mem Read (MB/s) :    50.37 --|
|-- NODE 0 Mem Write(MB/s) :    80.28 --||-- NODE 1 Mem Write(MB/s) :    16.02 --|
|-- NODE 0 P. Write (T/s)  :   192832 --||-- NODE 1 P. Write (T/s)  :   190615 --|
|-- NODE 0 Memory (MB/s)   : 11274.38 --||-- NODE 1 Memory (MB/s)   :    66.39 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--            System Read Throughput(MB/s):  11332.59                          --|
|--           System Write Throughput(MB/s):     96.23                          --|
|--          System Memory Throughput(MB/s):  11428.82                          --|
|---------------------------------------||---------------------------------------|
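A back-of-the-envelope check on those working sets (my own rough numbers for the 64 KiB case above and the 2 MiB case below; all 36 jobs sit on socket 0, so compare against whatever L3 size lscpu reports):

    lscpu | grep 'L3 cache'                   # per-socket L3 on this box
    echo "$(( 64 * 36 )) KiB working set"     # 64 KiB x 36 jobs = 2304 KiB, fits in L3
    echo "$(( 2048 * 36 )) KiB working set"   # 2 MiB x 36 jobs = 73728 KiB, far exceeds L3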
Increasing to 2 MiB reads, the bandwidth from persistent memory drops to 9 GB/s, and most of that traffic turns into writes to regular memory (with some reads as well, as the caches thrash).

|---------------------------------------||---------------------------------------|
|--              Socket 0             --||--              Socket 1             --|
|---------------------------------------||---------------------------------------|
|--      Memory Channel Monitoring    --||--      Memory Channel Monitoring    --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch 0: Reads (MB/s):  9069.96  --||-- Mem Ch 0: Reads (MB/s):    18.33  --|
|--           Writes(MB/s):  2026.71  --||--           Writes(MB/s):     8.44  --|
|-- Mem Ch 1: Reads (MB/s):  1070.84  --||-- Mem Ch 1: Reads (MB/s):    11.08  --|
|--           Writes(MB/s):  2021.61  --||--           Writes(MB/s):     2.11  --|
|-- Mem Ch 4: Reads (MB/s):  1069.20  --||-- Mem Ch 4: Reads (MB/s):    17.85  --|
|--           Writes(MB/s):  2028.27  --||--           Writes(MB/s):     8.51  --|
|-- Mem Ch 5: Reads (MB/s):  1062.81  --||-- Mem Ch 5: Reads (MB/s):    11.59  --|
|--           Writes(MB/s):  2021.74  --||--           Writes(MB/s):     2.31  --|
|-- NODE 0 Mem Read (MB/s) : 12272.82 --||-- NODE 1 Mem Read (MB/s) :    58.85 --|
|-- NODE 0 Mem Write(MB/s) :  8098.33 --||-- NODE 1 Mem Write(MB/s) :    21.36 --|
|-- NODE 0 P. Write (T/s)  :   220528 --||-- NODE 1 P. Write (T/s)  :   190643 --|
|-- NODE 0 Memory (MB/s)   : 20371.14 --||-- NODE 1 Memory (MB/s)   :    80.21 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--            System Read Throughput(MB/s):  12331.67                          --|
|--           System Write Throughput(MB/s):   8119.69                          --|
|--          System Memory Throughput(MB/s):  20451.36                          --|
|---------------------------------------||---------------------------------------|

Example simple script:

[random-test]
kb_base=1000
ioengine=mmap
iodepth=1
rw=randread
bs=64KiB
size=7GiB
numjobs=36
cpus_allowed_policy=split
cpus_allowed=0-17,36-53
group_reporting=1
time_based
runtime=60
allow_file_create=0
filename=/dev/dax0.0

Example more aggressive script:

[global]
kb_base=1000
ioengine=mmap
iodepth=1
rw=randread
bs=64KiB
ba=64KiB
numjobs=36
cpus_allowed_policy=split
cpus_allowed=0-17,36-53
group_reporting=1
norandommap
randrepeat=0
zero_buffers
gtod_reduce=1
time_based
runtime=60999
allow_file_create=0

[d0]
size=7GiB
filename=/dev/dax0.0

[d1]
size=7GiB
filename=/dev/dax1.0

[d2]
size=7GiB
filename=/dev/dax2.0

[d3]
size=7GiB
filename=/dev/dax3.0

[d4]
size=7GiB
filename=/dev/dax4.0

[d5]
size=7GiB
filename=/dev/dax5.0

[d6]
size=7GiB
filename=/dev/dax6.0

[d7]
size=7GiB
filename=/dev/dax7.0
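The eight [dN] sections can also be generated rather than typed out - a minimal sketch, assuming the /sys/class/dax/daxN.0/size layout mentioned in point 2 and an output file name of aggressive.fio (this uses each device's full byte size instead of the 7GiB above; fio's size= accepts a raw byte count):

#!/bin/bash
# append one [dN] job section per device DAX device, pulling each size from sysfs
for i in {0..7}
do
        sz=$(cat /sys/class/dax/dax$i.0/size)
        printf '[d%d]\nsize=%s\nfilename=/dev/dax%d.0\n\n' "$i" "$sz" "$i"
done >> aggressive.fio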