> -----Original Message-----
> From: Gavriliuk, Anton (HPS Ukraine)
> Sent: Monday, November 27, 2017 7:13 PM
> Subject: RE: fio 3.2
>
> No, I have true 4 pmem devices right now, 300Gb each, I think that suggestion was for Sitsofe.

> On Nov 27, 2017, at 12:38 PM, Sitsofe Wheeler <sitsofe@xxxxxxxxx
> <mailto:sitsofe@xxxxxxxxx>> wrote:
>
> > Unfortunately I don't have access to a pmem device but let's see how
> > far we get:
> >
> > On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine)
> > <anton.gavriliuk@xxxxxxx <mailto:anton.gavriliuk@xxxxxxx>> wrote:
> >>
> >> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32
> >> --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1
> >> --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep
> >> WRITE)
...
> > Here you've switched ioengine introducing another place to look.
> > Instead how about this:
> > ...
> > (apparently a size has to be specified when you try to use a character
> > device - see https://nvdimm.wiki.kernel.org/ )
> > ...
> > (Perhaps the documentation for these ioengines and pmem devices needs
> > to be improved?)

There are several oddities with device DAX (/dev/dax0.0 character devices, providing a window to persistent memory representing the whole device) compared to filesystem DAX (mounting ext4 or xfs with -o dax and using mmap to access files).

1. I recommend using allow_file_create=0 to ensure fio doesn't just create a plain file called "/dev/dax0.0" on your regular storage device and do its I/Os to that.

2. This script will convert 16 pmem devices into device DAX mode (trashing all data in the process):

    #!/bin/bash
    for i in {0..15}
    do
        echo working on $i
        # -M mem puts struct page data in regular memory, not nvdimm memory
        # that's fine for NVDIMM-Ns; -M dev is intended for larger capacities
        ndctl create-namespace -m dax -M mem -e namespace$i.0 -f
    done

ndctl carves out some space for device DAX metadata.
For 8 GiB of persistent memory, the resulting character device sizes are:

    -M mem: 8587837440 bytes = 8190 MiB (8 GiB - 2 MiB)
    -M dev: 8453619712 bytes = 8062 MiB (8 GiB - 2 MiB - 128 MiB)

Since /dev/dax is a character device, it has no "size", so you must manually tell fio the size. We should patch fio to automatically detect that from /sys/class/dax/dax0.0/size in Linux. I don't know how this would work in Windows.

3. One possible issue is the mmap ioengine MMAP_TOTAL_SZ define "limits us to 1GiB of mapped files in total". I expect real software using device DAX will want to map the entire device with one mmap() call, then perform loads, stores, cache flushes, etc. - things that the NVML libraries help it do correctly. As is, I think fio keeps mapping and unmapping, which exercises the kernel more than the hardware.

4. The CPU instructions used by fio depend on the glibc library version. As mentioned in an earlier fio thread, that changes a lot. With libc2.24.so, random reads seem to be done with rep movsb.

norandommap, randrepeat=0, zero_buffers, and gtod_reduce=1 help reduce fio overhead at these rates.

I don't think iodepth is used by the mmap ioengine.

5. If the blocksize * number of threads exceeds the CPU cache size, you'll start generating lots of traffic to regular memory in addition to persistent memory.

Example: 36 threads generating 64 KiB reads from /dev/dax0.0 end up reading 11 GB/s from persistent memory; the writes stay in CPU cache (no memory write traffic).
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s): 11022.33 --||-- Mem Ch  0: Reads (MB/s):    15.60 --|
|--            Writes(MB/s):    22.29 --||--            Writes(MB/s):     7.09 --|
|-- Mem Ch  1: Reads (MB/s):    56.72 --||-- Mem Ch  1: Reads (MB/s):     9.34 --|
|--            Writes(MB/s):    17.03 --||--            Writes(MB/s):     0.81 --|
|-- Mem Ch  4: Reads (MB/s):    60.12 --||-- Mem Ch  4: Reads (MB/s):    15.79 --|
|--            Writes(MB/s):    23.66 --||--            Writes(MB/s):     7.10 --|
|-- Mem Ch  5: Reads (MB/s):    54.94 --||-- Mem Ch  5: Reads (MB/s):     9.65 --|
|--            Writes(MB/s):    17.30 --||--            Writes(MB/s):     1.02 --|
|-- NODE 0 Mem Read (MB/s) : 11194.11 --||-- NODE 1 Mem Read (MB/s) :    50.37 --|
|-- NODE 0 Mem Write(MB/s) :    80.28 --||-- NODE 1 Mem Write(MB/s) :    16.02 --|
|-- NODE 0 P. Write (T/s):     192832 --||-- NODE 1 P. Write (T/s):     190615 --|
|-- NODE 0 Memory (MB/s):    11274.38 --||-- NODE 1 Memory (MB/s):       66.39 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                 System Read Throughput(MB/s):  11332.59                    --|
|--                System Write Throughput(MB/s):     96.23                    --|
|--               System Memory Throughput(MB/s):  11428.82                    --|
|---------------------------------------||---------------------------------------|

Increasing to 2 MiB reads, the bandwidth from persistent memory drops to 9 GB/s, and most of those reads turn into writes to regular memory (with some reads as well, as the caches thrash).
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):  9069.96 --||-- Mem Ch  0: Reads (MB/s):    18.33 --|
|--            Writes(MB/s):  2026.71 --||--            Writes(MB/s):     8.44 --|
|-- Mem Ch  1: Reads (MB/s):  1070.84 --||-- Mem Ch  1: Reads (MB/s):    11.08 --|
|--            Writes(MB/s):  2021.61 --||--            Writes(MB/s):     2.11 --|
|-- Mem Ch  4: Reads (MB/s):  1069.20 --||-- Mem Ch  4: Reads (MB/s):    17.85 --|
|--            Writes(MB/s):  2028.27 --||--            Writes(MB/s):     8.51 --|
|-- Mem Ch  5: Reads (MB/s):  1062.81 --||-- Mem Ch  5: Reads (MB/s):    11.59 --|
|--            Writes(MB/s):  2021.74 --||--            Writes(MB/s):     2.31 --|
|-- NODE 0 Mem Read (MB/s) : 12272.82 --||-- NODE 1 Mem Read (MB/s) :    58.85 --|
|-- NODE 0 Mem Write(MB/s) :  8098.33 --||-- NODE 1 Mem Write(MB/s) :    21.36 --|
|-- NODE 0 P. Write (T/s):     220528 --||-- NODE 1 P. Write (T/s):     190643 --|
|-- NODE 0 Memory (MB/s):    20371.14 --||-- NODE 1 Memory (MB/s):       80.21 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                 System Read Throughput(MB/s):  12331.67                    --|
|--                System Write Throughput(MB/s):   8119.69                    --|
|--               System Memory Throughput(MB/s):  20451.36                    --|
|---------------------------------------||---------------------------------------|

Example simple script:

    [random-test]
    kb_base=1000
    ioengine=mmap
    iodepth=1
    rw=randread
    bs=64KiB
    size=7GiB
    numjobs=36
    cpus_allowed_policy=split
    cpus_allowed=0-17,36-53
    group_reporting=1
    time_based
    runtime=60
    allow_file_create=0
    filename=/dev/dax0.0

Example more aggressive script:

    [global]
    kb_base=1000
    ioengine=mmap
    iodepth=1
    rw=randread
    bs=64KiB
    ba=64KiB
    numjobs=36
    cpus_allowed_policy=split
    cpus_allowed=0-17,36-53
    group_reporting=1
    norandommap
    randrepeat=0
    zero_buffers
    gtod_reduce=1
    time_based
    runtime=60999
    allow_file_create=0

    [d0]
    size=7GiB
    filename=/dev/dax0.0

    [d1]
    size=7GiB
    filename=/dev/dax1.0

    [d2]
    size=7GiB
    filename=/dev/dax2.0

    [d3]
    size=7GiB
    filename=/dev/dax3.0

    [d4]
    size=7GiB
    filename=/dev/dax4.0

    [d5]
    size=7GiB
    filename=/dev/dax5.0

    [d6]
    size=7GiB
    filename=/dev/dax6.0

    [d7]
    size=7GiB
    filename=/dev/dax7.0
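
Until fio detects the device DAX size itself (point 2), a wrapper script could read it from sysfs and pass it on the command line. The following is only a rough sketch: the sysfs path is the one mentioned in point 2, and I am assuming it holds the size in bytes as a plain decimal number. The function name dax_size_bytes and its optional second argument are made up for illustration and are not part of fio or ndctl.

```shell
#!/bin/bash
# Sketch: print the size in bytes of a device DAX character device,
# as reported by sysfs (assumed layout: /sys/class/dax/<name>/size).
# dax_size_bytes and the sysfs_root argument are hypothetical names.

dax_size_bytes() {
    local dev="$1"                            # e.g. /dev/dax0.0
    local sysfs_root="${2:-/sys/class/dax}"   # overridable, mainly for testing
    local size_file
    size_file="$sysfs_root/$(basename "$dev")/size"

    if [ ! -r "$size_file" ]; then
        echo "cannot read $size_file" >&2
        return 1
    fi
    cat "$size_file"
}

# Possible use (not executed here):
#   size=$(dax_size_bytes /dev/dax0.0) && \
#       fio --name=random-test --ioengine=mmap --rw=randread --bs=64k \
#           --size="$size" --allow_file_create=0 --filename=/dev/dax0.0
```

The function fails instead of falling back to a default when the sysfs entry is missing, so a mistyped device name cannot silently run with a wrong size.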