OK, I fixed that now – it seems the order of the options is important. Now I have the next issue :-) I'm trying to speed up a customer's script. Initially it ran against /dev/pmem with the libaio engine:

dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # /usr/bin/fio --filename=/dev/pmem1 --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --iodepth=1 --numjobs=16 --runtime=1800 --group_reporting --name=4-rand-rw-3xx

4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=libaio, iodepth=1
...
fio-2.12
Starting 16 processes
Jobs: 2 (f=2): [_(5),m(1),_(7),m(1),_(2)] [100.0% done] [357.4MB/6990MB/0KB /s] [7066/137K/0 iops] [eta 00m:00s]
4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=23144: Mon Nov 27 19:56:04 2017
  read : io=245977MB, bw=1592.8MB/s, iops=31315, runt=154436msec
    slat (usec): min=1, max=77, avg=10.34, stdev= 6.00
    clat (usec): min=0, max=6, avg= 0.20, stdev= 0.40
     lat (usec): min=1, max=77, avg=10.57, stdev= 6.01
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
     | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
     | 70.00th=[    0], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
     | 99.00th=[    1], 99.50th=[    1], 99.90th=[    1], 99.95th=[    1],
     | 99.99th=[    1]
    bw (KB  /s): min=38528, max=118396, per=6.36%, avg=103720.16, stdev=8779.81
  write: io=4559.9GB, bw=30234MB/s, iops=594518, runt=154436msec
    slat (usec): min=1, max=787, avg=16.46, stdev=10.00
    clat (usec): min=0, max=680, avg= 0.20, stdev= 0.41
     lat (usec): min=1, max=789, avg=16.69, stdev=10.00
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
     | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
     | 70.00th=[    0], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
     | 99.00th=[    1], 99.50th=[    1], 99.90th=[    1], 99.95th=[    1],
     | 99.99th=[    1]
    bw (MB  /s): min=  731, max= 2015, per=6.36%, avg=1922.69, stdev=146.41
    lat (usec) : 2=100.00%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (usec) : 250=0.01%, 750=0.01%
  cpu          : usr=35.61%, sys=64.37%, ctx=2509, majf=0, minf=930
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=4836244/w=91815119/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=245977MB, aggrb=1592.8MB/s, minb=1592.8MB/s, maxb=1592.8MB/s, mint=154436msec, maxt=154436msec
  WRITE: io=4559.9GB, aggrb=30234MB/s, minb=30234MB/s, maxb=30234MB/s, mint=154436msec, maxt=154436msec
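For readability, the same customer workload expressed as a fio job file – a sketch equivalent to the command line above (the file name is illustrative):

; 4-rand-rw-3xx.fio – sketch, same options as the command line above
[4-rand-rw-3xx]
filename=/dev/pmem1
direct=1
rw=randrw
refill_buffers
norandommap
randrepeat=0
ioengine=libaio
bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1
rwmixread=5
iodepth=1
numjobs=16
runtime=1800
group_reporting

Run it with: /usr/bin/fio 4-rand-rw-3xx.fio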
Now I want to keep the customer's I/O pattern but switch to DAX access (the raw /dev/dax0.0 character device) with the mmap engine, trying to bypass the OS block-device and system-call layers. So I modified the script:

dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # /usr/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=290g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0

4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=mmap, iodepth=1
...
fio-2.12
Starting 16 processes
Jobs: 12 (f=12): [m(3),_(2),m(1),_(1),m(6),_(1),m(2)] [78.6% done] [1304MB/24907MB/0KB /s] [25.7K/490K/0 iops] [eta 00m:33s]
4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=65032: Tue Nov 28 06:58:09 2017
  read : io=202831MB, bw=1690.3MB/s, iops=33231, runt=120001msec
    clat (usec): min=0, max=211, avg= 8.88, stdev= 5.23
     lat (usec): min=0, max=211, avg= 8.92, stdev= 5.23
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    2], 20.00th=[    4],
     | 30.00th=[    6], 40.00th=[    8], 50.00th=[   10], 60.00th=[   11],
     | 70.00th=[   11], 80.00th=[   12], 90.00th=[   13], 95.00th=[   14],
     | 99.00th=[   29], 99.50th=[   43], 99.90th=[   48], 99.95th=[   49],
     | 99.99th=[   53]
    bw (KB  /s): min=86408, max=159184, per=6.36%, avg=110086.50, stdev=19409.13
  write: io=3764.3GB, bw=32121MB/s, iops=631607, runt=120001msec
    clat (usec): min=0, max=744, avg=15.16, stdev=10.09
     lat (usec): min=0, max=744, avg=15.21, stdev=10.09
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    2], 10.00th=[    3], 20.00th=[    6],
     | 30.00th=[   11], 40.00th=[   12], 50.00th=[   12], 60.00th=[   20],
     | 70.00th=[   21], 80.00th=[   22], 90.00th=[   23], 95.00th=[   24],
     | 99.00th=[   46], 99.50th=[   86], 99.90th=[   91], 99.95th=[   92],
     | 99.99th=[   96]
    bw (MB  /s): min= 1760, max= 2737, per=6.36%, avg=2043.01, stdev=353.05
    lat (usec) : 2=4.63%, 4=8.45%, 10=11.27%, 20=32.41%, 50=42.58%
    lat (usec) : 100=0.66%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu          : usr=99.74%, sys=0.24%, ctx=2791, majf=0, minf=2822710
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=3987782/w=75793478/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=202831MB, aggrb=1690.3MB/s, minb=1690.3MB/s, maxb=1690.3MB/s, mint=120001msec, maxt=120001msec
  WRITE: io=3764.3GB, aggrb=32121MB/s, minb=32121MB/s, maxb=32121MB/s, mint=120001msec, maxt=120001msec

No huge gain detected – total write latency 16.69 usec vs 15.21 usec, write bandwidth 30234MB/s vs 32121MB/s.
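(Note that the CPU split moved from usr=35.61%/sys=64.37% to usr=99.74%/sys=0.24%, and minf shows ~2.8M minor faults, so the syscall layer really is gone but page faults took its place. A quick way to cross-check this while such a job runs is to attach perf for a few seconds – a sketch, assuming perf is installed and picking one fio worker PID with pgrep:)

perf stat -e page-faults,context-switches -p $(pgrep -x fio | head -1) sleep 10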
Looking at the performance traces:

PID 65043  /usr/bin/fio

HARDCLOCK entries
Count     Pct  State  Function
 2210  66.29%  USER   __memcpy_avx_unaligned  [/lib64/libc-2.22.so]
 1122  33.65%  USER   UNKNOWN
    2   0.06%  SYS    pagerange_is_ram_callback

Count     Pct  HARDCLOCK Stack trace
============================================================
    2   0.06%  pagerange_is_ram_callback
                 walk_system_ram_range
                 pat_pagerange_is_ram
                 lookup_memtype
                 track_pfn_insert
                 vmf_insert_pfn_pmd
                 dax_dev_pmd_fault
                 handle_mm_fault
                 __do_page_fault
                 do_page_fault
                 page_fault
                 unknown
                 |
                 __memcpy_avx_unaligned

25.620925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
25.644924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
25.656923 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
25.680925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
25.692924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
25.704925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
25.716925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
25.728923 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
25.764924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
25.776924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6

There seem to be a lot of unaligned memory accesses – any ideas?

You could also look at the source code for fio to see what it's doing, or put in some printf() statements so you can understand what the actual alignment is.

Anton
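A minimal sketch of such a printf() check – the helper below is hypothetical, not actual fio code; the interesting call site would be wherever fio's mmap engine performs its memcpy():

/* align_check.c – sketch: print how a buffer address is aligned, to see
 * whether the copies feeding __memcpy_avx_unaligned are cache-line (64B)
 * or page (4K) aligned. Compiles standalone (gcc align_check.c), or lift
 * log_alignment() into fio next to the memcpy call. */
#include <stdio.h>
#include <stdint.h>

static void log_alignment(const char *tag, const void *buf)
{
        uintptr_t a = (uintptr_t)buf;
        printf("%s: addr=%p mod64=%lu mod4096=%lu\n", tag, buf,
               (unsigned long)(a % 64), (unsigned long)(a % 4096));
}

int main(void)
{
        static char buf[8192] __attribute__((aligned(4096)));
        log_alignment("aligned", buf);
        log_alignment("misaligned", buf + 5);  /* deliberately off by 5 */
        return 0;
}

One caveat: on many glibc builds __memcpy_avx_unaligned is simply the memcpy variant selected for AVX-capable CPUs (the "unaligned" refers to the unaligned-load instructions it uses), so its presence in the profile does not by itself prove that the buffers are misaligned.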
-----Original Message-----
From: Sitsofe Wheeler [mailto:sitsofe@xxxxxxxxx]
Sent: Monday, November 27, 2017 9:38 PM
To: Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@xxxxxxx>
Cc: fio@xxxxxxxxxxxxxxx; Brian Boylston <brian.boylston@xxxxxxx>; Elliott, Robert (Persistent Memory) <elliott@xxxxxxx>
Subject: Re: fio 3.2

Unfortunately I don't have access to a pmem device, but let's see how far we get:

On 27 November 2017 at 12:39, Gavriliuk, Anton (HPS Ukraine) <anton.gavriliuk@xxxxxxx> wrote:
>
> result=$(fio --name=random-writers --ioengine=mmap --iodepth=32
> --rw=randwrite --bs=64k --size=1024m --numjobs=8 --group_reporting=1
> --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 | grep
> WRITE)

Please make your problem scenarios as simple as possible:

1. Just run fio normally so we can see the output it produces on both stdout and stderr.
2. Reduce the job to the bare minimum of options that still reproduces the problem.
3. Try to avoid changing lots of things at once. Here you've switched ioengine, introducing another place to look.

Instead, how about this: was your /dev/dax0.0 device made using ndctl? Assuming yes:

fio --name=dax-mmap --ioengine=mmap --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g

(Apparently a size has to be specified when you try to use a character device - see https://nvdimm.wiki.kernel.org/ .)

If you run just that by itself, what do you see? Next up:

fio --name=dax-dax --ioengine=dev-dax --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/dev/dax0.0 --size=2g

If you run just that by itself, what do you see?

Finally, assuming a -o dax mounted filesystem on /pmem0/:

fio --name=libpmemblk --ioengine=pmemblk --rw=write --bs=64k --eta=never --time_based --runtime=60 --filename=/pmem0/fio-test,4096,1024 --thread=1

If you run just that by itself, what do you see?

(Perhaps the documentation for these ioengines and pmem devices needs to be improved?)

--
Sitsofe | http://sucs.org/~sits/
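For completeness, the devices these tests assume are typically created along these lines – a sketch; namespace, device, and mount point names are illustrative:

# character device for the mmap / dev-dax runs
ndctl create-namespace --mode=devdax     # produces e.g. /dev/dax0.0

# DAX-mounted filesystem for the pmemblk run
ndctl create-namespace --mode=fsdax      # produces e.g. /dev/pmem0 (older ndctl calls this --mode=memory)
mkfs.ext4 /dev/pmem0
mkdir -p /pmem0
mount -o dax /dev/pmem0 /pmem0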