Anton,

Did you see Elliot's suggestions (i.e. point 4) in his previous email
(http://www.spinics.net/lists/fio/msg06561.html) for speeding things up and
how you are at the mercy of glibc?

On 28 November 2017 at 14:25, Gavriliuk, Anton (HPS Ukraine)
<anton.gavriliuk@xxxxxxx> wrote:
> Ok, I have fixed that now. It seems the order of the keys matters. Now I have the next issue :-)
>
> I am trying to speed up the customer's script.
[snip]
>      lat (usec): min=1, max=77, avg=10.57, stdev= 6.01
>     clat percentiles (usec):
>      |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
>      | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
>      | 70.00th=[    0], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
>      | 99.00th=[    1], 99.50th=[    1], 99.90th=[    1], 99.95th=[    1],
>      | 99.99th=[    1]
>     bw (KB /s): min=38528, max=118396, per=6.36%, avg=103720.16, stdev=8779.81
>   write: io=4559.9GB, bw=30234MB/s, iops=594518, runt=154436msec
>     slat (usec): min=1, max=787, avg=16.46, stdev=10.00
>     clat (usec): min=0, max=680, avg= 0.20, stdev= 0.41
>      lat (usec): min=1, max=789, avg=16.69, stdev=10.00
>     clat percentiles (usec):
>      |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
>      | 30.00th=[    0], 40.00th=[    0], 50.00th=[    0], 60.00th=[    0],
>      | 70.00th=[    0], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
>      | 99.00th=[    1], 99.50th=[    1], 99.90th=[    1], 99.95th=[    1],
>      | 99.99th=[    1]
>     bw (MB /s): min=  731, max= 2015, per=6.36%, avg=1922.69, stdev=146.41
>     lat (usec) : 2=100.00%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
>     lat (usec) : 250=0.01%, 750=0.01%
>   cpu          : usr=35.61%, sys=64.37%, ctx=2509, majf=0, minf=930
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=4836244/w=91815119/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   READ: io=245977MB, aggrb=1592.8MB/s, minb=1592.8MB/s, maxb=1592.8MB/s, mint=154436msec, maxt=154436msec
>   WRITE: io=4559.9GB, aggrb=30234MB/s, minb=30234MB/s, maxb=30234MB/s, mint=154436msec, maxt=154436msec
>
> Now I want to keep the customer's I/O pattern but switch to DAX access (the /dev/dax0.0 raw device) with the mmap engine, trying to bypass the OS block device and system call layers. So I modified the script:
>
> dl560g10spmem01:~/FIO-PMEM_TestScripts/DAX # /usr/bin/fio --name=4-rand-rw-3xx --ioengine=mmap --iodepth=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --bssplit=4k/4:8k/7:16k/7:32k/15:64k/65:128k/1:256k/1 --rwmixread=5 --size=290g --numjobs=16 --group_reporting --runtime=120 --filename=/dev/dax0.0
> 4-rand-rw-3xx: (g=0): rw=randrw, bs=4K-256K/4K-256K/4K-256K, ioengine=mmap, iodepth=1
> ...
> fio-2.12
> Starting 16 processes
> Jobs: 12 (f=12): [m(3),_(2),m(1),_(1),m(6),_(1),m(2)] [78.6% done] [1304MB/24907MB/0KB /s] [25.7K/490K/0 iops] [eta 00m:33s]
> 4-rand-rw-3xx: (groupid=0, jobs=16): err= 0: pid=65032: Tue Nov 28 06:58:09 2017
>   read : io=202831MB, bw=1690.3MB/s, iops=33231, runt=120001msec
>     clat (usec): min=0, max=211, avg= 8.88, stdev= 5.23
>      lat (usec): min=0, max=211, avg= 8.92, stdev= 5.23
>     clat percentiles (usec):
>      |  1.00th=[    1],  5.00th=[    1], 10.00th=[    2], 20.00th=[    4],
>      | 30.00th=[    6], 40.00th=[    8], 50.00th=[   10], 60.00th=[   11],
>      | 70.00th=[   11], 80.00th=[   12], 90.00th=[   13], 95.00th=[   14],
>      | 99.00th=[   29], 99.50th=[   43], 99.90th=[   48], 99.95th=[   49],
>      | 99.99th=[   53]
>     bw (KB /s): min=86408, max=159184, per=6.36%, avg=110086.50, stdev=19409.13
>   write: io=3764.3GB, bw=32121MB/s, iops=631607, runt=120001msec
>     clat (usec): min=0, max=744, avg=15.16, stdev=10.09
>      lat (usec): min=0, max=744, avg=15.21, stdev=10.09
>     clat percentiles (usec):
>      |  1.00th=[    1],  5.00th=[    2], 10.00th=[    3], 20.00th=[    6],
>      | 30.00th=[   11], 40.00th=[   12], 50.00th=[   12], 60.00th=[   20],
>      | 70.00th=[   21], 80.00th=[   22], 90.00th=[   23], 95.00th=[   24],
>      | 99.00th=[   46], 99.50th=[   86], 99.90th=[   91], 99.95th=[   92],
>      | 99.99th=[   96]
>     bw (MB /s): min= 1760, max= 2737, per=6.36%, avg=2043.01, stdev=353.05
>     lat (usec) : 2=4.63%, 4=8.45%, 10=11.27%, 20=32.41%, 50=42.58%
>     lat (usec) : 100=0.66%, 250=0.01%, 500=0.01%, 750=0.01%
>   cpu          : usr=99.74%, sys=0.24%, ctx=2791, majf=0, minf=2822710
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=3987782/w=75793478/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   READ: io=202831MB, aggrb=1690.3MB/s, minb=1690.3MB/s, maxb=1690.3MB/s, mint=120001msec, maxt=120001msec
>   WRITE: io=3764.3GB, aggrb=32121MB/s, minb=32121MB/s, maxb=32121MB/s, mint=120001msec, maxt=120001msec
>
> No big improvement there: lat 16.69 us vs 15.21 us, write bw 30234 MB/s vs 32121 MB/s.
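For context on where the time goes with this setup: fio's mmap engine maps the target once and then services every block with a plain memcpy() into or out of that mapping, so against /dev/dax0.0 each "I/O" is essentially a glibc memcpy plus whatever page-fault cost is paid on first touch. Below is a minimal standalone sketch of what each job is effectively doing; the mapping length and block size are arbitrary placeholders, and this is not fio's actual code.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/dev/dax0.0";   /* placeholder target */
    size_t map_len = 1UL << 30;         /* assume a 1 GiB mapping for the sketch;
                                           device dax usually wants this aligned to
                                           the device alignment (often 2 MB) */
    size_t bs = 64 * 1024;              /* one of the bssplit block sizes */

    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* map once up front, the way the mmap engine keeps a long-lived mapping */
    char *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    char *buf = malloc(bs);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 0xa5, bs);

    /* a "write" is just a memcpy into the mapping; the first touch of each
       region is what shows up as dax_dev_pmd_fault in the trace further down */
    off_t off = (off_t)(random() % (long)(map_len / bs)) * (off_t)bs;
    memcpy(base + off, buf, bs);

    /* a "read" is a memcpy back out */
    memcpy(buf, base + off, bs);

    munmap(base, map_len);
    free(buf);
    close(fd);
    return 0;
}

Under this model almost all CPU time should land in libc's memcpy, which matches the ~99% usr CPU above and the __memcpy_avx_unaligned samples in the trace below.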
>
> Looking at the performance traces:
>
> PID 65043  /usr/bin/fio
>
> HARDCLOCK entries
>    Count     Pct  State  Function
>     2210  66.29%  USER   __memcpy_avx_unaligned  [/lib64/libc-2.22.so]
>     1122  33.65%  USER   UNKNOWN
>        2   0.06%  SYS    pagerange_is_ram_callback
>
>    Count     Pct  HARDCLOCK Stack trace
>    ============================================================
>        2   0.06%  pagerange_is_ram_callback  walk_system_ram_range  pat_pagerange_is_ram  lookup_memtype  track_pfn_insert  vmf_insert_pfn_pmd  dax_dev_pmd_fault  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  unknown  |  __memcpy_avx_unaligned
>
> 25.620925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
> 25.644924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
> 25.656923 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
> 25.680925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
> 25.692924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
> 25.704925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
> 25.716925 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
> 25.728923 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
> 25.764924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
> 25.776924 cpu=51 pid=65043 tgid=65043 hardclock state=USER [libc-2.22.so]:__memcpy_avx_unaligned+0x2b6
>
> It seems there is a lot of unaligned memory access, any ideas?
>
> You could also look at the source code for fio to see what it's doing or put in some printf() statements so you can understand what the actual alignment is.
>
> Anton

--
Sitsofe | http://sucs.org/~sits/
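As a rough sketch of the printf() route suggested above: something like the helper below, called from wherever the copy actually happens, would show how the buffer addresses and offsets line up. The function name and call site are illustrative only, not anything that exists in fio.

#include <stdint.h>
#include <stdio.h>

/* Print how a buffer address and file offset line up against common
 * boundaries.  The idea is to call this (or just the printf itself) right
 * before the engine's memcpy; the name and call site here are made up. */
static void report_alignment(const char *tag, const void *buf,
                             unsigned long long off, unsigned long long len)
{
    uintptr_t a = (uintptr_t)buf;

    printf("%s: buf=%p len=%llu buf%%64=%lu off=%llu off%%64=%llu off%%4096=%llu\n",
           tag, (void *)a, len, (unsigned long)(a % 64),
           off, off % 64, off % 4096);
}

int main(void)
{
    /* quick self-test with a deliberately misaligned pointer */
    static char blob[256];

    report_alignment("demo", blob + 3, 4096 + 5, 64 * 1024);
    return 0;
}

One caveat worth keeping in mind: __memcpy_avx_unaligned is simply the variant glibc's IFUNC resolver selects for this CPU, and the "unaligned" in its name refers to the unaligned vector loads/stores the implementation uses, so its presence in the profile does not by itself prove the source or destination addresses are misaligned; printing them as above is a way to check.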