Elliott, Robert (Persistent Memory) wrote on 03/03/2016 01:03 PM:
>> -----Original Message-----
>> From: Vladislav Bolkhovitin [mailto:vst@xxxxxxxx]
>> Sent: Wednesday, March 2, 2016 9:03 PM
>> To: Elliott, Robert (Persistent Memory) <elliott@xxxxxxx>; Sitsofe Wheeler <sitsofe@xxxxxxxxx>; fio@xxxxxxxxxxxxxxx
>> Subject: Re: Fio high IOPS measurement mistake
>>
> ...
>>
>> Overall, I appreciate your help, but again, the question is not how to
>> improve my results.
>> The question is how to _decrease fio overhead_ with libaio, see the
>> subject of this e-mail.
>> It's a very different question.
>>
>> Thanks,
>> Vlad
>
> Here are some example results on one of my test systems with 4.4rc2,
> showing %usr around 19%.
>
> This job file:
> [global]
> direct=1
> ioengine=libaio
> norandommap
> randrepeat=0
> bs=4k
> iodepth=1          # irrelevant for pmem
> runtime=600
> time_based=1
> group_reporting
> thread
> gtod_reduce=1      # reduce=1 except for latency test
> zero_buffers
> cpus_allowed_policy=split
> numjobs=16
>
> [drive_0]
> filename=/dev/pmem0
> cpus_allowed=0-63
> rw=randread
>
> [drive_1]
> filename=/dev/pmem1
> cpus_allowed=0-63
> rw=randread
>
> [drive_2]
> filename=/dev/pmem2
> cpus_allowed=0-63
> rw=randread
>
> [drive_3]
> filename=/dev/pmem3
> cpus_allowed=0-63
> rw=randread
>
> yields about 16M IOPS:
>   read : io=9013.8GB, bw=63505MB/s, iops=16257K, runt=145344msec
>   cpu  : usr=19.04%, sys=80.86%, ctx=79415, majf=0, minf=4521
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued   : total=r=2362899826/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency  : target=0, window=0, percentile=100.00%, depth=1
>
> with mpstat 1 reporting about 19% usr, 81% sys:
> 02:17:13 PM  CPU   %usr %nice   %sys %iowait  %irq %soft %steal %guest %gnice %idle
> 02:17:14 PM  all  19.11  0.00  80.89    0.00  0.00  0.00   0.00   0.00   0.00  0.00
> 02:17:15 PM  all  19.19  0.00  80.81    0.00  0.00  0.00   0.00   0.00   0.00  0.00
> 02:17:16 PM  all  19.27  0.00  80.73    0.00  0.00  0.00   0.00   0.00   0.00  0.00
> 02:17:17 PM  all  19.26  0.00  80.74    0.00  0.00  0.00   0.00   0.00   0.00  0.00
>
> With this test, the thread and zero_buffers options don't matter.
>
> The system has 4 NUMA nodes; restricting cpus_allowed to local CPUs
> for each pmem device raises that to 20M IOPS:
>   read : io=7998.5GB, bw=78461MB/s, iops=20086K, runt=104388msec
>   cpu  : usr=19.55%, sys=56.98%, ctx=43481, majf=0, minf=3956
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued   : total=r=2096751180/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency  : target=0, window=0, percentile=100.00%, depth=1
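(Side note: the per-node binding presumably looks something like the job
sections below. The CPU ranges are only illustrative, since the actual
node-to-CPU mapping on that box would come from lscpu or numactl --hardware.)

[drive_0]
filename=/dev/pmem0
cpus_allowed=0-15     # CPUs of the NUMA node pmem0 is attached to (illustrative)
rw=randread

[drive_1]
filename=/dev/pmem1
cpus_allowed=16-31    # illustrative
rw=randread

(and likewise 32-47 and 48-63 for drive_2 and drive_3, if the nodes map that way)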
> perf top --dsos fio:
>    3.00%  [.] get_io_u
>    2.22%  [.] get_next_rand_offset
>    2.15%  [.] thread_main
>    2.11%  [.] io_u_queued_complete
>    1.64%  [.] td_io_queue
>    1.44%  [.] __get_io_u
>    1.40%  [.] io_completed
>    1.17%  [.] fio_libaio_commit
>    0.93%  [.] fio_libaio_prep
>    0.84%  [.] utime_since_now
>    0.74%  [.] wait_for_completions
>    0.67%  [.] fio_libaio_queue
>    0.60%  [.] fio_libaio_getevents
>    0.54%  [.] td_io_getevents
>
> perf top -g:
> +  67.45%  0.45%  [kernel]         [k] entry_SYSCALL_64_fastpath
> +  63.61%  0.68%  libaio.so.1.0.1  [.] io_submit
> +  61.08%  0.10%  [kernel]         [k] sys_io_submit
> +  59.96%  1.55%  [kernel]         [k] do_io_submit
> +  52.82%  0.68%  [kernel]         [k] aio_run_iocb
> +  42.85%  0.36%  [kernel]         [k] blkdev_read_iter
> +  42.20%  0.88%  [kernel]         [k] generic_file_read_iter
> +  40.96%  0.49%  [kernel]         [k] blkdev_direct_IO
> +  40.20%  2.70%  [kernel]         [k] dax_do_io
> +  35.93% 35.93%  [kernel]         [k] copy_user_enhanced_fast_string
> +   6.09%  2.79%  [kernel]         [k] aio_complete
> +   5.55%  0.43%  [kernel]         [k] sys_io_getevents
> +   5.38%  0.00%  [unknown]        [.] 0x0684000241000684
> +   4.09%  0.35%  [kernel]         [k] read_events
> +   3.01%  0.00%  [unknown]        [.] 0000000000000000
> +   2.98%  0.62%  [kernel]         [k] rw_verify_area
> +   2.95%  2.93%  fio              [.] get_io_u
> +   2.67%  0.01%  perf             [.] hist_entry_iter__add
> +   2.42%  1.88%  [kernel]         [k] aio_read_events
> +   2.20%  0.36%  [kernel]         [k] security_file_permission
> +   2.13%  2.11%  fio              [.] thread_main
> +   2.09%  2.08%  fio              [.] get_next_rand_offset
> +   2.01%  1.99%  fio              [.] io_u_queued_complete
> +   1.96%  0.00%  libaio.so.1.0.1  [.] 0xffff80df612af644
> +   1.66%  1.66%  [kernel]         [k] lookup_ioctx
> +   1.51%  0.23%  [kernel]         [k] dax_map_atomic
> +   1.49%  1.49%  [kernel]         [k] entry_SYSCALL_64_after_swapgs
> +   1.49%  1.48%  fio              [.] td_io_queue
> +   1.46%  1.46%  [kernel]         [k] __fget
> +   1.39%  1.38%  fio              [.] io_completed
> +   1.36%  1.35%  fio              [.] __get_io_u
> +   1.34%  1.34%  [kernel]         [k] entry_SYSCALL_64
> +   1.33%  0.08%  [kernel]         [k] fget
> +   1.14%  1.13%  fio              [.] fio_libaio_commit
> +   1.12%  0.99%  [kernel]         [k] selinux_file_permission
> +   1.03%  1.03%  [kernel]         [k] kmem_cache_alloc
> +   0.94%  0.54%  [kernel]         [k] bdev_direct_access
> +   0.91%  0.14%  [kernel]         [k] kiocb_free
> +   0.90%  0.89%  fio              [.] fio_libaio_prep
> +   0.88%  0.28%  [kernel]         [k] refill_reqs_available
> +   0.86%  0.85%  fio              [.] utime_since_now
> +   0.79%  0.79%  [kernel]         [k] get_reqs_available
> +   0.79%  0.79%  [kernel]         [k] kmem_cache_free

Thank you. You are proving my point and confirming my concerns. Your per-job
IOPS (~1M) and user-space CPU consumption (20%) are similar to mine (640K and
25%, respectively) and far from the maximum possible IOPS (16M), so the fio
(or libaio?) overhead is fully visible in your test as well.

The difference between your results and mine might be explained by the fact
that you are running the latest development kernel, while I'm on SLES 12 SP1,
which, as you can imagine, is far behind the latest development.

Moreover, what is your PMEM? If it is regular DDR4, then depending on how many
DIMMs you have (I guess 4+ per NUMA node to populate all memory channels?), it
should be capable of considerably more than 16M IOPS overall and 1M per thread
(much more, for sure). So it smells to me that fio's instrumental error plays
a significant role in your measurements, making your results significantly
lower than what the real hardware and I/O stack are capable of: you are
pushing fio into a range where its accuracy drops significantly. I'd bet that
with an ideal benchmarking tool with zero overhead your results would be
significantly higher.

Actually, we are seeing this with our SCST (Linux SCSI target) tests, where
with multiple initiators we sometimes get better performance than with fio run
locally. Until now I have never had time to look at it more closely. Looks
like I have an explanation now.

Thanks,
Vlad
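P.S. To get a feel for how much of this is libaio/syscall cost rather than
fio's own bookkeeping, a minimal standalone loop along the lines of the sketch
below might be worth comparing against. It is only a sketch: the device path,
block size, and iteration count are placeholders, most error handling is
omitted, and it only roughly mirrors what fio's libaio engine ends up doing at
iodepth=1 (one io_submit() plus one io_getevents() per 4k read).

/* aio_floor.c - minimal libaio read loop at iodepth=1 (sketch only).
 * Build: gcc -O2 -o aio_floor aio_floor.c -laio
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *dev = "/dev/pmem0";   /* placeholder device */
        const size_t bs = 4096;
        const long nr_ios = 1000000;      /* placeholder iteration count */

        int fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        void *buf;
        if (posix_memalign(&buf, 4096, bs)) {
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
        }

        io_context_t ctx;
        memset(&ctx, 0, sizeof(ctx));
        if (io_setup(1, &ctx) < 0) {
                fprintf(stderr, "io_setup failed\n");
                return 1;
        }

        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;

        for (long i = 0; i < nr_ios; i++) {
                /* Fixed offset 0; random offset generation is left out,
                 * since the point is only the submit/reap syscall cost. */
                io_prep_pread(&cb, fd, buf, bs, 0);

                if (io_submit(ctx, 1, cbs) != 1) {
                        fprintf(stderr, "io_submit failed\n");
                        break;
                }

                /* On the DAX path shown in the perf trace above, the copy
                 * already happened inside io_submit(), but a second syscall
                 * is still paid here on every I/O at iodepth=1. */
                if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) {
                        fprintf(stderr, "io_getevents failed\n");
                        break;
                }
        }

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
}

Running it under time(1) or perf stat and dividing by nr_ios gives a rough
per-IO cost for the libaio/kernel path alone, which can then be set against
what fio reports.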