On Mon, Feb 01, 2021 at 11:52:48AM +0900, Changheun Lee wrote:
> > On Fri, Jan 29, 2021 at 12:49:08PM +0900, Changheun Lee wrote:
> > > bio size can grow up to 4GB when multi-page bvec is enabled,
> > > but sometimes it leads to inefficient behavior.
> > > In case of a large-chunk direct I/O - a 32MB chunk read in user space -
> > > all pages for the 32MB are merged into one bio structure if the pages'
> > > physical addresses are contiguous. This delays the first submit
> > > until the merge completes. bio max size should be limited to a proper size.
> > >
> > > When a 32MB chunk read with the direct I/O option comes from userspace,
> > > current kernel behavior is the timeline below.
> > >
> > >  | bio merge for 32MB. total 8,192 pages are merged.
> > >  | total elapsed time is over 2ms.
> > >  |------------------ ... ----------------------->|
> > >                                                   | 8,192 pages merged into a bio.
> > >                                                   | at this time, first bio submit is done.
> > >                                                   | 1 bio is split into 32 read requests and issued.
> > >                                                   |--------------->
> > >                                                   |--------------->
> > >                                                   |--------------->
> > >                                                   ......
> > >                                                   |--------------->
> > >                                                   |--------------->|
> > >                 total 19ms elapsed to complete the 32MB read from the device. |
> > >
> > > If bio max size is limited to 1MB, behavior changes as below.
> > >
> > >  | bio merge for 1MB. 256 pages are merged for each bio.
> > >  | total 32 bios will be made.
> > >  | total elapsed time is over 2ms. it's the same.
> > >  | but the first bio submit timing is fast: about 100us.
> > >  |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
> > >       | 256 pages merged into a bio.
> > >       | at this time, first bio submit is done.
> > >       | and 1 read request is issued for 1 bio.
> > >       |--------------->
> > >       |--------------->
> > >       |--------------->
> > >       ......
> > >       |--------------->
> > >       |--------------->|
> > >       total 17ms elapsed to complete the 32MB read from the device. |
> >
> > Can you share with us whether enabling THP in your application avoids this
> > issue? BTW, you need to make the 32MB buffer aligned with the huge page size.
> > IMO, THP fits your case perfectly.
>
> THP is enabled already, as below, in my environment. It has no effect.
>
> cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never

The 32MB user buffer needs to be huge page size aligned. If your system
supports bcc/bpftrace, it is quite easy to check if the buffer is aligned.

> This issue was reported from a performance benchmark application in the open
> market. I can't control how applications in the open market work.
> It's not only my own case. This issue might occur in many mobile environments.
> At least, I checked this problem on Exynos and Qualcomm chipsets.

You just said it takes 2ms to build the 32MB bio, but you never investigated
the reason. I guess it is from get_user_pages_fast(), but maybe others.

Can you dig further for the reason? Maybe it is an arm64-specific issue.

BTW, bio_iov_iter_get_pages() takes just ~200us on one x86_64 VM with THP,
observed via bcc/funclatency when running the following workload:

[root@ktest-01 test]# cat fio.job
[global]
bs=32768k
rw=randread
iodepth=1
ioengine=psync
direct=1
runtime=20
time_based
group_reporting=0
ramp_time=5

[diotest]
filename=/dev/sde

[root@ktest-01 func]# /usr/share/bcc/tools/funclatency bio_iov_iter_get_pages
Tracing 1 functions for "bio_iov_iter_get_pages"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 1842     |****************************************|
    262144 -> 524287     : 125      |**                                      |
    524288 -> 1048575    : 6        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 1        |                                        |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 1        |                                        |

Detaching...

-- 
Ming
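
For reference, one minimal bpftrace sketch of the alignment check suggested
above. It assumes the benchmark issues its direct reads via the plain read(2)
syscall, that its process name is "bench_app" (a placeholder to substitute
with the real comm), and that the huge page size is 2MB (the usual PMD size
with 4K base pages; other base page sizes differ):

    bpftrace -e '
        // count read() calls by whether the user buffer is 2MB-aligned
        tracepoint:syscalls:sys_enter_read
        /comm == "bench_app"/
        {
            @buf_alignment[(uint64)args->buf & 0x1fffff ? "unaligned" : "2MB-aligned"] = count();
        }'

If the buffers show up as unaligned, THP cannot map the whole 32MB buffer with
huge pages regardless of the transparent_hugepage setting. To dig into where
the ~2ms bio build time goes, the same funclatency invocation shown above can
also be pointed at get_user_pages_fast (assuming that symbol is kprobe-able on
the target kernel).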