On Mon, 2023-11-06 at 09:33 +0800, Ed Tsai wrote: > On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote: > > On Sat, Nov 04, 2023 at 01:11:02AM +0000, Ed Tsai (蔡宗軒) wrote: > > > On Sat, 2023-11-04 at 00:20 +0800, Ming Lei wrote: > > > > On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote: > > > > > On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@xxxxxxxxxxxx > > > > > wrote: > > > > > > From: Ed Tsai <ed.tsai@xxxxxxxxxxxx> > > > > > > > > > > > > Referring to commit 07173c3ec276 ("block: enable multipage > > > > > > > > bvecs"), > > > > > > each bio_vec now holds more than one page, potentially > > > > exceeding > > > > > > 1MB in size and causing alignment issues with the queue > > > > limit. > > > > > > > > > > > > In a sequential read/write scenario, the file system > > > > maximizes > > > > the > > > > > > bio's capacity before submitting. However, misalignment > > > > > > with > > > > the > > > > > > queue limit can result in the bio being split into smaller > > > > I/O > > > > > > operations. > > > > > > > > > > > > For instance, assuming the maximum I/O size is set to 512KB > > > > and > > > > the > > > > > > memory is highly fragmented, resulting in each bio > > > > > > containing > > > > > > > > only > > > > > > one 2-pages bio_vec (i.e., bi_size = 1028KB). This would > > > > cause > > > > the > > > > > > bio to be split into two 512KB portions and one 4KB > > > > > > portion. > > > > As a > > > > > > result, the originally expected continuous large I/O > > > > operations > > > > are > > > > > > interspersed with many small I/O operations. > > > > > > > > > > > > To address this issue, this patch adds a check for the > > > > > > > > max_sectors > > > > > > before submitting the bio. This allows the upper layers to > > > > > > proactively > > > > > > detect and handle alignment issues. > > > > > > > > > > > > I performed the Antutu V10 Storage Test on a UFS 4.0 > > > > > > device, > > > > > > > > which > > > > > > resulted in a significant improvement in the Sequential > > > > > > test: > > > > > > > > > > > > Sequential Read (average of 5 rounds): > > > > > > Original: 3033.7 MB/sec > > > > > > Patched: 3520.9 MB/sec > > > > > > > > > > > > Sequential Write (average of 5 rounds): > > > > > > Original: 2225.4 MB/sec > > > > > > Patched: 2800.3 MB/sec > > > > > > > > > > > > Signed-off-by: Ed Tsai <ed.tsai@xxxxxxxxxxxx> > > > > > > --- > > > > > > block/bio.c | 6 ++++++ > > > > > > 1 file changed, 6 insertions(+) > > > > > > > > > > > > diff --git a/block/bio.c b/block/bio.c > > > > > > index 816d412c06e9..a4a1f775b9ea 100644 > > > > > > --- a/block/bio.c > > > > > > +++ b/block/bio.c > > > > > > @@ -1227,6 +1227,7 @@ static int > > > > __bio_iov_iter_get_pages(struct > > > > bio > > > > > > *bio, struct iov_iter *iter) > > > > > > iov_iter_extraction_t extraction_flags = 0; > > > > > > unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt; > > > > > > unsigned short entries_left = bio->bi_max_vecs - bio- > > > > > > bi_vcnt; > > > > > > +struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)- > > > > > > > limits; > > > > > > > > > > > > struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt; > > > > > > struct page **pages = (struct page **)bv; > > > > > > ssize_t size, left; > > > > > > @@ -1275,6 +1276,11 @@ static int > > > > __bio_iov_iter_get_pages(struct > > > > bio > > > > > > *bio, struct iov_iter *iter) > > > > > > struct page *page = pages[i]; > > > > > > > > > > > > len = min_t(size_t, PAGE_SIZE - offset, left); > > > > > > +if (bio->bi_iter.bi_size + len > > > > > > > + lim->max_sectors << SECTOR_SHIFT) { > > > > > > +ret = left; > > > > > > +break; > > > > > > +} > > > > > > if (bio_op(bio) == REQ_OP_ZONE_APPEND) { > > > > > > ret = bio_iov_add_zone_append_page(bio, page, > > > > > > len, > > > > > > offset); > > > > > > -- > > > > > > 2.18.0 > > > > > > > > > > > > > > > > > > > > > Hi Jens, > > > > > > > > > > Just to clarify any potential confusion, I would like to > > > > provide > > > > > further details based on the assumed scenario mentioned > > > > > above. > > > > > > > > > > When the upper layer continuously sends 1028KB full-sized > > > > > bios > > > > for > > > > > sequential reads, the Block Layer sees the following > > > > > sequence: > > > > > submit bio: size = 1028KB, start LBA = n > > > > > submit bio: size = 1028KB, start LBA = n + 1028KB > > > > > submit bio: size = 1028KB, start LBA = n + 2056KB > > > > > ... > > > > > > > > > > However, due to the queue limit restricting the I/O size to a > > > > > > > > maximum > > > > > of 512KB, the Block Layer splits into the following sequence: > > > > > submit bio: size = 512KB, start LBA = n > > > > > submit bio: size = 512KB, start LBA = n + 512KB > > > > > submit bio: size = 4KB, start LBA = n + 1024KB > > > > > submit bio: size = 512KB, start LBA = n + 1028KB > > > > > submit bio: size = 512KB, start LBA = n + 1540KB > > > > > submit bio: size = 4KB, start LBA = n + 2052KB > > > > > submit bio: size = 512KB, start LBA = n + 2056KB > > > > > submit bio: size = 512KB, start LBA = n + 2568KB > > > > > submit bio: size = 4KB, start LBA = n + 3080KB > > > > > ... > > > > > > > > > > The original expectation was for the storage to receive > > > > > large, > > > > > contiguous requests. However, due to non-alignment, many > > > > > small > > > > I/O > > > > > requests are generated. This problem is easily visible > > > > > because > > > > the > > > > > user pages passed in are often allocated by the buddy system > > > > > as > > > > > > > > order 0 > > > > > pages during page faults, resulting in highly non-contiguous > > > > > > > > memory. > > > > > > > > If order 0 page is added to bio, the multipage bvec becomes nop > > > > basically(256bvec holds 256 pages), then how can it make a > > > > difference > > > > for you? > > > > > > > > > > > > > > > > > > > > > > > As observed in the Antutu Sequential Read test below, it is > > > > similar > > > > to > > > > > the description above where the splitting caused by the queue > > > > limit > > > > > leaves small requests sandwiched in between: > > > > > > > > > > block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51] > > > > > block_split: 8,32 R 86925864 / 86926888 [Thread-51] > > > > > block_split: 8,32 R 86926888 / 86927912 [Thread-51] > > > > > block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51] > > > > > block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51] > > > > > block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51] > > > > > block_split: 8,32 R 86928008 / 86929032 [Thread-51] > > > > > block_split: 8,32 R 86929032 / 86930056 [Thread-51] > > > > > block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51] > > > > > block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51] > > > > > block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51] > > > > > block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51] > > > > > block_split: 8,32 R 86930152 / 86931176 [Thread-51] > > > > > block_split: 8,32 R 86931176 / 86932200 [Thread-51] > > > > > block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51] > > > > > block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51] > > > > > block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51] > > > > > block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51] > > > > > block_split: 8,32 R 86932264 / 86933288 [Thread-51] > > > > > block_split: 8,32 R 86933288 / 86934312 [Thread-51] > > > > > block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51] > > > > > block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51] > > > > > block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51] > > > > > > > > > > I simply prevents non-aligned situations in > > > > bio_iov_iter_get_pages. > > > > > > > > But there is still 4KB IO left if you limit max bio size is > > > > 512KB, > > > > then how does this 4KB IO finally go in case of 1028KB IO? > > > > > > > > > Besides making the upper layer application aware of the queue > > > > > > > > limit, I > > > > > would appreciate any other directions or suggestions you may > > > > have. > > > > > > > > The problem is related with IO size from application. > > > > > > > > If you send unaligned IO, you can't avoid the last IO with > > > > small > > > > size, no > > > > matter if block layer bio split is involved or not. Your patch > > > > just > > > > lets > > > > __bio_iov_iter_get_pages split the bio, and you still have 4KB > > > > left > > > > finally when application submits 1028KB, right? > > > > > > > > Then I don't understand why your patch improves sequential IO > > > > performance. > > > > > > > > Thanks, > > > > Ming > > > > > > > > > > The application performs I/O with a sufficitenly large I/O size, > > > causing it to constantly fill up and submit full bios. However, > > > in > > > > the > > > iomap direct I/O scenario, pages are added to the bio one by one > > > > from > > > the user buffer. This typically triggers page faults, resulting > > > in > > > > the > > > allocation of order 0 pages from the buddy system. > > > > > > The remaining amount of each order in the buddy system varies > > > over > > > time. If there are not enough pages available in a particular > > > > order, > > > pages are split from higher orders. When pages are obtained from > > > > the > > > higher order, the user buffer may contain some small consecutive > > > patterns. > > > > > > In summary, the physical layout of the user buffer is > > > > unpredictable, > > > and when it contains some small consecutive patterns, the size of > > > > the > > > bio becomes randomly unaligned during filling. > > > > Yes, multipage bvec depends on physical layout of user buffer, but > > it > > doesn't affect bio size, which is decided by userspace, and the > > bvec > > page layout doesn't affect IO pattern. > > > > This will result in variable sizes of full bios when filling them, as > the size of the bio_vec depends on the physical layout of the user > buffer. > > > If userspace submits 1028K IO, the IO is split into 512K, 512K and > > 4K, > > no matter if each bvec includes how many pages. > > > > Sorry for the confusion caused by my description of 1028KB. It was > meant to illustrate the scenario of submitting an unaligned bio. In > my > opinion, the least ideal size for a full bio due to physical layout > would be 1028KB, as it would result in an I/O with only 4KB. > > > > If userspace submits very large IO, such as 512M, which will be > > split > > into 1K bios with 512K size. > > > > No. The block layer cannot receive a 512MB bio to split into 512K > size. > It will encounter full bios of various size, because the size of each > bio_vec is based on the physical layout. > > > > You patch doesn't makes any difference actually from block layer > > viewpoint, such as: > > > > 1) dd if=/dev/ublkb0 of=/dev/null bs=1028k count=1 iflag=direct > > > > 2) observe IO pattern by the following bpftrace: > > > > kprobe:blk_mq_start_request > > { > > $rq = (struct request *)arg0; > > > > printf("%lu %16s %d %d: %s %s bio %lx-%lu\n", > > nsecs / 1000, comm, pid, cpu, ksym(reg("ip")), > > $rq->q->disk->disk_name, $rq->__sector, $rq->__data_len); > > } > > > > 3) no matter if your patch is applied or not, the following pattern > > is always observed: > > > > 117828811 dd 1475 0: blk_mq_start_request ublkb0 bio > > 0- > > 524288 > > 117828823 dd 1475 0: blk_mq_start_request ublkb0 bio > > 400-524288 > > 117828825 dd 1475 0: blk_mq_start_request ublkb0 bio > > 800-4096 > > > > Similar pattern can be observed when running dd inside FS(xfs). > > > > > > > > This patch limits the bio to be filled up to the max_sectors. The > > > submission is an async operation, so once the bio is queued, it > > > > will > > > immediately return and continue filled and submit the next bio. > > > > bio split doesn't change this behavior too, the difference is just > > how > > many bios the upper layer(iomap) observed. > > > > Your patch just moves the split into upper layer, and iomap can see > > 3 bios with your patch when `dd bs=1028K`, and in vanilla tree, > > iomap > > just submits single bio with 1028K, block layer splits it into > > 512k, 512k, and 4k. So finally UFS driver observes same IO pattern. > > > > In short, please root cause why your patch improves performance, or > > please share your workloads, so we can observe the IO pattern and > > related mm/fs behavior and try to root cause it. > > > > > > Thanks, > > Ming > > > > Antutu Sequential Read performs 72 reads of 64MB using aio. I > simulated > a 64MB read using dd and observed that the actual bio queue sizes > were > slightly larger than 1MB. Not a single bio with 64MB, but rather > multiple various size of bios. > > dd-5970[001] ..... 758.485446: block_bio_queue: 254,52 R 2933336 + > 2136 > dd-5970[001] ..... 758.487145: block_bio_queue: 254,52 R 2935472 + > 2152 > dd-5970[001] ..... 758.488822: block_bio_queue: 254,52 R 2937624 + > 2128 > dd-5970[001] ..... 758.490478: block_bio_queue: 254,52 R 2939752 + > 2160 > dd-5970[001] ..... 758.492154: block_bio_queue: 254,52 R 2941912 + > 2096 > dd-5970[001] ..... 758.493874: block_bio_queue: 254,52 R 2944008 + > 2160 > Sorry for missing out on my dd command. Here it is: dd if=/data/test_file of=/dev/null bs=64m count=1 iflag=direct Best, Ed