On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote: > On Sat, Nov 04, 2023 at 01:11:02AM +0000, Ed Tsai (蔡宗軒) wrote: > > On Sat, 2023-11-04 at 00:20 +0800, Ming Lei wrote: > > > On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote: > > > > On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@xxxxxxxxxxxx wrote: > > > > > From: Ed Tsai <ed.tsai@xxxxxxxxxxxx> > > > > > > > > > > Referring to commit 07173c3ec276 ("block: enable multipage > > > bvecs"), > > > > > each bio_vec now holds more than one page, potentially > exceeding > > > > > 1MB in size and causing alignment issues with the queue > limit. > > > > > > > > > > In a sequential read/write scenario, the file system > maximizes > > > the > > > > > bio's capacity before submitting. However, misalignment with > the > > > > > queue limit can result in the bio being split into smaller > I/O > > > > > operations. > > > > > > > > > > For instance, assuming the maximum I/O size is set to 512KB > and > > > the > > > > > memory is highly fragmented, resulting in each bio containing > > > only > > > > > one 2-pages bio_vec (i.e., bi_size = 1028KB). This would > cause > > > the > > > > > bio to be split into two 512KB portions and one 4KB portion. > As a > > > > > result, the originally expected continuous large I/O > operations > > > are > > > > > interspersed with many small I/O operations. > > > > > > > > > > To address this issue, this patch adds a check for the > > > max_sectors > > > > > before submitting the bio. This allows the upper layers to > > > > > proactively > > > > > detect and handle alignment issues. > > > > > > > > > > I performed the Antutu V10 Storage Test on a UFS 4.0 device, > > > which > > > > > resulted in a significant improvement in the Sequential test: > > > > > > > > > > Sequential Read (average of 5 rounds): > > > > > Original: 3033.7 MB/sec > > > > > Patched: 3520.9 MB/sec > > > > > > > > > > Sequential Write (average of 5 rounds): > > > > > Original: 2225.4 MB/sec > > > > > Patched: 2800.3 MB/sec > > > > > > > > > > Signed-off-by: Ed Tsai <ed.tsai@xxxxxxxxxxxx> > > > > > --- > > > > > block/bio.c | 6 ++++++ > > > > > 1 file changed, 6 insertions(+) > > > > > > > > > > diff --git a/block/bio.c b/block/bio.c > > > > > index 816d412c06e9..a4a1f775b9ea 100644 > > > > > --- a/block/bio.c > > > > > +++ b/block/bio.c > > > > > @@ -1227,6 +1227,7 @@ static int > __bio_iov_iter_get_pages(struct > > > bio > > > > > *bio, struct iov_iter *iter) > > > > > iov_iter_extraction_t extraction_flags = 0; > > > > > unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt; > > > > > unsigned short entries_left = bio->bi_max_vecs - bio- > >bi_vcnt; > > > > > +struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)- > > > > > >limits; > > > > > struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt; > > > > > struct page **pages = (struct page **)bv; > > > > > ssize_t size, left; > > > > > @@ -1275,6 +1276,11 @@ static int > __bio_iov_iter_get_pages(struct > > > bio > > > > > *bio, struct iov_iter *iter) > > > > > struct page *page = pages[i]; > > > > > > > > > > len = min_t(size_t, PAGE_SIZE - offset, left); > > > > > +if (bio->bi_iter.bi_size + len > > > > > > + lim->max_sectors << SECTOR_SHIFT) { > > > > > +ret = left; > > > > > +break; > > > > > +} > > > > > if (bio_op(bio) == REQ_OP_ZONE_APPEND) { > > > > > ret = bio_iov_add_zone_append_page(bio, page, > > > > > len, > > > > > offset); > > > > > -- > > > > > 2.18.0 > > > > > > > > > > > > > > > > > Hi Jens, > > > > > > > > Just to clarify any potential confusion, I would like to > provide > > > > further details based on the assumed scenario mentioned above. > > > > > > > > When the upper layer continuously sends 1028KB full-sized bios > for > > > > sequential reads, the Block Layer sees the following sequence: > > > > submit bio: size = 1028KB, start LBA = n > > > > submit bio: size = 1028KB, start LBA = n + 1028KB > > > > submit bio: size = 1028KB, start LBA = n + 2056KB > > > > ... > > > > > > > > However, due to the queue limit restricting the I/O size to a > > > maximum > > > > of 512KB, the Block Layer splits into the following sequence: > > > > submit bio: size = 512KB, start LBA = n > > > > submit bio: size = 512KB, start LBA = n + 512KB > > > > submit bio: size = 4KB, start LBA = n + 1024KB > > > > submit bio: size = 512KB, start LBA = n + 1028KB > > > > submit bio: size = 512KB, start LBA = n + 1540KB > > > > submit bio: size = 4KB, start LBA = n + 2052KB > > > > submit bio: size = 512KB, start LBA = n + 2056KB > > > > submit bio: size = 512KB, start LBA = n + 2568KB > > > > submit bio: size = 4KB, start LBA = n + 3080KB > > > > ... > > > > > > > > The original expectation was for the storage to receive large, > > > > contiguous requests. However, due to non-alignment, many small > I/O > > > > requests are generated. This problem is easily visible because > the > > > > user pages passed in are often allocated by the buddy system as > > > order 0 > > > > pages during page faults, resulting in highly non-contiguous > > > memory. > > > > > > If order 0 page is added to bio, the multipage bvec becomes nop > > > basically(256bvec holds 256 pages), then how can it make a > difference > > > for you? > > > > > > > > > > > > > > > > > As observed in the Antutu Sequential Read test below, it is > similar > > > to > > > > the description above where the splitting caused by the queue > limit > > > > leaves small requests sandwiched in between: > > > > > > > > block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51] > > > > block_split: 8,32 R 86925864 / 86926888 [Thread-51] > > > > block_split: 8,32 R 86926888 / 86927912 [Thread-51] > > > > block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51] > > > > block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51] > > > > block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51] > > > > block_split: 8,32 R 86928008 / 86929032 [Thread-51] > > > > block_split: 8,32 R 86929032 / 86930056 [Thread-51] > > > > block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51] > > > > block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51] > > > > block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51] > > > > block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51] > > > > block_split: 8,32 R 86930152 / 86931176 [Thread-51] > > > > block_split: 8,32 R 86931176 / 86932200 [Thread-51] > > > > block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51] > > > > block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51] > > > > block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51] > > > > block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51] > > > > block_split: 8,32 R 86932264 / 86933288 [Thread-51] > > > > block_split: 8,32 R 86933288 / 86934312 [Thread-51] > > > > block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51] > > > > block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51] > > > > block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51] > > > > > > > > I simply prevents non-aligned situations in > bio_iov_iter_get_pages. > > > > > > But there is still 4KB IO left if you limit max bio size is > 512KB, > > > then how does this 4KB IO finally go in case of 1028KB IO? > > > > > > > Besides making the upper layer application aware of the queue > > > limit, I > > > > would appreciate any other directions or suggestions you may > have. > > > > > > The problem is related with IO size from application. > > > > > > If you send unaligned IO, you can't avoid the last IO with small > > > size, no > > > matter if block layer bio split is involved or not. Your patch > just > > > lets > > > __bio_iov_iter_get_pages split the bio, and you still have 4KB > left > > > finally when application submits 1028KB, right? > > > > > > Then I don't understand why your patch improves sequential IO > > > performance. > > > > > > Thanks, > > > Ming > > > > > > > The application performs I/O with a sufficitenly large I/O size, > > causing it to constantly fill up and submit full bios. However, in > the > > iomap direct I/O scenario, pages are added to the bio one by one > from > > the user buffer. This typically triggers page faults, resulting in > the > > allocation of order 0 pages from the buddy system. > > > > The remaining amount of each order in the buddy system varies over > > time. If there are not enough pages available in a particular > order, > > pages are split from higher orders. When pages are obtained from > the > > higher order, the user buffer may contain some small consecutive > > patterns. > > > > In summary, the physical layout of the user buffer is > unpredictable, > > and when it contains some small consecutive patterns, the size of > the > > bio becomes randomly unaligned during filling. > > Yes, multipage bvec depends on physical layout of user buffer, but it > doesn't affect bio size, which is decided by userspace, and the bvec > page layout doesn't affect IO pattern. > This will result in variable sizes of full bios when filling them, as the size of the bio_vec depends on the physical layout of the user buffer. > If userspace submits 1028K IO, the IO is split into 512K, 512K and > 4K, > no matter if each bvec includes how many pages. > Sorry for the confusion caused by my description of 1028KB. It was meant to illustrate the scenario of submitting an unaligned bio. In my opinion, the least ideal size for a full bio due to physical layout would be 1028KB, as it would result in an I/O with only 4KB. > If userspace submits very large IO, such as 512M, which will be split > into 1K bios with 512K size. > No. The block layer cannot receive a 512MB bio to split into 512K size. It will encounter full bios of various size, because the size of each bio_vec is based on the physical layout. > You patch doesn't makes any difference actually from block layer > viewpoint, such as: > > 1) dd if=/dev/ublkb0 of=/dev/null bs=1028k count=1 iflag=direct > > 2) observe IO pattern by the following bpftrace: > > kprobe:blk_mq_start_request > { > $rq = (struct request *)arg0; > > printf("%lu %16s %d %d: %s %s bio %lx-%lu\n", > nsecs / 1000, comm, pid, cpu, ksym(reg("ip")), > $rq->q->disk->disk_name, $rq->__sector, $rq->__data_len); > } > > 3) no matter if your patch is applied or not, the following pattern > is always observed: > > 117828811 dd 1475 0: blk_mq_start_request ublkb0 bio 0- > 524288 > 117828823 dd 1475 0: blk_mq_start_request ublkb0 bio > 400-524288 > 117828825 dd 1475 0: blk_mq_start_request ublkb0 bio > 800-4096 > > Similar pattern can be observed when running dd inside FS(xfs). > > > > > This patch limits the bio to be filled up to the max_sectors. The > > submission is an async operation, so once the bio is queued, it > will > > immediately return and continue filled and submit the next bio. > > bio split doesn't change this behavior too, the difference is just > how > many bios the upper layer(iomap) observed. > > Your patch just moves the split into upper layer, and iomap can see > 3 bios with your patch when `dd bs=1028K`, and in vanilla tree, iomap > just submits single bio with 1028K, block layer splits it into > 512k, 512k, and 4k. So finally UFS driver observes same IO pattern. > > In short, please root cause why your patch improves performance, or > please share your workloads, so we can observe the IO pattern and > related mm/fs behavior and try to root cause it. > > > Thanks, > Ming > Antutu Sequential Read performs 72 reads of 64MB using aio. I simulated a 64MB read using dd and observed that the actual bio queue sizes were slightly larger than 1MB. Not a single bio with 64MB, but rather multiple various size of bios. dd-5970[001] ..... 758.485446: block_bio_queue: 254,52 R 2933336 + 2136 dd-5970[001] ..... 758.487145: block_bio_queue: 254,52 R 2935472 + 2152 dd-5970[001] ..... 758.488822: block_bio_queue: 254,52 R 2937624 + 2128 dd-5970[001] ..... 758.490478: block_bio_queue: 254,52 R 2939752 + 2160 dd-5970[001] ..... 758.492154: block_bio_queue: 254,52 R 2941912 + 2096 dd-5970[001] ..... 758.493874: block_bio_queue: 254,52 R 2944008 + 2160