Re: [PATCH 1/1] block: Check the queue limit before bio submitting

Ed Tsai (蔡宗軒) <Ed.Tsai@xxxxxxxxxxxx> · Mon, 6 Nov 2023 01:33:53 +0000

On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote:
>  On Sat, Nov 04, 2023 at 01:11:02AM +0000, Ed Tsai (蔡宗軒) wrote:
> > On Sat, 2023-11-04 at 00:20 +0800, Ming Lei wrote:
> > >  On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > > On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@xxxxxxxxxxxx wrote:
> > > > > From: Ed Tsai <ed.tsai@xxxxxxxxxxxx>
> > > > > 
> > > > > Referring to commit 07173c3ec276 ("block: enable multipage
> > > bvecs"),
> > > > > each bio_vec now holds more than one page, potentially
> exceeding
> > > > > 1MB in size and causing alignment issues with the queue
> limit.
> > > > > 
> > > > > In a sequential read/write scenario, the file system
> maximizes
> > > the
> > > > > bio's capacity before submitting. However, misalignment with
> the
> > > > > queue limit can result in the bio being split into smaller
> I/O
> > > > > operations.
> > > > > 
> > > > > For instance, assuming the maximum I/O size is set to 512KB
> and
> > > the
> > > > > memory is highly fragmented, resulting in each bio containing
> > > only
> > > > > one 2-pages bio_vec (i.e., bi_size = 1028KB). This would
> cause
> > > the
> > > > > bio to be split into two 512KB portions and one 4KB portion.
> As a
> > > > > result, the originally expected continuous large I/O
> operations
> > > are
> > > > > interspersed with many small I/O operations.
> > > > > 
> > > > > To address this issue, this patch adds a check for the
> > > max_sectors
> > > > > before submitting the bio. This allows the upper layers to
> > > > > proactively
> > > > > detect and handle alignment issues.
> > > > > 
> > > > > I performed the Antutu V10 Storage Test on a UFS 4.0 device,
> > > which
> > > > > resulted in a significant improvement in the Sequential test:
> > > > > 
> > > > > Sequential Read (average of 5 rounds):
> > > > > Original: 3033.7 MB/sec
> > > > > Patched: 3520.9 MB/sec
> > > > > 
> > > > > Sequential Write (average of 5 rounds):
> > > > > Original: 2225.4 MB/sec
> > > > > Patched: 2800.3 MB/sec
> > > > > 
> > > > > Signed-off-by: Ed Tsai <ed.tsai@xxxxxxxxxxxx>
> > > > > ---
> > > > >  block/bio.c | 6 ++++++
> > > > >  1 file changed, 6 insertions(+)
> > > > > 
> > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > index 816d412c06e9..a4a1f775b9ea 100644
> > > > > --- a/block/bio.c
> > > > > +++ b/block/bio.c
> > > > > @@ -1227,6 +1227,7 @@ static int
> __bio_iov_iter_get_pages(struct
> > > bio
> > > > > *bio, struct iov_iter *iter)
> > > > >  iov_iter_extraction_t extraction_flags = 0;
> > > > >  unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> > > > >  unsigned short entries_left = bio->bi_max_vecs - bio-
> >bi_vcnt;
> > > > > +struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)-
> > > > > >limits;
> > > > >  struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> > > > >  struct page **pages = (struct page **)bv;
> > > > >  ssize_t size, left;
> > > > > @@ -1275,6 +1276,11 @@ static int
> __bio_iov_iter_get_pages(struct
> > > bio
> > > > > *bio, struct iov_iter *iter)
> > > > >  struct page *page = pages[i];
> > > > >  
> > > > >  len = min_t(size_t, PAGE_SIZE - offset, left);
> > > > > +if (bio->bi_iter.bi_size + len >
> > > > > +    lim->max_sectors << SECTOR_SHIFT) {
> > > > > +ret = left;
> > > > > +break;
> > > > > +}
> > > > >  if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> > > > >  ret = bio_iov_add_zone_append_page(bio, page,
> > > > > len,
> > > > >  offset);
> > > > > -- 
> > > > > 2.18.0
> > > > > 
> > > > 
> > > > 
> > > > Hi Jens,
> > > > 
> > > > Just to clarify any potential confusion, I would like to
> provide
> > > > further details based on the assumed scenario mentioned above.
> > > > 
> > > > When the upper layer continuously sends 1028KB full-sized bios
> for
> > > > sequential reads, the Block Layer sees the following sequence:
> > > > submit bio: size = 1028KB, start LBA = n
> > > > submit bio: size = 1028KB, start LBA = n + 1028KB 
> > > > submit bio: size = 1028KB, start LBA = n + 2056KB
> > > > ...
> > > > 
> > > > However, due to the queue limit restricting the I/O size to a
> > > maximum
> > > > of 512KB, the Block Layer splits into the following sequence:
> > > > submit bio: size = 512KB, start LBA = n
> > > > submit bio: size = 512KB, start LBA = n +  512KB
> > > > submit bio: size =   4KB, start LBA = n + 1024KB
> > > > submit bio: size = 512KB, start LBA = n + 1028KB
> > > > submit bio: size = 512KB, start LBA = n + 1540KB
> > > > submit bio: size =   4KB, start LBA = n + 2052KB
> > > > submit bio: size = 512KB, start LBA = n + 2056KB
> > > > submit bio: size = 512KB, start LBA = n + 2568KB
> > > > submit bio: size =   4KB, start LBA = n + 3080KB
> > > > ...
> > > > 
> > > > The original expectation was for the storage to receive large,
> > > > contiguous requests. However, due to non-alignment, many small
> I/O
> > > > requests are generated. This problem is easily visible because
> the
> > > > user pages passed in are often allocated by the buddy system as
> > > order 0
> > > > pages during page faults, resulting in highly non-contiguous
> > > memory.
> > > 
> > > If order 0 page is added to bio, the multipage bvec becomes nop
> > > basically(256bvec holds 256 pages), then how can it make a
> difference
> > > for you?
> > 
> > 
> > 
> > > 
> > > > 
> > > > As observed in the Antutu Sequential Read test below, it is
> similar
> > > to
> > > > the description above where the splitting caused by the queue
> limit
> > > > leaves small requests sandwiched in between:
> > > > 
> > > > block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
> > > > block_split: 8,32 R 86925864 / 86926888 [Thread-51]
> > > > block_split: 8,32 R 86926888 / 86927912 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
> > > > block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
> > > > block_split: 8,32 R 86928008 / 86929032 [Thread-51]
> > > > block_split: 8,32 R 86929032 / 86930056 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
> > > > block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
> > > > block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
> > > > block_split: 8,32 R 86930152 / 86931176 [Thread-51]
> > > > block_split: 8,32 R 86931176 / 86932200 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
> > > > block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
> > > > block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
> > > > block_split: 8,32 R 86932264 / 86933288 [Thread-51]
> > > > block_split: 8,32 R 86933288 / 86934312 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
> > > > block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]
> > > > 
> > > > I simply prevents non-aligned situations in
> bio_iov_iter_get_pages.
> > > 
> > > But there is still 4KB IO left if you limit max bio size is
> 512KB,
> > > then how does this 4KB IO finally go in case of 1028KB IO?
> > > 
> > > > Besides making the upper layer application aware of the queue
> > > limit, I
> > > > would appreciate any other directions or suggestions you may
> have.
> > > 
> > > The problem is related with IO size from application.
> > > 
> > > If you send unaligned IO, you can't avoid the last IO with small
> > > size, no
> > > matter if block layer bio split is involved or not. Your patch
> just
> > > lets
> > > __bio_iov_iter_get_pages split the bio, and you still have 4KB
> left
> > > finally when application submits 1028KB, right?
> > > 
> > > Then I don't understand why your patch improves sequential IO
> > > performance.
> > > 
> > > Thanks,
> > > Ming
> > > 
> > 
> > The application performs I/O with a sufficitenly large I/O size,
> > causing it to constantly fill up and submit full bios. However, in
> the
> > iomap direct I/O scenario, pages are added to the bio one by one
> from
> > the user buffer. This typically triggers page faults, resulting in
> the
> > allocation of order 0 pages from the buddy system.
> > 
> > The remaining amount of each order in the buddy system varies over
> > time. If there are not enough pages available in a particular
> order,
> > pages are split from higher orders. When pages are obtained from
> the
> > higher order, the user buffer may contain some small consecutive
> > patterns.
> > 
> > In summary, the physical layout of the user buffer is
> unpredictable,
> > and when it contains some small consecutive patterns, the size of
> the
> > bio becomes randomly unaligned during filling.
> 
> Yes, multipage bvec depends on physical layout of user buffer, but it
> doesn't affect bio size, which is decided by userspace, and the bvec
> page layout doesn't affect IO pattern.
> 
This will result in variable sizes of full bios when filling them, as
the size of the bio_vec depends on the physical layout of the user
buffer.

> If userspace submits 1028K IO, the IO is split into 512K, 512K and
> 4K,
> no matter if each bvec includes how many pages.
> 
Sorry for the confusion caused by my description of 1028KB. It was
meant to illustrate the scenario of submitting an unaligned bio. In my
opinion, the least ideal size for a full bio due to physical layout
would be 1028KB, as it would result in an I/O with only 4KB.

> If userspace submits very large IO, such as 512M, which will be split
> into 1K bios with 512K size.
> 
No. The block layer cannot receive a 512MB bio to split into 512K size.
It will encounter full bios of various size, because the size of each
bio_vec is based on the physical layout.

> You patch doesn't makes any difference actually from block layer
> viewpoint, such as:
> 
> 1) dd if=/dev/ublkb0 of=/dev/null bs=1028k count=1 iflag=direct
> 
> 2) observe IO pattern by the following bpftrace:
> 
> kprobe:blk_mq_start_request
> {
> $rq = (struct request *)arg0;
> 
> printf("%lu %16s %d %d: %s %s bio %lx-%lu\n",
> nsecs / 1000,  comm, pid, cpu, ksym(reg("ip")),
> $rq->q->disk->disk_name, $rq->__sector, $rq->__data_len);
> }
> 
> 3) no matter if your patch is applied or not, the following pattern
> is always observed:
> 
> 117828811               dd 1475 0: blk_mq_start_request ublkb0 bio 0-
> 524288
> 117828823               dd 1475 0: blk_mq_start_request ublkb0 bio
> 400-524288
> 117828825               dd 1475 0: blk_mq_start_request ublkb0 bio
> 800-4096
> 
> Similar pattern can be observed when running dd inside FS(xfs).
> 
> > 
> > This patch limits the bio to be filled up to the max_sectors. The
> > submission is an async operation, so once the bio is queued, it
> will
> > immediately return and continue filled and submit the next bio.
> 
> bio split doesn't change this behavior too, the difference is just
> how
> many bios the upper layer(iomap) observed.
> 
> Your patch just moves the split into upper layer, and iomap can see
> 3 bios with your patch when `dd bs=1028K`, and in vanilla tree, iomap
> just submits single bio with 1028K, block layer splits it into
> 512k, 512k, and 4k. So finally UFS driver observes same IO pattern.
> 
> In short, please root cause why your patch improves performance, or
> please share your workloads, so we can observe the IO pattern and
> related mm/fs behavior and try to root cause it.
> 
> 
> Thanks,
> Ming
> 

Antutu Sequential Read performs 72 reads of 64MB using aio. I simulated
a 64MB read using dd and observed that the actual bio queue sizes were 
slightly larger than 1MB. Not a single bio with 64MB, but rather
multiple various size of bios.

dd-5970[001] ..... 758.485446: block_bio_queue: 254,52 R 2933336 + 2136
dd-5970[001] ..... 758.487145: block_bio_queue: 254,52 R 2935472 + 2152
dd-5970[001] ..... 758.488822: block_bio_queue: 254,52 R 2937624 + 2128
dd-5970[001] ..... 758.490478: block_bio_queue: 254,52 R 2939752 + 2160
dd-5970[001] ..... 758.492154: block_bio_queue: 254,52 R 2941912 + 2096
dd-5970[001] ..... 758.493874: block_bio_queue: 254,52 R 2944008 + 2160