On Thu, Feb 13, 2025 at 08:00:40PM +0100, Ming Lei wrote:
> PAGE_SIZE is applied in validating block device queue limits, this way is
> very fragile and is wrong:
>
> - queue limits are read from hardware, which is often one readonly hardware
>   property
>
> - PAGE_SIZE is one config option which can be changed during build time.
>
> In RH lab, it has been found that max segment size of some mmc card is
> less than 64K, then this kind of card can't be probed successfully when
> same kernel is re-built with 64K PAGE_SIZE.
>
> Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> with queue limits and checking if bio needn't split as a hint. Define
> BLK_MIN_SEGMENT_SIZE as 4K (minimized PAGE_SIZE).
>
> The following commits are depended for backporting:
>
> commit 6aeb4f836480 ("block: remove bio_add_pc_page")
> commit 02ee5d69e3ba ("block: remove blk_rq_bio_prep")
> commit b7175e24d6ac ("block: add a dma mapping iterator")
>
> Cc: Paul Bunyan <pbunyan@xxxxxxxxxx>
> Cc: Yi Zhang <yi.zhang@xxxxxxxxxx>
> Cc: Luis Chamberlain <mcgrof@xxxxxxxxxx>
> Cc: John Garry <john.g.garry@xxxxxxxxxx>
> Cc: Bart Van Assche <bvanassche@xxxxxxx>
> Cc: Keith Busch <kbusch@xxxxxxxxxx>
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> ---
> V3:
> 	- rephrase commit log & fix patch style (Christoph)
> 	- more comment log (Christoph)
> V2:
> 	- cover bio_split_rw_at()
> 	- add BLK_MIN_SEGMENT_SIZE
>
>  block/blk-merge.c      | 2 +-
>  block/blk-settings.c   | 6 +++---
>  block/blk.h            | 8 ++++++--
>  include/linux/blkdev.h | 2 ++
>  4 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 15cd231d560c..b55c52a42303 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>
>  		if (nsegs < lim->max_segments &&
>  		    bytes + bv.bv_len <= max_bytes &&
> -		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> +		    bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {

In
which cases is this "new" condition (for systems where PAGE_SIZE > BLK_MIN_SEGMENT_SIZE) going to be true? In my test case below, it is always false, so it defaults to the else path. And I think that is going to be the "normal" case on these systems, is that right?

Doing a 'quick' test using next-20250213 on a 16k PAGE_SIZE system with the NVMe driver and a 4k lbs disk + ~1h of fio sequential writes, I get results indicating a write performance degradation of ~0.8%. This is due to the new loop condition doing 4k steps rather than PAGE_SIZE-sized ones. I guess it's going to be slightly worse the larger the system's PAGE_SIZE (and the larger the bio).

So, why not decrease the minimum segment size only for the cases where it's actually needed, rather than making it the default?

I've measured bio_split_rw_at() latency in the above test, with the following results:

next-20250213:

This is a power-of-2 time (nanoseconds) histogram for all nr segments cases for my DUT nvme drive:

@ns[fio]:
[128, 256)            11 |                                                    |
[256, 512)        242268 |                                                    |
[512, 1K)       16153055 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)          172281 |                                                    |
[2K, 4K)            3762 |                                                    |
[4K, 8K)            1184 |                                                    |
[8K, 16K)           7581 |                                                    |
[16K, 32K)          1516 |                                                    |
[32K, 64K)           124 |                                                    |
[64K, 128K)            9 |                                                    |
[128K, 256K)           4 |                                                    |
[256K, 512K)           2 |                                                    |
[512K, 1M)             1 |                                                    |

Same power-of-2 histogram for the 64 nr segments case only. Note: I couldn't trigger a much larger nr segments case.
@latency[64]:
[256, 512)        379481 |@@                                                  |
[512, 1K)        8096428 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)            1592 |                                                    |
[2K, 4K)            1181 |                                                    |
[4K, 8K)             332 |                                                    |
[8K, 16K)           3027 |                                                    |
[16K, 32K)           669 |                                                    |
[32K, 64K)            63 |                                                    |
[64K, 128K)            5 |                                                    |
[128K, 256K)           2 |                                                    |
[256K, 512K)           1 |                                                    |

next-20250213 + BLK_MIN_SEGMENT_SIZE patch:

@ns[fio]:
[128, 256)             3 |                                                    |
[256, 512)         76345 |                                                    |
[512, 1K)       15514581 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)          755336 |@@                                                  |
[2K, 4K)            7063 |                                                    |
[4K, 8K)            3301 |                                                    |
[8K, 16K)           4574 |                                                    |
[16K, 32K)          1816 |                                                    |
[32K, 64K)           155 |                                                    |
[64K, 128K)           22 |                                                    |
[128K, 256K)           6 |                                                    |
[256K, 512K)           4 |                                                    |
[512K, 1M)             2 |                                                    |
[1M, 2M)               2 |                                                    |
[2M, 4M)               1 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              1 |                                                    |

@latency[64]:
[256, 512)          6800 |                                                    |
[512, 1K)        7540158 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)            5747 |                                                    |
[2K, 4K)            2578 |                                                    |
[4K, 8K)             996 |                                                    |
[8K, 16K)           1960 |                                                    |
[16K, 32K)           773 |                                                    |
[32K, 64K)            54 |                                                    |
[64K, 128K)            5 |                                                    |
[128K, 256K)           3 |                                                    |
[256K, 512K)           3 |                                                    |
[512K, 1M)             2 |                                                    |

So comparing the above, I notice that the new routine is slightly slower, resulting in a minor degradation. Do you think it is worth exploring a fix and returning PAGE_SIZE as BLK_MIN_SEGMENT_SIZE for the cases where we don't need to go further?
>  			nsegs++;
>  			bytes += bv.bv_len;
>  		} else {
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index c44dadc35e1e..539a64ad7989 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -303,7 +303,7 @@ int blk_validate_limits(struct queue_limits *lim)
>  	max_hw_sectors = min_not_zero(lim->max_hw_sectors,
>  			lim->max_dev_sectors);
>  	if (lim->max_user_sectors) {
> -		if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
> +		if (lim->max_user_sectors < BLK_MIN_SEGMENT_SIZE / SECTOR_SIZE)
>  			return -EINVAL;
>  		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
>  	} else if (lim->io_opt > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) {
> @@ -341,7 +341,7 @@ int blk_validate_limits(struct queue_limits *lim)
>  	 */
>  	if (!lim->seg_boundary_mask)
>  		lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
> -	if (WARN_ON_ONCE(lim->seg_boundary_mask < PAGE_SIZE - 1))
> +	if (WARN_ON_ONCE(lim->seg_boundary_mask < BLK_MIN_SEGMENT_SIZE - 1))
>  		return -EINVAL;
>
>  	/*
> @@ -362,7 +362,7 @@ int blk_validate_limits(struct queue_limits *lim)
>  	 */
>  	if (!lim->max_segment_size)
>  		lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
> -	if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
> +	if (WARN_ON_ONCE(lim->max_segment_size < BLK_MIN_SEGMENT_SIZE))
>  		return -EINVAL;
>  	}
>
> diff --git a/block/blk.h b/block/blk.h
> index 90fa5f28ccab..0eca1687bec4 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -358,8 +358,12 @@ struct bio *bio_split_zone_append(struct bio *bio,
>  static inline bool bio_may_need_split(struct bio *bio,
>  		const struct queue_limits *lim)
>  {
> -	return lim->chunk_sectors || bio->bi_vcnt != 1 ||
> -		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
> +	if (lim->chunk_sectors)
> +		return true;
> +	if (bio->bi_vcnt != 1)
> +		return true;
> +	return bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset >
> +		BLK_MIN_SEGMENT_SIZE;
>  }
>
>  /**
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 248416ecd01c..2021b2174268 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1163,6 +1163,8 @@ static inline bool bdev_is_partition(struct block_device *bdev)
>  enum blk_default_limits {
>  	BLK_MAX_SEGMENTS	= 128,
>  	BLK_SAFE_MAX_SECTORS	= 255,
> +	/* use minimized PAGE_SIZE as min segment size hint */
> +	BLK_MIN_SEGMENT_SIZE	= 4096,
>  	BLK_MAX_SEGMENT_SIZE	= 65536,
>  	BLK_SEG_BOUNDARY_MASK	= 0xFFFFFFFFUL,
>  };
> --
> 2.47.1
>