After taking a closer look, I don't see chunk_sectors being accounted
for in the splitting code. If I understand correctly, chunk_sectors is
implemented so that it allows one big IO to go through. In other words,
it tries to avoid the splitting code if possible. But it doesn't seem to
guarantee that when an IO is split, it will be split at the configured
alignment.

With commit 07173c3ec276cbb18dc0e0687d37d310e98a1480 ("block: enable
multipage bvecs"), we now see bigger multi-page IOs. So, if an app sends
out multiple writes, they can all get combined into one big IO.
chunk_sectors would allow that one big IO to go through *ONLY IF* the
total size is < max_sectors. If that's not the case, chunk_sectors
doesn't seem to have control over how this big IO will be split.

Please correct me if I'm wrong.

On Mon, Jun 15, 2020 at 7:50 PM harshad shirwadkar
<harshadshirwadkar@xxxxxxxxx> wrote:
>
> Thanks for the feedback everyone, I totally overlooked chunk_sectors and
> it sounds like that's exactly what we want. I'll send another patch
> where that becomes writable.
>
>
> On Mon, Jun 15, 2020 at 7:40 PM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> >
> > On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> > > This feature allows the user to control the alignment at which request
> > > queue is allowed to split bios. Google CloudSQL's 16k user space
> > > application expects that direct io writes aligned at 16k boundary in
> > > the user-space are not split by kernel at non-16k boundaries. More
> > > details about this feature can be found in CloudSQL's Cloud Next 2018
> > > presentation[1]. The underlying block device is capable of performing
> > > 16k aligned writes atomically. Thus, this allows the user-space SQL
> > > application to avoid double-writes (to protect against partial
> > > failures) which are very costly provided that these writes are not
> > > split at non-16k boundary by any underlying layers.
> > >
> > > We make use of Ext4's bigalloc feature to ensure that writes issued by
> > > Ext4 are 16k aligned. But, 16K aligned data writes may get merged with
> > > contiguous non-16k aligned Ext4 metadata writes. Such a write request
> > > would be broken by the kernel only guaranteeing that the individually
> > > split requests are physical block size aligned.
> > >
> > > We started observing a significant increase in 16k unaligned splits in
> > > 5.4. Bisect points to commit 07173c3ec276cbb18dc0e0687d37d310e98a1480
> > > ("block: enable multipage bvecs"). This patch enables multipage bvecs
> > > resulting in multiple 16k aligned writes issued by the user-space to
> > > be merged into one big IO at first. Later, __blk_queue_split() splits
> > > these IOs while trying to align individual split IOs to be physical
> > > block size.
> > >
> > > Newly added split_alignment parameter is the alignment at which
> > > request queue is allowed to split IO request. By default this
> > > alignment is turned off and current behavior is unchanged.
> > >
> >
> > Such alignment can be reached via q->limits.chunk_sectors, and you
> > just need to expose it via sysfs and make it writable.
> >
> > Thanks,
> > Ming
> >
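
To make the split-alignment concern concrete, here is a minimal user-space sketch of the arithmetic only. It is not kernel code and does not reproduce what __blk_queue_split() does. The function names plain_split_len() and aligned_split_len() are hypothetical. The sketch assumes 512-byte sectors (so 16k = 32 sectors) and an alignment that is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: length of the next piece when a large IO is cut
 * purely by the size limit, with no regard for any larger alignment.
 */
static uint32_t plain_split_len(uint32_t remaining, uint32_t max_sectors)
{
    return remaining < max_sectors ? remaining : max_sectors;
}

/*
 * Hypothetical model: length of the next piece when the split point is
 * rounded down to a multiple of align_sectors (a power of two).
 * offset and remaining are in sectors.
 */
static uint32_t aligned_split_len(uint64_t offset, uint32_t remaining,
                                  uint32_t max_sectors, uint32_t align_sectors)
{
    uint32_t len = plain_split_len(remaining, max_sectors);
    uint64_t aligned_end;

    assert(align_sectors != 0 && (align_sectors & (align_sectors - 1)) == 0);

    /* The whole remainder fits under the limit: no split is needed. */
    if (len == remaining)
        return len;

    /* Round the proposed split point down to the alignment. */
    aligned_end = (offset + len) & ~(uint64_t)(align_sectors - 1);

    /* If no aligned boundary lies inside this piece, fall back. */
    if (aligned_end <= offset)
        return len;

    return (uint32_t)(aligned_end - offset);
}
```

With an illustrative limit of 40 sectors, a 96-sector IO starting at sector 0 (three merged 16k writes) is cut at sector 40 by plain_split_len(), which is in the middle of the second 16k write. aligned_split_len() with align_sectors = 32 instead yields three 32-sector pieces.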