Thanks for the feedback, everyone. I totally overlooked chunk_sectors, and it
sounds like that's exactly what we want. I'll send another patch that makes it
writable.

On Mon, Jun 15, 2020 at 7:40 PM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> On Mon, Jun 15, 2020 at 05:56:33PM -0700, Harshad Shirwadkar wrote:
> > This feature allows the user to control the alignment at which request
> > queue is allowed to split bios. Google CloudSQL's 16k user space
> > application expects that direct io writes aligned at 16k boundary in
> > the user-space are not split by kernel at non-16k boundaries. More
> > details about this feature can be found in CloudSQL's Cloud Next 2018
> > presentation[1]. The underlying block device is capable of performing
> > 16k aligned writes atomically. Thus, this allows the user-space SQL
> > application to avoid double-writes (to protect against partial
> > failures) which are very costly provided that these writes are not
> > split at non-16k boundary by any underlying layers.
> >
> > We make use of Ext4's bigalloc feature to ensure that writes issued by
> > Ext4 are 16k aligned. But, 16K aligned data writes may get merged with
> > contiguous non-16k aligned Ext4 metadata writes. Such a write request
> > would be broken by the kernel only guaranteeing that the individually
> > split requests are physical block size aligned.
> >
> > We started observing a significant increase in 16k unaligned splits in
> > 5.4. Bisect points to commit 07173c3ec276cbb18dc0e0687d37d310e98a1480
> > ("block: enable multipage bvecs"). This patch enables multipage bvecs
> > resulting in multiple 16k aligned writes issued by the user-space to
> > be merged into one big IO at first. Later, __blk_queue_split() splits
> > these IOs while trying to align individual split IOs to be physical
> > block size.
> >
> > Newly added split_alignment parameter is the alignment at which
> > request queue is allowed to split IO request. By default this
> > alignment is turned off and current behavior is unchanged.
> >
>
> Such alignment can be reached via q->limits.chunk_sectors, and you
> just need to expose it via sysfs and make it writable.
>
> Thanks,
> Ming
>
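
For readers following the thread, here is a minimal user-space sketch of the arithmetic that a chunk-aligned split limit implies. It is not kernel code: the helper names sectors_to_chunk_boundary() and first_split_len() are hypothetical, and it assumes 512-byte sectors and a power-of-two chunk size (16k would be 32 sectors).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only (not a kernel API).
 * Returns how many sectors fit between `sector` and the next chunk
 * boundary. Assumes chunk_sectors is a nonzero power of two.
 */
static uint32_t sectors_to_chunk_boundary(uint64_t sector,
                                          uint32_t chunk_sectors)
{
    assert(chunk_sectors != 0 &&
           (chunk_sectors & (chunk_sectors - 1)) == 0);
    /* Offset within the current chunk, then the room left in it. */
    return chunk_sectors - (uint32_t)(sector & (chunk_sectors - 1));
}

/*
 * Hypothetical helper: length of the first piece of an IO that starts
 * at `sector` and spans `nr_sectors`, if pieces must not cross a chunk
 * boundary.
 */
static uint32_t first_split_len(uint64_t sector, uint32_t nr_sectors,
                                uint32_t chunk_sectors)
{
    uint32_t room = sectors_to_chunk_boundary(sector, chunk_sectors);

    return nr_sectors < room ? nr_sectors : room;
}
```

With a 32-sector (16k) chunk, an IO starting at sector 8 and spanning 64 sectors would first be cut after 24 sectors, so every following piece starts on a 16k boundary.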