On 04/06/2024 16:01, Sagi Grimberg wrote:
>
> On 04/06/2024 11:24, Sagi Grimberg wrote:
>>
>> On 04/06/2024 7:27, Christoph Hellwig wrote:
>>> On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
>>>>>> I still don't understand how a page in the middle of a contiguous
>>>>>> range ends up coming from the slab while others don't.
>>>>> I haven't investigated the origin of the IO yet. I suspect the first
>>>>> 2 pages are the superblocks of the raid (mdp_superblock_1 and
>>>>> bitmap_super_s) and the rest of the IO is the bitmap.
>>>> Well, if these indeed are different origins and just *happen* to be a
>>>> mixture of slab originated pages and non-slab pages combined together
>>>> in a single bvec entry of a bio, I'd suspect that it would be more
>>>> beneficial to split the bvec (essentially not allow bio_add_page to
>>>> append the page to the tail bvec) depending on a queue limit (similar
>>>> to how we handle sg gaps).

I have investigated the origin of the IO. It's a bug in the md-bitmap
code. __write_sb_page() sends a single IO; it rounds the IO size up to
the optimal IO size but doesn't check whether the resulting size runs
past the pages it actually allocated. The slab pages aren't allocated
by md-bitmap, they are simply whatever pages happen to sit right after
the allocated ones.

I'm applying a patch to the md subsystem asap.

I have added some logs to test the theory:

...
md: created bitmap (1 pages) for device md127
__write_sb_page before md_super_write. offset: 16, size: 262144. pfn: 0x53ee
### __write_sb_page before md_super_write. logging pages ###
pfn: 0x53ee, slab: 0
pfn: 0x53ef, slab: 1
pfn: 0x53f0, slab: 0
pfn: 0x53f1, slab: 0
pfn: 0x53f2, slab: 0
pfn: 0x53f3, slab: 1
...
nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0
skbuff: before sendpage_ok() - pfn: 0x53ee
skbuff: before sendpage_ok() - pfn: 0x53ef
WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450
skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1
...

Only 1 page is allocated for the bitmap, but __write_sb_page() tries to
write 64 pages (262144 bytes).

>>> So you want to add a PageSlab check to bvec_try_merge_page? That
>>> sounds fairly expensive..
>>
>> The check needs to happen somewhere apparently, and given that it will
>> be gated by a queue flag, only request queues that actually need it
>> will suffer, but they will suffer anyways...
>
> What I now see is that we will check PageSlab twice (bvec last index
> and append page) and skb_splice_from_iter checks it again... How many
> times do we check this :)
>
> Would be great if the network stack could just check it once and fall
> back to a page copy...

Nevertheless I think a check in the network stack or when merging bios
is still necessary.
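
For reference, the check that skb_splice_from_iter() applies today is
sendpage_ok() from include/linux/net.h; in the kernels I looked at it is
essentially the following (the exact form may differ between versions):

static inline bool sendpage_ok(struct page *page)
{
	/* slab pages and pages with a zero refcount are not safe to hand
	 * to the zero-copy send path */
	return !PageSlab(page) && page_count(page) >= 1;
}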
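
And going back to the md-bitmap overrun above, here is a minimal userspace
sketch of the arithmetic, just to illustrate the failure mode and the kind
of clamp I mean. All names and numbers are illustrative; this is not the
actual patch:

/*
 * Sketch of the overrun: the bitmap write is rounded up to the device's
 * optimal I/O size without checking how many pages back the bitmap.
 */
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* round @len up to the next multiple of @align */
static unsigned long roundup_len(unsigned long len, unsigned long align)
{
	return ((len + align - 1) / align) * align;
}

int main(void)
{
	unsigned long alloc_pages = 1;       /* pages backing the bitmap (see log)      */
	unsigned long bitmap_bytes = 3200;   /* bytes that actually need to be written  */
	unsigned long opt_io_size = 262144;  /* optimal I/O size reported by the device */

	/* Buggy behaviour: round up without looking at the allocation. */
	unsigned long io_size = roundup_len(bitmap_bytes, opt_io_size);
	printf("unclamped: %lu bytes = %lu pages (only %lu allocated)\n",
	       io_size, io_size / PAGE_SIZE, alloc_pages);

	/* Fix idea: never write past the pages the bitmap actually owns. */
	unsigned long limit = alloc_pages * PAGE_SIZE;
	if (io_size > limit)
		io_size = limit;
	printf("clamped:   %lu bytes = %lu pages\n",
	       io_size, io_size / PAGE_SIZE);

	return 0;
}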