On 19/04/24 03:16PM, Matthew Wilcox wrote:
> On Fri, Apr 19, 2024 at 02:47:21PM +0530, Kundan Kumar wrote:
> > When mTHP is enabled, IO can contain larger folios instead of pages.
> > In such cases add a larger size to the bio instead of looping through
> > pages. This reduces the overhead of iterating through pages for larger
> > block sizes. perf diff before and after this change:
> > Perf diff for write I/O with 128K block size:
> > 1.22% -0.97% [kernel.kallsyms] [k] bio_iov_iter_get_pages
> > Perf diff for read I/O with 128K block size:
> > 4.13% -3.26% [kernel.kallsyms] [k] bio_iov_iter_get_pages
> I'm a bit confused by this to be honest. We already merge adjacent
> pages, and it doesn't look to be _that_ expensive. Can you drill down
> any further in the perf stats and show what the expensive part is?
The majority of the overhead comes from repeated calls to bvec_try_merge_page().
For a 128K I/O we call this function 32 times. bvec_try_merge_page() does
comparisons and calculations which add to the overall overhead[1].
bio_iov_iter_get_pages() shows a reduction of overhead at these places[2].
This patch reduces the overhead, as evident from the perf diff:
4.17% -3.21% [kernel.kallsyms] [k] bio_iov_iter_get_pages
Also:
5.54%        [kernel.kallsyms] [k] bvec_try_merge_page
The above perf diff was obtained by running the fio command[3].
Note: These experiments were done after enabling mTHP, where we get a
single folio for a 128K I/O.
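
To put numbers on it: with 4K pages a 128K I/O means 32 trips through
bio_iov_add_page(), most of which end up in bvec_try_merge_page(); with a
single 128K folio the same range can be appended in one go. A rough sketch
of the idea only, not the patch itself (bio_add_folio() is used here just
to show the shape of a single large append):

	/* per-page today: 32 appends for a 128K I/O, almost all of them
	 * attempting a merge with the previous bvec */
	for (i = 0; i < 32; i++)
		bio_iov_add_page(bio, pages[i], PAGE_SIZE, 0);

	/* with one 128K folio, a single append of the whole range, e.g. */
	bio_add_folio(bio, folio, SZ_128K, 0);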
[1]
I. : 14 size_t bv_end = bv->bv_offset + bv->bv_len;
: 15 phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + bv_end - 1;
: 16 phys_addr_t page_addr = page_to_phys(page);
: 18 if (vec_end_addr + 1 != page_addr + off)
3.21 : ffffffff817ac796: mov %ecx,%eax
: 20 {
1.40 : ffffffff817ac798: mov %rsp,%rbp
3.14 : ffffffff817ac79b: push %r15
2.35 : ffffffff817ac79d: push %r14
3.13 : ffffffff817ac7a4: push %r12
II. : 113 if (bv->bv_page + bv_end / PAGE_SIZE != page + off / PAGE_SIZE)
1.84 : ffffffff817ac83e: shr $0xc,%ecx
3.09 : ffffffff817ac841: shr $0xc,%r15
8.52 : ffffffff817ac84d: add 0x0(%r13),%r15
0.65 : ffffffff817ac851: add %rcx,%r14
0.62 : ffffffff817ac854: cmp %r14,%r15
0.61 : ffffffff817ac857: je ffffffff817ac86e <bvec_try_merge_page+0xde>
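
The hot instructions above map to the physical-contiguity check and the
struct-page adjacency check in bvec_try_merge_page(); roughly, paraphrasing
block/bio.c:

	size_t bv_end = bv->bv_offset + bv->bv_len;
	phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + bv_end - 1;
	phys_addr_t page_addr = page_to_phys(page);

	/* I.: the new page must start exactly where the last bvec ends */
	if (vec_end_addr + 1 != page_addr + off)
		return false;

	/* (same-page detection omitted) */

	/* II.: otherwise the struct pages must be adjacent as well */
	if (bv->bv_page + bv_end / PAGE_SIZE != page + off / PAGE_SIZE)
		return false;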
[2]
I. : 206 struct page *page = pages[i];
4.92 : ffffffff817af307: mov -0x40(%rbp),%rdx
3.97 : ffffffff817af30b: mov %r13d,%eax
II. : 198 for (left = size, i = 0; left > 0; left -= len, i++) {
0.95 : ffffffff817af2f0: add $0x1,%r13d
4.80 : ffffffff817af2f6: sub %rbx,%r12
2.87 : ffffffff817af2f9: test %r12,%r12
III. : 167 if (WARN_ON_ONCE(bio->bi_iter.bi_size > UINT_MAX - len))
3.91 : ffffffff817af295: jb ffffffff817af547 <bio_iov_iter_get_pages+0x3e7>
: 169 if (bio->bi_vcnt > 0 &&
2.98 : ffffffff817af2a0: test %ax,%ax
: 173 bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
3.45 : ffffffff817af2a5: movzwl %ax,%edi
1.07 : ffffffff817af2ab: lea -0x41(%rbp),%r8
3.08 : ffffffff817af2af: mov %r10,-0x50(%rbp)
5.77 : ffffffff817af2b3: sub $0x1,%edi
0.96 : ffffffff817af2b6: mov %ebx,-0x58(%rbp)
0.95 : ffffffff817af2b9: movslq %edi,%rdi
2.88 : ffffffff817af2bc: shl $0x4,%rdi
0.96 : ffffffff817af2c0: add 0x70(%r15),%rdi
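
Similarly, the source lines above are the per-page loop in
__bio_iov_iter_get_pages() plus bio_iov_add_page(), which is inlined here
and re-runs these checks for every single 4K page before attempting the
merge; roughly, paraphrasing block/bio.c:

	/* I./II.: one loop iteration per page of the I/O
	 * (error handling omitted) */
	for (left = size, i = 0; left > 0; left -= len, i++) {
		struct page *page = pages[i];

		len = min_t(size_t, PAGE_SIZE - offset, left);
		bio_iov_add_page(bio, page, len, offset);
		offset = 0;
	}

	/* III.: inlined bio_iov_add_page(), executed for each page */
	if (WARN_ON_ONCE(bio->bi_iter.bi_size > UINT_MAX - len))
		return -EIO;

	if (bio->bi_vcnt > 0 &&
	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
				page, len, offset, &same_page)) {
		/* merged into the previous bvec */
	}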
[3]
perf record -o fio_128k_block_read.data fio -iodepth=128 -iomem_align=128K\
-iomem=mmap -rw=randread -direct=1 -ioengine=io_uring -bs=128K -numjobs=1\
-runtime=1m -group_reporting -filename=/dev/nvme1n1 -name=io_uring_test
--Kundan