On 19/04/24 03:16PM, Matthew Wilcox wrote:
> On Fri, Apr 19, 2024 at 02:47:21PM +0530, Kundan Kumar wrote:
> > When mTHP is enabled, IO can contain larger folios instead of pages.
> > In such cases add a larger size to the bio instead of looping through
> > pages. This reduces the overhead of iterating through pages for larger
> > block sizes. perf diff before and after this change:
> > Perf diff for write I/O with 128K block size:
> > 1.22% -0.97% [kernel.kallsyms] [k] bio_iov_iter_get_pages
> > Perf diff for read I/O with 128K block size:
> > 4.13% -3.26% [kernel.kallsyms] [k] bio_iov_iter_get_pages
> I'm a bit confused by this to be honest. We already merge adjacent
> pages, and it doesn't look to be _that_ expensive. Can you drill down
> any further in the perf stats and show what the expensive part is?
The majority of the overhead comes from repeated calls to bvec_try_merge_page().
For a 128K I/O we call this function 32 times. bvec_try_merge_page() does
comparisons and calculations which add to the overall overhead[1].
bio_iov_iter_get_pages() shows a reduction of overhead at these places[2].
This patch reduces the overhead, as evident from the perf diff:
4.17% -3.21% [kernel.kallsyms] [k] bio_iov_iter_get_pages
Also:
5.54%        [kernel.kallsyms] [k] bvec_try_merge_page
The above perf diff was obtained by running the fio command[3].
Note: These experiments were done after enabling mTHP, where we get a
single folio for a 128K I/O.
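
To put numbers on it: with 4K pages a 128K I/O means 32 trips through
bio_iov_add_page(), most of which end up in bvec_try_merge_page(); with a
single 128K folio the same range can be appended in one go. A rough sketch
of the idea only, not the patch itself (bio_add_folio() is used here just
to show the shape of a single large append):

	/* per-page today: 32 appends for a 128K I/O, almost all of them
	 * attempting a merge with the previous bvec */
	for (i = 0; i < 32; i++)
		bio_iov_add_page(bio, pages[i], PAGE_SIZE, 0);

	/* with one 128K folio, a single append of the whole range, e.g. */
	bio_add_folio(bio, folio, SZ_128K, 0);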
[1]
I. : 14 size_t bv_end = bv->bv_offset + bv->bv_len;
: 15 phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + bv_end - 1;
: 16 phys_addr_t page_addr = page_to_phys(page);
: 18 if (vec_end_addr + 1 != page_addr + off)
3.21 : ffffffff817ac796: mov %ecx,%eax
: 20 {
1.40 : ffffffff817ac798: mov %rsp,%rbp
3.14 : ffffffff817ac79b: push %r15
2.35 : ffffffff817ac79d: push %r14
3.13 : ffffffff817ac7a4: push %r12
II. : 113 if (bv->bv_page + bv_end / PAGE_SIZE != page + off / PAGE_SIZE)
1.84 : ffffffff817ac83e: shr $0xc,%ecx
3.09 : ffffffff817ac841: shr $0xc,%r15
8.52 : ffffffff817ac84d: add 0x0(%r13),%r15
0.65 : ffffffff817ac851: add %rcx,%r14
0.62 : ffffffff817ac854: cmp %r14,%r15
0.61 : ffffffff817ac857: je ffffffff817ac86e <bvec_try_merge_page+0xde>
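
The hot instructions above map to the physical-contiguity check and the
struct-page adjacency check in bvec_try_merge_page(); roughly, paraphrasing
block/bio.c:

	size_t bv_end = bv->bv_offset + bv->bv_len;
	phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + bv_end - 1;
	phys_addr_t page_addr = page_to_phys(page);

	/* I.: the new page must start exactly where the last bvec ends */
	if (vec_end_addr + 1 != page_addr + off)
		return false;

	/* (same-page detection omitted) */

	/* II.: otherwise the struct pages must be adjacent as well */
	if (bv->bv_page + bv_end / PAGE_SIZE != page + off / PAGE_SIZE)
		return false;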
[2]
I. : 206 struct page *page = pages[i];
4.92 : ffffffff817af307: mov -0x40(%rbp),%rdx
3.97 : ffffffff817af30b: mov %r13d,%eax
II. : 198 for (left = size, i = 0; left > 0; left -= len, i++) {
0.95 : ffffffff817af2f0: add $0x1,%r13d
4.80 : ffffffff817af2f6: sub %rbx,%r12
2.87 : ffffffff817af2f9: test %r12,%r12
III. : 167 if (WARN_ON_ONCE(bio->bi_iter.bi_size > UINT_MAX - len))
3.91 : ffffffff817af295: jb ffffffff817af547 <bio_iov_iter_get_pages+0x3e7>
: 169 if (bio->bi_vcnt > 0 &&
2.98 : ffffffff817af2a0: test %ax,%ax
: 173 bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
3.45 : ffffffff817af2a5: movzwl %ax,%edi
1.07 : ffffffff817af2ab: lea -0x41(%rbp),%r8
3.08 : ffffffff817af2af: mov %r10,-0x50(%rbp)
5.77 : ffffffff817af2b3: sub $0x1,%edi
0.96 : ffffffff817af2b6: mov %ebx,-0x58(%rbp)
0.95 : ffffffff817af2b9: movslq %edi,%rdi
2.88 : ffffffff817af2bc: shl $0x4,%rdi
0.96 : ffffffff817af2c0: add 0x70(%r15),%rdi
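
Similarly, the source lines above are the per-page loop in
__bio_iov_iter_get_pages() plus bio_iov_add_page(), which is inlined here
and re-runs these checks for every single 4K page before attempting the
merge; roughly, paraphrasing block/bio.c:

	/* I./II.: one loop iteration per page of the I/O
	 * (error handling omitted) */
	for (left = size, i = 0; left > 0; left -= len, i++) {
		struct page *page = pages[i];

		len = min_t(size_t, PAGE_SIZE - offset, left);
		bio_iov_add_page(bio, page, len, offset);
		offset = 0;
	}

	/* III.: inlined bio_iov_add_page(), executed for each page */
	if (WARN_ON_ONCE(bio->bi_iter.bi_size > UINT_MAX - len))
		return -EIO;

	if (bio->bi_vcnt > 0 &&
	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
				page, len, offset, &same_page)) {
		/* merged into the previous bvec */
	}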
[3]
perf record -o fio_128k_block_read.data fio -iodepth=128 -iomem_align=128K\
-iomem=mmap -rw=randread -direct=1 -ioengine=io_uring -bs=128K -numjobs=1\
-runtime=1m -group_reporting -filename=/dev/nvme1n1 -name=io_uring_test
--Kundan