Re: [PATCH 19/31] ext4: Convert ext4_journalled_zero_new_buffers() to use a folio

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Fri, 24 Mar 2023 04:15:17 +0000

On Tue, Mar 14, 2023 at 06:46:19PM -0400, Theodore Ts'o wrote:
> On Thu, Jan 26, 2023 at 08:24:03PM +0000, Matthew Wilcox (Oracle) wrote:
> > Remove a call to compound_head().
> 
> Same question as with other commits, plus one more; why is it notable
> that calls to compound_head() are being reduced.  I've looked at the
> implementation, and it doesn't look all _that_ heavyweight....

It gets a lot more heavyweight when you turn on
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, which more and more distro
kernels are doing, because it's such a win for VM host kernels.
eg SuSE do it here:
https://github.com/SUSE/kernel-source/blob/master/config/x86_64/default
and UEK does it here:
https://github.com/oracle/linux-uek/blob/uek7/ga/uek-rpm/ol9/config-x86_64
Debian also has it enabled.

It didn't use to be so expensive, but now it's something like 50-60
bytes of text per invocation on x86 [1].  And the compiler doesn't get
to remember the result of calling compound_head() because we might have
changed page->compound_head between invocations.  It doesn't even know
that compound_head() is idempotent.

Anyway, each of these patches can be justified as "This patch shrinks
the kernel by 0.0001%".  Of course my real motivation for doing this
is to reduce the number of callers of the page APIs so we can start to
remove them and lessen the cognitive complexity of having both page &
folio APIs that parallel each other.  And I would like ext4 to support
large folios sometime soon, and it's a step towards that goal too.
But that's a lot to write out in each changelog.

[1] For example, the disassembly of unlock_page() with the UEK
config:

  c0:   f3 0f 1e fa             endbr64
  c4:   e8 00 00 00 00          call   c9 <unlock_page+0x9>
                        c5: R_X86_64_PLT32      __fentry__-0x4
  c9:   55                      push   %rbp
  ca:   48 8b 47 08             mov    0x8(%rdi),%rax
  ce:   48 89 e5                mov    %rsp,%rbp
  d1:   a8 01                   test   $0x1,%al
  d3:   75 2f                   jne    104 <unlock_page+0x44>
  d5:   eb 0b                   jmp    e2 <unlock_page+0x22>
  d7:   e8 00 00 00 00          call   dc <unlock_page+0x1c>
                        d8: R_X86_64_PLT32      folio_unlock-0x4
  dc:   5d                      pop    %rbp
  dd:   e9 00 00 00 00          jmp    e2 <unlock_page+0x22>
                        de: R_X86_64_PLT32      __x86_return_thunk-0x4
  e2:   f7 c7 ff 0f 00 00       test   $0xfff,%edi
  e8:   75 ed                   jne    d7 <unlock_page+0x17>
  ea:   48 8b 07                mov    (%rdi),%rax
  ed:   a9 00 00 01 00          test   $0x10000,%eax
  f2:   74 e3                   je     d7 <unlock_page+0x17>
  f4:   48 8b 47 48             mov    0x48(%rdi),%rax
  f8:   48 8d 50 ff             lea    -0x1(%rax),%rdx
  fc:   a8 01                   test   $0x1,%al
  fe:   48 0f 45 fa             cmovne %rdx,%rdi
 102:   eb d3                   jmp    d7 <unlock_page+0x17>
 104:   48 8d 78 ff             lea    -0x1(%rax),%rdi
 108:   e8 00 00 00 00          call   10d <unlock_page+0x4d>
                        109: R_X86_64_PLT32     folio_unlock-0x4
 10d:   5d                      pop    %rbp
 10e:   e9 00 00 00 00          jmp    113 <unlock_page+0x53>
                        10f: R_X86_64_PLT32     __x86_return_thunk-0x4
 113:   66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
 11a:   00 00 00 00
 11e:   66 90                   xchg   %ax,%ax

Everything between 0xd3 and 0x104 is "maybe it's a fake head".  That's 41
bytes as a minimum per callsite, and typically it's much more becase we
also need the test for PageTail and the lea for the actual compound_head.