On Tue, Mar 14, 2023 at 06:46:19PM -0400, Theodore Ts'o wrote: > On Thu, Jan 26, 2023 at 08:24:03PM +0000, Matthew Wilcox (Oracle) wrote: > > Remove a call to compound_head(). > > Same question as with other commits, plus one more; why is it notable > that calls to compound_head() are being reduced. I've looked at the > implementation, and it doesn't look all _that_ heavyweight.... It gets a lot more heavyweight when you turn on CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, which more and more distro kernels are doing, because it's such a win for VM host kernels. eg SuSE do it here: https://github.com/SUSE/kernel-source/blob/master/config/x86_64/default and UEK does it here: https://github.com/oracle/linux-uek/blob/uek7/ga/uek-rpm/ol9/config-x86_64 Debian also has it enabled. It didn't use to be so expensive, but now it's something like 50-60 bytes of text per invocation on x86 [1]. And the compiler doesn't get to remember the result of calling compound_head() because we might have changed page->compound_head between invocations. It doesn't even know that compound_head() is idempotent. Anyway, each of these patches can be justified as "This patch shrinks the kernel by 0.0001%". Of course my real motivation for doing this is to reduce the number of callers of the page APIs so we can start to remove them and lessen the cognitive complexity of having both page & folio APIs that parallel each other. And I would like ext4 to support large folios sometime soon, and it's a step towards that goal too. But that's a lot to write out in each changelog. [1] For example, the disassembly of unlock_page() with the UEK config: c0: f3 0f 1e fa endbr64 c4: e8 00 00 00 00 call c9 <unlock_page+0x9> c5: R_X86_64_PLT32 __fentry__-0x4 c9: 55 push %rbp ca: 48 8b 47 08 mov 0x8(%rdi),%rax ce: 48 89 e5 mov %rsp,%rbp d1: a8 01 test $0x1,%al d3: 75 2f jne 104 <unlock_page+0x44> d5: eb 0b jmp e2 <unlock_page+0x22> d7: e8 00 00 00 00 call dc <unlock_page+0x1c> d8: R_X86_64_PLT32 folio_unlock-0x4 dc: 5d pop %rbp dd: e9 00 00 00 00 jmp e2 <unlock_page+0x22> de: R_X86_64_PLT32 __x86_return_thunk-0x4 e2: f7 c7 ff 0f 00 00 test $0xfff,%edi e8: 75 ed jne d7 <unlock_page+0x17> ea: 48 8b 07 mov (%rdi),%rax ed: a9 00 00 01 00 test $0x10000,%eax f2: 74 e3 je d7 <unlock_page+0x17> f4: 48 8b 47 48 mov 0x48(%rdi),%rax f8: 48 8d 50 ff lea -0x1(%rax),%rdx fc: a8 01 test $0x1,%al fe: 48 0f 45 fa cmovne %rdx,%rdi 102: eb d3 jmp d7 <unlock_page+0x17> 104: 48 8d 78 ff lea -0x1(%rax),%rdi 108: e8 00 00 00 00 call 10d <unlock_page+0x4d> 109: R_X86_64_PLT32 folio_unlock-0x4 10d: 5d pop %rbp 10e: e9 00 00 00 00 jmp 113 <unlock_page+0x53> 10f: R_X86_64_PLT32 __x86_return_thunk-0x4 113: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1) 11a: 00 00 00 00 11e: 66 90 xchg %ax,%ax Everything between 0xd3 and 0x104 is "maybe it's a fake head". That's 41 bytes as a minimum per callsite, and typically it's much more becase we also need the test for PageTail and the lea for the actual compound_head.