On 3/24/24 10:24 PM, Matthew Wilcox wrote: ...
It's complicated. On the one hand, it's "more likely" because there are more tail pages than there are head pages or order-0 pages. On the other hand, a _lot_ of the time we call compound_head(), it's done with a non-tail page because we tend to pass around head pages (eg,
ah yes, that's true.
pmd_page() on hugetlbfs, or looking up a folio in the page cache and passing &folio->page to some function that's not yet converted. On the third hand, does the compiler really do much with the annotation? Before your patch: 27d6: a8 01 test $0x1,%al 27d8: 75 02 jne 27dc <clear_refs_pte_range+0x9c>
I should have thought to check this. Usually I'll see a change between je/jne if __builtin_expect is doing its job. Here it is, oddly, missing in action. Maybe I'll look a little closer into why that is...
27da: eb 59 jmp 2835 <clear_refs_pte_range+0xf5> 27dc: 49 8b 44 24 08 mov 0x8(%r12),%rax 27e1: a8 01 test $0x1,%al 27e3: 75 6f jne 2854 <clear_refs_pte_range+0x114> 27e5: eb 73 jmp 285a <clear_refs_pte_range+0x11a> With your patch: 1ee6: a8 01 test $0x1,%al 1ee8: 75 02 jne 1eec <clear_refs_pte_range+0x9c> 1eea: eb 5f jmp 1f4b <clear_refs_pte_range+0xfb> 1eec: 49 8b 44 24 08 mov 0x8(%r12),%rax 1ef1: a8 01 test $0x1,%al 1ef3: 75 50 jne 1f45 <clear_refs_pte_range+0xf5> 1ef5: eb 6c jmp 1f63 <clear_refs_pte_range+0x113> Looks pretty much the same. bloat-o-meter says: $ ./scripts/bloat-o-meter before.o after.o add/remove: 0/0 grow/shrink: 2/4 up/down: 32/-48 (-16) Function old new delta gather_stats.constprop 730 753 +23 smaps_hugetlb_range 635 644 +9 smaps_page_accumulate 342 338 -4 clear_refs_pte_range 339 328 -11 pagemap_hugetlb_range 422 407 -15 smaps_pte_range 1406 1388 -18 Total: Before=20066, After=20050, chg -0.08% (I was looking at clear_refs_pte_range above). This seems marginal. The benefits of removing a call to compound_head are much less ambiguous: $ ./scripts/bloat-o-meter before.o .build/fs/proc/task_mmu.o add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-101 (-101) Function old new delta clear_refs_pte_range 339 238 -101 Total: Before=20066, After=19965, chg -0.50% I'd describe that as replacing four calls to compound_head() with two: - page = pmd_page(*pmd); + folio = page_folio(pmd_page(*pmd)); /* Clear accessed and referenced bits. */ pmdp_test_and_clear_young(vma, addr, pmd); - test_and_clear_page_young(page); - ClearPageReferenced(page); + folio_test_clear_young(folio); + folio_clear_referenced(folio); ... - page = vm_normal_page(vma, addr, ptent); - if (!page) + folio = vm_normal_folio(vma, addr, ptent); + if (!folio) continue; /* Clear accessed and referenced bits. */ ptep_test_and_clear_young(vma, addr, pte); - test_and_clear_page_young(page); - ClearPageReferenced(page); + folio_test_clear_young(folio); + folio_clear_referenced(folio); I'm not saying this patch is necessarily wrong, I just think it's "not proven".
I appreciate your looking at this and explaining the analysis steps you used! thanks, -- John Hubbard NVIDIA