On Tue, Aug 13, 2024 at 9:09 AM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
> On 8/12/24 00:49, Mateusz Guzik wrote:
> > On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
> >> Hi David,
> >>
> >> On 8/1/24 09:44, David Hildenbrand wrote:
> >>> On 01.08.24 15:37, Mateusz Guzik wrote:
> >>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@xxxxxxxxxx>
> >>>> wrote:
> >>>>>
> >>>>> On 01.08.24 15:30, Mateusz Guzik wrote:
> >>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> >>>>>>> Yes indeed. fork() can be extremely sensitive to each added
> >>>>>>> instruction.
> >>>>>>>
> >>>>>>> I even pointed out to Peter why I didn't add the PageHuge check
> >>>>>>> in there originally [1].
> >>>>>>>
> >>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
> >>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
> >>>>>>>
> >>>>>>> We now have to do a page_folio(page) and then test for hugetlb:
> >>>>>>>
> >>>>>>> 	return folio_test_hugetlb(page_folio(page));
> >>>>>>>
> >>>>>>> Nowadays, folio_test_hugetlb() will be faster than at c0bff412e6
> >>>>>>> times, so maybe at least part of the overhead is gone.
> >>>>>>>
> >>>>>>
> >>>>>> I'll note page_folio expands to a call to _compound_head.
> >>>>>>
> >>>>>> While _compound_head is declared as an inline, it ends up being big
> >>>>>> enough that the compiler decides to emit a real function instead, and
> >>>>>> real func calls are not particularly cheap.
> >>>>>>
> >>>>>> I had a brief look with a profiler myself, and for single-threaded
> >>>>>> usage the func is quite high up there, even though it manages to get
> >>>>>> out on the first branch -- that is to say, there is definitely
> >>>>>> performance lost to having a func call instead of an inlined branch.
> >>>>>>
> >>>>>> The routine is deinlined because of a call to page_fixed_fake_head,
> >>>>>> which itself is annotated with always_inline.
> >>>>>>
> >>>>>> This is of course patchable with minor shoveling.
> >>>>>>
> >>>>>> I did not go for it because stress-ng results were too unstable for
> >>>>>> me to confidently state win/loss.
> >>>>>>
> >>>>>> But should you want to whack the regression, this is what I would
> >>>>>> look into.
> >>>>>>
> >>>>>
> >>>>> This might improve it, at least for small folios I guess:
> >>
> >> Do you want us to test this change? Or do you have further optimization
> >> ongoing? Thanks.
> >
> > I verified the thing below boots; I have no idea about performance. If
> > it helps, it can be massaged later from a style perspective.
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 5769fe6e4950..2d5d61ab385b 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -194,34 +194,13 @@ enum pageflags {
> >  #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> >  DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
> >
> > -/*
> > - * Return the real head page struct iff the @page is a fake head page, otherwise
> > - * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
> > - */
> > +const struct page *_page_fixed_fake_head(const struct page *page);
> > +
> >  static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
> >  {
> >  	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> >  		return page;
> > -
> > -	/*
> > -	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> > -	 * struct page. The alignment check aims to avoid access the fields (
> > -	 * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
> > -	 * cold cacheline in some cases.
> > -	 */
> > -	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> > -	    test_bit(PG_head, &page->flags)) {
> > -		/*
> > -		 * We can safely access the field of the @page[1] with PG_head
> > -		 * because the @page is a compound page composed with at least
> > -		 * two contiguous pages.
> > -		 */
> > -		unsigned long head = READ_ONCE(page[1].compound_head);
> > -
> > -		if (likely(head & 1))
> > -			return (const struct page *)(head - 1);
> > -	}
> > -	return page;
> > +	return _page_fixed_fake_head(page);
> >  }
> >  #else
> >  static inline const struct page *page_fixed_fake_head(const struct page *page)
> > @@ -235,7 +214,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
> >  	return page_fixed_fake_head(page) != page;
> >  }
> >
> > -static inline unsigned long _compound_head(const struct page *page)
> > +static __always_inline unsigned long _compound_head(const struct page *page)
> >  {
> >  	unsigned long head = READ_ONCE(page->compound_head);
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 829112b0a914..3fbc00db607a 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -19,6 +19,33 @@
> >  #include <asm/tlbflush.h>
> >  #include "hugetlb_vmemmap.h"
> >
> > +/*
> > + * Return the real head page struct iff the @page is a fake head page, otherwise
> > + * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
> > + */
> > +const struct page *_page_fixed_fake_head(const struct page *page)
> > +{
> > +	/*
> > +	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> > +	 * struct page. The alignment check aims to avoid access the fields (
> > +	 * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
> > +	 * cold cacheline in some cases.
> > +	 */
> > +	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> > +	    test_bit(PG_head, &page->flags)) {
> > +		/*
> > +		 * We can safely access the field of the @page[1] with PG_head
> > +		 * because the @page is a compound page composed with at least
> > +		 * two contiguous pages.
> > +		 */
> > +		unsigned long head = READ_ONCE(page[1].compound_head);
> > +
> > +		if (likely(head & 1))
> > +			return (const struct page *)(head - 1);
> > +	}
> > +	return page;
> > +}
> > +
> >  /**
> >   * struct vmemmap_remap_walk - walk vmemmap page table
> >   *
> >
>
> The change resolves the regression (from -3% to +0.5%):
>

Thanks for testing. Would you mind benchmarking the change which merely
force-inlines _compound_head?
https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@xxxxxxxxxx/

> Please note:
> 9cb28da54643ad464c47585cd5866c30b0218e67 is the parent commit
> 3f16e4b516ef02d9461b7e0b6c50e05ba0811886 is the commit with the above patch
> c0bff412e67b781d761e330ff9578aa9ed2be79e is the commit which introduced
> the regression
>
> =========================================================================================
> tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
>
> lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2
>
> commit:
>   9cb28da54643ad464c47585cd5866c30b0218e67
>   3f16e4b516ef02d9461b7e0b6c50e05ba0811886
>   c0bff412e67b781d761e330ff9578aa9ed2be79e
>
> 9cb28da54643ad46 3f16e4b516ef02d9461b7e0b6c5 c0bff412e67b781d761e330ff95
> ---------------- --------------------------- ---------------------------
>        fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
>            |            |              |            |              |
>           3:3          0%            3:3          0%            3:3     stress-ng.clone.microsecs_per_clone.pass
>           3:3          0%            3:3          0%            3:3     stress-ng.clone.pass
>         %stddev     %change        %stddev     %change        %stddev
>             \          |              \          |              \
>          2904         -0.6%          2886        +3.7%          3011    stress-ng.clone.microsecs_per_clone
>        563520         +0.5%        566296        -3.1%        546122    stress-ng.clone.ops
>          9306         +0.5%          9356        -3.0%          9024    stress-ng.clone.ops_per_sec
>
> BTW, the change needs to export the symbol _page_fixed_fake_head,
> otherwise some modules hit a build error.
>

OK, I'll patch that up if this approach turns out to be the thing to do.

-- 
Mateusz Guzik <mjguzik gmail.com>
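For context on the inlining discussion in the thread: page_folio() is a thin
type-dispatch wrapper around _compound_head(), so the function-call overhead
being measured is paid on every page_folio() use. Below is a simplified sketch
of the relevant definitions, paraphrased from include/linux/page-flags.h of
roughly that era; exact code varies by kernel version, so treat it as
illustrative rather than authoritative:

/* Simplified: resolve a (possibly tail) page to its head page. */
static inline unsigned long _compound_head(const struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	/* Tail pages store a pointer to the head page with bit 0 set. */
	if (unlikely(head & 1))
		return head - 1;
	/*
	 * Head or order-0 page: may still be a "fake head" when the hugetlb
	 * vmemmap optimization is enabled. This call is what grows the
	 * function past the compiler's inlining threshold.
	 */
	return (unsigned long)page_fixed_fake_head(page);
}

/* page_folio() is just a type-preserving wrapper around _compound_head(). */
#define page_folio(p)	(_Generic((p),					\
	const struct page *:	(const struct folio *)_compound_head(p),\
	struct page *:		(struct folio *)_compound_head(p)))

With page_fixed_fake_head() force-inlined into it, _compound_head() ends up
large enough that GCC outlines it, which is why the patch in the thread moves
the fake-head slow path out of line and force-inlines _compound_head() itself.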
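On the export issue raised at the end: _page_fixed_fake_head() is now a real
out-of-line symbol in mm/hugetlb_vmemmap.c, but it is called from an inline
helper that modules instantiate, so it needs an export to link. A minimal
sketch of what the follow-up could look like; the hunk placement and the
choice of EXPORT_SYMBOL over EXPORT_SYMBOL_GPL are assumptions here, not the
posted fix:

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ ... @@ const struct page *_page_fixed_fake_head(const struct page *page)
 	}
 	return page;
 }
+/* Modules inline page_fixed_fake_head(), which calls this on the slow path. */
+EXPORT_SYMBOL(_page_fixed_fake_head);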