Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression

On Tue, Aug 13, 2024 at 9:09 AM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
> On 8/12/24 00:49, Mateusz Guzik wrote:
> > On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
> >> Hi David,
> >>
> >> On 8/1/24 09:44, David Hildenbrand wrote:
> >>> On 01.08.24 15:37, Mateusz Guzik wrote:
> >>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@xxxxxxxxxx>
> >>>> wrote:
> >>>>>
> >>>>> On 01.08.24 15:30, Mateusz Guzik wrote:
> >>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> >>>>>>> Yes indeed. fork() can be extremely sensitive to each
> >>>>>>> added instruction.
> >>>>>>>
> >>>>>>> I even pointed out to Peter why I didn't add the
> >>>>>>> PageHuge check in there
> >>>>>>> originally [1].
> >>>>>>>
> >>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
> >>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
> >>>>>>>
> >>>>>>>
> >>>>>>> We now have to do a page_folio(page) and then test for hugetlb.
> >>>>>>>
> >>>>>>>        return folio_test_hugetlb(page_folio(page));
> >>>>>>>
> >>>>>>> Nowadays, folio_test_hugetlb() will be faster than it was at the
> >>>>>>> time of c0bff412e6, so maybe at least part of the overhead is gone.
> >>>>>>>
> >>>>>>
> >>>>>> I'll note page_folio expands to a call to _compound_head.
> >>>>>>
> >>>>>> While _compound_head is declared as an inline, it ends up being big
> >>>>>> enough that the compiler decides to emit a real function instead, and
> >>>>>> real func calls are not particularly cheap.
> >>>>>>
> >>>>>> I had a brief look with a profiler myself and for single-threaded usage
> >>>>>> the func is quite high up there, while it manages to get out with the
> >>>>>> first branch -- that is to say there is definitely performance lost for
> >>>>>> having a func call instead of an inlined branch.
> >>>>>>
> >>>>>> The routine is deinlined because of a call to page_fixed_fake_head,
> >>>>>> which itself is annotated with always_inline.
> >>>>>>
> >>>>>> This is of course patchable with minor shoveling.
> >>>>>>
> >>>>>> I did not go for it because stress-ng results were too unstable for me
> >>>>>> to confidently state win/loss.
> >>>>>>
> >>>>>> But should you want to whack the regression, this is what I would look
> >>>>>> into.
> >>>>>>
> >>>>>
> >>>>> This might improve it, at least for small folios I guess:
> >> Do you want us to test this change? Or do you have further optimization
> >> ongoing? Thanks.
> >
> > I verified that the thing below boots; I have no idea about performance. If
> > it helps, it can be massaged later from a style perspective.
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 5769fe6e4950..2d5d61ab385b 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -194,34 +194,13 @@ enum pageflags {
> >   #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> >   DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
> >
> > -/*
> > - * Return the real head page struct iff the @page is a fake head page, otherwise
> > - * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
> > - */
> > +const struct page *_page_fixed_fake_head(const struct page *page);
> > +
> >   static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
> >   {
> >       if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> >               return page;
> > -
> > -     /*
> > -      * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> > -      * struct page. The alignment check aims to avoid access the fields (
> > -      * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
> > -      * cold cacheline in some cases.
> > -      */
> > -     if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> > -         test_bit(PG_head, &page->flags)) {
> > -             /*
> > -              * We can safely access the field of the @page[1] with PG_head
> > -              * because the @page is a compound page composed with at least
> > -              * two contiguous pages.
> > -              */
> > -             unsigned long head = READ_ONCE(page[1].compound_head);
> > -
> > -             if (likely(head & 1))
> > -                     return (const struct page *)(head - 1);
> > -     }
> > -     return page;
> > +     return _page_fixed_fake_head(page);
> >   }
> >   #else
> >   static inline const struct page *page_fixed_fake_head(const struct page *page)
> > @@ -235,7 +214,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
> >       return page_fixed_fake_head(page) != page;
> >   }
> >
> > -static inline unsigned long _compound_head(const struct page *page)
> > +static __always_inline unsigned long _compound_head(const struct page *page)
> >   {
> >       unsigned long head = READ_ONCE(page->compound_head);
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 829112b0a914..3fbc00db607a 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -19,6 +19,33 @@
> >   #include <asm/tlbflush.h>
> >   #include "hugetlb_vmemmap.h"
> >
> > +/*
> > + * Return the real head page struct iff the @page is a fake head page, otherwise
> > + * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
> > + */
> > +const struct page *_page_fixed_fake_head(const struct page *page)
> > +{
> > +     /*
> > +      * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> > +      * struct page. The alignment check aims to avoid access the fields (
> > +      * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
> > +      * cold cacheline in some cases.
> > +      */
> > +     if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> > +         test_bit(PG_head, &page->flags)) {
> > +             /*
> > +              * We can safely access the field of the @page[1] with PG_head
> > +              * because the @page is a compound page composed with at least
> > +              * two contiguous pages.
> > +              */
> > +             unsigned long head = READ_ONCE(page[1].compound_head);
> > +
> > +             if (likely(head & 1))
> > +                     return (const struct page *)(head - 1);
> > +     }
> > +     return page;
> > +}
> > +
> >   /**
> >    * struct vmemmap_remap_walk - walk vmemmap page table
> >    *
> >
>
> The change resolves the regression (from -3% to +0.5%):
>

thanks for testing

would you mind benchmarking the change which merely force-inlines _compound_head?

https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@xxxxxxxxxx/
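
for reference, that's essentially just the page-flags.h hunk from the diff
above -- flipping the annotation on _compound_head without moving
_page_fixed_fake_head out of line. a rough sketch of what I mean (from
memory, untested; the link above has the actual patch):

-static inline unsigned long _compound_head(const struct page *page)
+static __always_inline unsigned long _compound_head(const struct page *page)
 {
	unsigned long head = READ_ONCE(page->compound_head);

	if (unlikely(head & 1))
		return head - 1;
	return (unsigned long)page_fixed_fake_head(page);
 }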

> Please note:
>    9cb28da54643ad464c47585cd5866c30b0218e67 is the parent commit
>    3f16e4b516ef02d9461b7e0b6c50e05ba0811886 is the commit with above
>                                             patch
>    c0bff412e67b781d761e330ff9578aa9ed2be79e is the commit which
>                                             introduced regression
>
>
> =========================================================================================
> tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
>
> lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2
>
> commit:
>    9cb28da54643ad464c47585cd5866c30b0218e67
>    3f16e4b516ef02d9461b7e0b6c50e05ba0811886
>    c0bff412e67b781d761e330ff9578aa9ed2be79e
>
> 9cb28da54643ad46 3f16e4b516ef02d9461b7e0b6c5 c0bff412e67b781d761e330ff95
> ---------------- --------------------------- ---------------------------
>         fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
>             |             |             |             |             |
>            3:3            0%           3:3            0%           3:3     stress-ng.clone.microsecs_per_clone.pass
>            3:3            0%           3:3            0%           3:3     stress-ng.clone.pass
>           %stddev     %change         %stddev     %change         %stddev
>               \          |                \          |                \
>        2904            -0.6%       2886            +3.7%       3011        stress-ng.clone.microsecs_per_clone
>      563520            +0.5%     566296            -3.1%     546122        stress-ng.clone.ops
>        9306            +0.5%       9356            -3.0%       9024        stress-ng.clone.ops_per_sec
>
>
> BTW, the change needs to export the symbol _page_fixed_fake_head, otherwise
> some modules hit build errors.
>

ok, I'll patch that up if this approach turns out to be the way to go
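
presumably just a matter of tacking an export after the definition in
mm/hugetlb_vmemmap.c, roughly (untested; whether EXPORT_SYMBOL or
EXPORT_SYMBOL_GPL is the right call there is a separate question):

 	return page;
 }
+EXPORT_SYMBOL(_page_fixed_fake_head);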

-- 
Mateusz Guzik <mjguzik gmail.com>




