On Mon, Aug 12, 2024 at 10:12 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 12.08.24 06:49, Mateusz Guzik wrote:
> > On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
> >> Hi David,
> >>
> >> On 8/1/24 09:44, David Hildenbrand wrote:
> >>> On 01.08.24 15:37, Mateusz Guzik wrote:
> >>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@xxxxxxxxxx>
> >>>> wrote:
> >>>>>
> >>>>> On 01.08.24 15:30, Mateusz Guzik wrote:
> >>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> >>>>>>> Yes indeed. fork() can be extremely sensitive to each
> >>>>>>> added instruction.
> >>>>>>>
> >>>>>>> I even pointed out to Peter why I didn't add the
> >>>>>>> PageHuge check in there
> >>>>>>> originally [1].
> >>>>>>>
> >>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
> >>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
> >>>>>>>
> >>>>>>> We now have to do a page_folio(page) and then test for hugetlb.
> >>>>>>>
> >>>>>>> return folio_test_hugetlb(page_folio(page));
> >>>>>>>
> >>>>>>> Nowadays, folio_test_hugetlb() will be faster than at
> >>>>>>> c0bff412e6 times, so
> >>>>>>> maybe at least part of the overhead is gone.
> >>>>>>>
> >>>>>> I'll note page_folio expands to a call to _compound_head.
> >>>>>>
> >>>>>> While _compound_head is declared as an inline, it ends up being big
> >>>>>> enough that the compiler decides to emit a real function instead, and
> >>>>>> real func calls are not particularly cheap.
> >>>>>>
> >>>>>> I had a brief look with a profiler myself, and for single-threaded usage
> >>>>>> the func is quite high up there, while it manages to get out with the
> >>>>>> first branch -- that is to say, there is definitely performance lost to
> >>>>>> having a func call instead of an inlined branch.
> >>>>>>
> >>>>>> The routine is deinlined because of a call to page_fixed_fake_head,
> >>>>>> which itself is annotated with always_inline.
> >>>>>>
> >>>>>> This is of course patchable with minor shoveling.
> >>>>>>
> >>>>>> I did not go for it because stress-ng results were too unstable for me
> >>>>>> to confidently state win/loss.
> >>>>>>
> >>>>>> But should you want to whack the regression, this is what I would look
> >>>>>> into.
> >>>>>
> >>>>> This might improve it, at least for small folios I guess:
> >>
> >> Do you want us to test this change? Or do you have further optimization
> >> ongoing? Thanks.
> >
> > I verified the thing below boots, I have no idea about performance. If
> > it helps it can be massaged later from a style perspective.
>
> As quite a lot of setups already run with the vmemmap optimization enabled, I
> wonder how effective this would be (might need more fine tuning, did not look
> at the generated code):
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 085dd8dcbea2..7ddcdbd712ec 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -233,7 +233,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
>  	return page_fixed_fake_head(page) != page;
>  }
>
> -static inline unsigned long _compound_head(const struct page *page)
> +static __always_inline unsigned long _compound_head(const struct page *page)
>  {
>  	unsigned long head = READ_ONCE(page->compound_head);
>

Well, one may need to justify it with bloat-o-meter, which is why I did not
just straight up inline the entire thing.

But if you are up for fighting opposition of that sort, I agree this is the
patch to benchmark.

:)
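As an aside, here is a minimal userspace sketch of the tail-page encoding
that _compound_head's fast path tests. This is an illustration only, not
the kernel code: READ_ONCE and the page_fixed_fake_head handling for the
vmemmap optimization are omitted, and compound_head_sketch is a made-up
name.

#include <stdio.h>

/*
 * Toy stand-in for the kernel's struct page; only compound_head
 * matters here. Bit 0 of compound_head is set on tail pages, and
 * the remaining bits then encode the address of the head page.
 */
struct page {
	unsigned long compound_head;	/* bit 0 set => tail page */
};

/*
 * The common case (bit 0 clear: a head page or an order-0 page) is
 * one load plus one well-predicted branch. That is the work which
 * turns into a full call/ret sequence once the compiler emits
 * _compound_head out of line.
 */
static inline struct page *compound_head_sketch(struct page *page)
{
	unsigned long head = page->compound_head;

	if (head & 1)	/* tail page: strip the tag to get the head */
		return (struct page *)(head - 1);
	return page;	/* fast path: page is its own head */
}

int main(void)
{
	struct page head = { .compound_head = 0 };
	struct page tail = { .compound_head = (unsigned long)&head + 1 };

	printf("head -> itself: %d\n", compound_head_sketch(&head) == &head);
	printf("tail -> head:   %d\n", compound_head_sketch(&tail) == &head);
	return 0;
}

With that branch inlined at every page_folio() call site, the small-folio
case costs a load and a predictable branch rather than a function call,
and bloat-o-meter can then show whether the text growth is worth it.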
--
Mateusz Guzik <mjguzik gmail.com>