Re: [PATCH] mm/page_alloc: make sure __rmqueue() etc. always inline

"Lu, Aaron" <aaron.lu@xxxxxxxxx> · Wed, 18 Oct 2017 01:53:47 +0000

On Tue, 2017-10-17 at 13:32 +0200, Vlastimil Babka wrote:
> On 10/13/2017 08:31 AM, Aaron Lu wrote:
> > __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
> > __rmqueue_cma_fallback() are all in page allocator's hot path and
> > better be finished as soon as possible. One way to make them faster
> > is by making them inline. But as Andrew Morton and Andi Kleen pointed
> > out:
> > https://lkml.org/lkml/2017/10/10/1252
> > https://lkml.org/lkml/2017/10/10/1279
> > To make sure they are inlined, we should use __always_inline for them.
> > 
> > With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
> > processes to stress buddy, the results for will-it-scale.processes with
> > and without the patch are:
> > 
> > On a 2-socket Intel Skylake machine:
> > 
> >  compiler          base        head
> > gcc-4.4.7       6496131     6911823 +6.4%
> > gcc-4.9.4       7225110     7731072 +7.0%
> > gcc-5.4.1       7054224     7688146 +9.0%
> > gcc-6.2.0       7059794     7651675 +8.4%
> > 
> > On a 4-socket Intel Skylake machine:
> > 
> >  compiler          base        head
> > gcc-4.4.7      13162890    13508193 +2.6%
> > gcc-4.9.4      14997463    15484353 +3.2%
> > gcc-5.4.1      14708711    15449805 +5.0%
> > gcc-6.2.0      14574099    15349204 +5.3%
> > 
> > The above 4 compilers are used because I've done the tests through Intel's
> > Linux Kernel Performance (LKP) infrastructure and they are the available
> > compilers there.
> > 
> > The benefit is smaller on the 4-socket machine because the lock contention
> > there (perf-profile/native_queued_spin_lock_slowpath=81%) is less severe
> > than on the 2-socket machine (85%).
> > 
> > What the benchmark does is: it forks nr_cpu processes and then each
> > process does the following:
> >     1 mmap() 128M anonymous space;
> >     2 writes to each page there to trigger actual page allocation;
> >     3 munmap() it.
> > in a loop.
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> 
> Are transparent hugepages enabled? If yes, __rmqueue() is called from
> rmqueue(), and there's only one page fault (and __rmqueue()) per 512
> "writes to each page". If not, __rmqueue() is called from rmqueue_bulk()
> in bursts once pcplists are depleted. I guess it's the latter, otherwise
> I wouldn't expect a function call to have such visible overhead.

THP is disabled. I should have mentioned this in the changelog, sorry
about that.
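
For reference, one iteration of the per-process loop described in the
changelog can be sketched roughly as below. This is a minimal user-space
sketch, not the will-it-scale source: fault_in_region() is a hypothetical
name, and the region size is left to the caller (the benchmark uses 128M).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map an anonymous region, write once to every page
 * so that each page is actually allocated, then unmap it.
 * Returns the number of pages touched, or -1 if mmap() fails.
 */
static long fault_in_region(size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long touched = 0;
	char *buf;

	assert(page_size > 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* With THP disabled, the first write to each base page faults it in. */
	for (size_t off = 0; off < len; off += (size_t)page_size) {
		buf[off] = 0;
		touched++;
	}

	munmap(buf, len);
	return touched;
}
```

Each process would call something like this in a loop, so nearly all of the
time goes into page faults and the allocations behind them.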

> 
> I guess what would help much more would be a bulk __rmqueue_smallest()
> to grab multiple pages from the freelists. But can't argue with your

Do I understand correctly that you are suggesting a bulk version of
__rmqueue_smallest(), say __rmqueue_smallest_bulk()? With that, instead
of looping pcp->batch times in rmqueue_bulk(), a single call to
__rmqueue_smallest_bulk() would be enough, and __rmqueue_smallest_bulk()
would loop pcp->batch times to get those pages?

Then it feels like __rmqueue_smallest_bulk() has become rmqueue_bulk(),
or am I missing something?
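
To make the question concrete, here is a user-space toy model. It is not
kernel code: the struct and every toy_* name are made up for illustration,
and it only covers the case where a bulk variant does nothing more than
move the loop one level down.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a free list: a stack of page ids. */
struct toy_freelist {
	int pages[64];
	size_t nr_free;
};

/* Models __rmqueue_smallest(): take one page, or -1 if the list is empty. */
static int toy_rmqueue_one(struct toy_freelist *fl)
{
	if (fl->nr_free == 0)
		return -1;
	return fl->pages[--fl->nr_free];
}

/* Models the loop in rmqueue_bulk(): one helper call per page. */
static size_t toy_rmqueue_bulk(struct toy_freelist *fl, size_t count, int *out)
{
	size_t i;

	assert(fl && out);
	for (i = 0; i < count; i++) {
		int page = toy_rmqueue_one(fl);

		if (page < 0)
			break;
		out[i] = page;
	}
	return i;
}

/* Hypothetical __rmqueue_smallest_bulk(): the same loop, one level down. */
static size_t toy_rmqueue_smallest_bulk(struct toy_freelist *fl, size_t count,
					int *out)
{
	size_t taken = 0;

	assert(fl && out);
	while (taken < count && fl->nr_free > 0)
		out[taken++] = fl->pages[--fl->nr_free];
	return taken;
}
```

In this model both paths hand out the same pages in the same order, so a
bulk variant would only pay off if it did something smarter than looping.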

> numbers against this patch.
> 
> > Binary size wise, I have locally built them with different compilers:
> > 
> > [aaron@aaronlu obj]$ size */*/mm/page_alloc.o
> >    text    data     bss     dec     hex filename
> >   37409    9904    8524   55837    da1d gcc-4.9.4/base/mm/page_alloc.o
> >   38273    9904    8524   56701    dd7d gcc-4.9.4/head/mm/page_alloc.o
> >   37465    9840    8428   55733    d9b5 gcc-5.5.0/base/mm/page_alloc.o
> >   38169    9840    8428   56437    dc75 gcc-5.5.0/head/mm/page_alloc.o
> >   37573    9840    8428   55841    da21 gcc-6.4.0/base/mm/page_alloc.o
> >   38261    9840    8428   56529    dcd1 gcc-6.4.0/head/mm/page_alloc.o
> >   36863    9840    8428   55131    d75b gcc-7.2.0/base/mm/page_alloc.o
> >   37711    9840    8428   55979    daab gcc-7.2.0/head/mm/page_alloc.o
> > 
> > Text size increased about 800 bytes for mm/page_alloc.o.
> 
> BTW, do you know about ./scripts/bloat-o-meter? :)

NO!!! Thanks for bringing this up :)

> With gcc 7.2.1:
> > ./scripts/bloat-o-meter base.o mm/page_alloc.o
> 
> add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844)

Nice, it clearly shows 844 bytes of bloat.

> function                                     old     new   delta
> get_page_from_freelist                      2898    4937   +2039
> steal_suitable_fallback                        -     365    +365
> find_suitable_fallback                        31     120     +89
> find_suitable_fallback.part                  115       -    -115
> __rmqueue                                   1534       -   -1534
> 
> 
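
As a sanity check, the per-function deltas in the table above do add up to
the reported net growth. The sketch below only redoes that arithmetic; it
says nothing about how the script itself works, and the struct and function
names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct size_delta {
	const char *name;
	long old_size;	/* 0 stands for "-" (symbol absent) */
	long new_size;
};

/* Rows copied from the bloat-o-meter output quoted above. */
static const struct size_delta deltas[] = {
	{ "get_page_from_freelist",      2898, 4937 },
	{ "steal_suitable_fallback",        0,  365 },
	{ "find_suitable_fallback",        31,  120 },
	{ "find_suitable_fallback.part",  115,    0 },
	{ "__rmqueue",                   1534,    0 },
};

/* Sum of (new - old) over all rows, i.e. the net text growth in bytes. */
static long net_growth(const struct size_delta *d, size_t n)
{
	long net = 0;

	assert(d != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		net += d[i].new_size - d[i].old_size;
	return net;
}
```

Growth of 2039 + 365 + 89 = 2493 minus shrinkage of 115 + 1534 = 1649 gives
the 844 bytes on the summary line.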
> > [aaron@aaronlu obj]$ size */*/vmlinux
> >    text    data     bss     dec       hex     filename
> > 10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/base/vmlinux
> > 10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/head/vmlinux
> > 10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/base/vmlinux
> > 10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/head/vmlinux
> > 10094546   5836696 17715200 33646442  201676a gcc-6.4.0/base/vmlinux
> > 10094546   5836696 17715200 33646442  201676a gcc-6.4.0/head/vmlinux
> > 10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/base/vmlinux
> > 10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/head/vmlinux
> > 
> > Text size for vmlinux has no change though, probably due to function
> > alignment.
> 
> Yep that's useless to show. These differences do add up though, until
> they eventually cross the alignment boundary.

Agreed.
But you know, this is the hot path, so the performance improvement might be
worth it.
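
For anyone skimming the thread, the change amounts to the pattern below.
This is a user-space sketch, not the patch itself: the toy_* functions are
hypothetical stand-ins for the allocator helpers, and the macro is redefined
locally (guarded, in the same form the kernel uses for GCC) so that it
builds outside the kernel tree.

```c
#include <assert.h>

/* inline plus GCC's always_inline attribute; guarded in case libc has it. */
#ifndef __always_inline
#define __always_inline inline __attribute__((__always_inline__))
#endif

/* Hypothetical hot-path helper standing in for __rmqueue_smallest(). */
static __always_inline int toy_take_smallest(int *free_count)
{
	if (*free_count == 0)
		return -1;
	return --(*free_count);
}

/*
 * Hypothetical caller standing in for rmqueue_bulk(). With __always_inline
 * the helper body is expanded here even when the compiler's own heuristics
 * would have emitted a call for a plain "static inline".
 */
static int toy_take_many(int *free_count, int count)
{
	int taken = 0;

	assert(count >= 0);
	while (taken < count && toy_take_smallest(free_count) >= 0)
		taken++;
	return taken;
}
```

The cost is the text growth discussed above, since the helper's body is
duplicated into every caller.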