Re: [PATCH] mm, slub: prefetch freelist in ___slab_alloc()

On 2024/8/19 17:33, Hyeonggon Yoo wrote:
On Mon, Aug 19, 2024 at 4:02 PM Yongqiang Liu <liuyongqiang13@xxxxxxxxxx> wrote:
commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
slab_alloc()") introduced prefetch_freepointer() for fastpath
allocation. Using it when the freelist is first loaded gives a small
improvement in some workloads. Here are hackbench results on an
arm64 machine (about 3.8%):

Before:
   average time cost of 'hackbench -g 100 -l 1000': 17.068

After:
   average time cost of 'hackbench -g 100 -l 1000': 16.416

There is also about a 5% improvement on an x86_64 machine for
hackbench.
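
For reference, the helper from commit 0ad9500e16fe is essentially the
following (a simplified sketch of mm/slub.c; the exact form has varied a
little across kernel versions):

static void prefetch_freepointer(const struct kmem_cache *s, void *object)
{
	/*
	 * Prefetch, with intent to write, the cache line holding the next
	 * object's freelist pointer, so the later get_freepointer() and
	 * c->freelist update are less likely to miss in the L1 data cache.
	 */
	prefetchw(object + s->offset);
}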
I think adding more prefetches might not be a good idea unless we have
more real-world data supporting it, because prefetching helps when the
slab is frequently used, but it ends up unnecessarily pulling in more
cache lines when the slab is not frequently used.

Yes, prefetching unnecessary objects is a bad idea. But when an allocation
has entered the slowpath, it is more likely that more objects will be
needed soon. I've tested the cases from commit 0ad9500e16fe ("slub:
prefetch next freelist pointer in slab_alloc()"). Here is the result:

Before:

Performance counter stats for './hackbench 50 process 4000' (32 runs):

           2545.28 msec task-clock             #    6.938 CPUs utilized       ( +-  1.75% )
               6166      context-switches      #    0.002 M/sec               ( +-  1.58% )
               1129      cpu-migrations        #    0.444 K/sec               ( +-  2.16% )
              13298      page-faults           #    0.005 M/sec               ( +-  0.38% )
         4435113150      cycles                #    1.742 GHz                 ( +-  1.22% )
         2259717630      instructions          #    0.51  insn per cycle      ( +-  0.05% )
          385847392      branches              #  151.593 M/sec               ( +-  0.06% )
            6205369      branch-misses         #    1.61% of all branches     ( +-  0.56% )

           0.36688 +- 0.00595 seconds time elapsed  ( +-  1.62% )

After:

 Performance counter stats for './hackbench 50 process 4000' (32 runs):

           2277.61 msec task-clock             #    6.855 CPUs utilized       ( +-  0.98% )
               5653      context-switches      #    0.002 M/sec               ( +-  1.62% )
               1081      cpu-migrations        #    0.475 K/sec               ( +-  1.89% )
              13217      page-faults           #    0.006 M/sec               ( +-  0.48% )
         3751509945      cycles                #    1.647 GHz                 ( +-  1.14% )
         2253177626      instructions          #    0.60  insn per cycle      ( +-  0.06% )
          384509166      branches              #  168.821 M/sec               ( +-  0.07% )
            6045031      branch-misses         #    1.57% of all branches     ( +-  0.58% )

           0.33225 +- 0.00321 seconds time elapsed  ( +-  0.97% )


Also I don't understand how adding a prefetch in the slowpath affects
performance, because most allocs/frees should be done in the fastpath.
Could you please explain?

By adding some debug info I counted the slowpath allocations for hackbench:

'hackbench -g 100 -l 1000' slab alloc total: 80416886, slowpath: 7184236.

That is about 9% of all allocations going through the slowpath (a rough
sketch of such a counter is shown after the perf numbers below). The perf
stats on arm64 are as follows:

Before:
 Performance counter stats for './hackbench -g 100 -l 1000' (32 runs):

        34766611220      branches                                                 ( +-  0.01% )
          382593804      branch-misses          #  1.10% of all branches          ( +-  0.14% )
         1120091414      cache-misses                                             ( +-  0.08% )
        76810485402      L1-dcache-loads                                          ( +-  0.03% )
         1120091414      L1-dcache-load-misses  #  1.46% of all L1-dcache hits    ( +-  0.08% )

           23.8854 +- 0.0804 seconds time elapsed  ( +-  0.34% )

After:
 Performance counter stats for './hackbench -g 100 -l 1000' (32 runs):

        34812735277      branches                                                 ( +-  0.01% )
          393449644      branch-misses          #  1.13% of all branches          ( +-  0.15% )
         1095185949      cache-misses                                             ( +-  0.15% )
        76995789602      L1-dcache-loads                                          ( +-  0.03% )
         1095185949      L1-dcache-load-misses  #  1.42% of all L1-dcache hits    ( +-  0.15% )

            23.341 +- 0.104 seconds time elapsed  ( +-  0.45% )

It seems there are fewer L1-dcache-load-misses.
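
The counting above was ad-hoc debug instrumentation rather than part of the
patch. A minimal sketch of one way to do it, assuming hypothetical counter
names (debug_alloc_total, debug_alloc_slowpath) and hooks placed by hand in
slab_alloc_node() and ___slab_alloc():

#include <linux/atomic.h>

/* Hypothetical debug counters, not part of the posted patch. */
static atomic_long_t debug_alloc_total = ATOMIC_LONG_INIT(0);
static atomic_long_t debug_alloc_slowpath = ATOMIC_LONG_INIT(0);

/* Call near the top of slab_alloc_node() to count every allocation. */
static inline void debug_count_alloc(void)
{
	atomic_long_inc(&debug_alloc_total);
}

/* Call at the entry of ___slab_alloc() to count slowpath allocations. */
static inline void debug_count_alloc_slowpath(void)
{
	atomic_long_inc(&debug_alloc_slowpath);
}

The two counters can then be read back (e.g. via a debugfs file or a
printk on exit) to compute the slowpath ratio.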


Signed-off-by: Yongqiang Liu <liuyongqiang13@xxxxxxxxxx>
---
  mm/slub.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/mm/slub.c b/mm/slub.c
index c9d8a2497fd6..f9daaff10c6a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3630,6 +3630,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
         VM_BUG_ON(!c->slab->frozen);
         c->freelist = get_freepointer(s, freelist);
         c->tid = next_tid(c->tid);
+       prefetch_freepointer(s, c->freelist);
         local_unlock_irqrestore(&s->cpu_slab->lock, flags);
         return freelist;

--
2.25.1




