Re: [linux-next:master] [memcg] 01d37228d3: netperf.Throughput_Mbps 37.9% regression

On Mon, Mar 10, 2025 at 10:55 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 3/10/25 06:50, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a 37.9% regression of netperf.Throughput_Mbps on:
>
> I assume this is some network receive context where gfpflags do not allow
> blocking.

gfpflags_allow_spinning() should be true for all current callers
including networking.

> > commit: 01d37228d331047a0bbbd1026cec2ccabef6d88d ("memcg: Use trylock to access memcg stock_lock.")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
> > [test failed on linux-next/master 7ec162622e66a4ff886f8f28712ea1b13069e1aa]
> >
> > testcase: netperf
> > config: x86_64-rhel-9.4
> > compiler: gcc-12
> > test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> > parameters:
> >
> >       ip: ipv4
> >       runtime: 300s
> >       nr_threads: 50%
> >       cluster: cs-localhost
> >       test: TCP_MAERTS
> >       cpufreq_governor: performance
> >
> >
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | stress-ng: stress-ng.mmapfork.ops_per_sec  63.5% regression                                        |
>
> Hm, interesting, this one at least judging from the name would be a GFP_KERNEL context?

weird indeed.

> > | test machine     | 256 threads 4 sockets INTEL(R) XEON(R) PLATINUM 8592+ (Emerald Rapids) with 256G memory            |
> > | test parameters  | cpufreq_governor=performance                                                                       |
> > |                  | nr_threads=100%                                                                                    |
> > |                  | test=mmapfork                                                                                      |
> > |                  | testtime=60s                                                                                       |
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | hackbench: hackbench.throughput  26.6% regression                                                  |
> > | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory         |
> > | test parameters  | cpufreq_governor=performance                                                                       |
> > |                  | ipc=socket                                                                                         |
> > |                  | iterations=4                                                                                       |
> > |                  | mode=threads                                                                                       |
> > |                  | nr_threads=100%                                                                                    |
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec  33.0% regression                               |
> > | test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory        |
> > | test parameters  | cpufreq_governor=performance                                                                       |
> > |                  | mode=development                                                                                   |
> > |                  | nr_threads=100%                                                                                    |
> > |                  | test=TCP                                                                                           |
> > |                  | test_memory_size=50%                                                                               |
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | vm-scalability: vm-scalability.throughput  86.8% regression                                        |
> > | test machine     | 256 threads 4 sockets INTEL(R) XEON(R) PLATINUM 8592+ (Emerald Rapids) with 256G memory            |
> > | test parameters  | cpufreq_governor=performance                                                                       |
> > |                  | runtime=300s                                                                                       |
> > |                  | size=1T                                                                                            |
> > |                  | test=lru-shm                                                                                       |
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | netperf: netperf.Throughput_Mbps 39.9% improvement                                                 |
>
> An improvement? Weird.

Even weirder, and it makes no sense to me so far.

>
> > | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory         |
> > | test parameters  | cluster=cs-localhost                                                                               |
> > |                  | cpufreq_governor=performance                                                                       |
> > |                  | ip=ipv4                                                                                            |
> > |                  | nr_threads=200%                                                                                    |
> > |                  | runtime=300s                                                                                       |
> > |                  | test=TCP_MAERTS                                                                                    |
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | will-it-scale: will-it-scale.per_thread_ops  68.8% regression                                      |
> > | test machine     | 104 threads 2 sockets (Skylake) with 192G memory                                                   |
> > | test parameters  | cpufreq_governor=performance                                                                       |
> > |                  | mode=thread                                                                                        |
> > |                  | nr_task=100%                                                                                       |
> > |                  | test=fallocate1                                                                                    |
> > +------------------+----------------------------------------------------------------------------------------------------+
>
> Some of those as well.
>
> Anyway, we should not expect localtry_trylock_irqsave() itself to fail and
> result in a slow path, as that would require an allocation attempt from an
> NMI. So what else does the commit do?
>
> >       0.10 ±  4%     +11.3       11.43 ±  3%  perf-profile.self.cycles-pp.try_charge_memcg
> >       0.00           +13.7       13.72 ±  2%  perf-profile.self.cycles-pp.page_counter_try_charge
>
> This does suggest more time spent in try_charge_memcg() because consume_stock() has failed.
>
> And I suspect this:
>
> +       if (!gfpflags_allow_spinning(gfp_mask))
> +               /* Avoid the refill and flush of the older stock */
> +               batch = nr_pages;
>
> because this will affect the refill even when consume_stock() fails not due
> to a trylock failure (which should not be happening), but simply because the
> stock belonged to a different memcg or was depleted. So in the nowait context
> we deny the refill even if we have the memory. The attached patch could be
> used to see if it fixes things. I'm not sure about the testcases where it
> doesn't look like a nowait context would be used, though; let's see.

Not quite.
GFP_NOWAIT includes __GFP_KSWAPD_RECLAIM,
so gfpflags_allow_spinning() will return true.

So 'batch' won't change.
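
For reference, the helper boils down to roughly this (my paraphrase from
memory, not the exact upstream source):

static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
{
	/* __GFP_RECLAIM is __GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM */
	return !!(gfp_flags & __GFP_RECLAIM);
}

so anything carrying a reclaim bit (GFP_NOWAIT, GFP_ATOMIC, GFP_KERNEL,
including the networking callers) keeps taking the normal spinning path.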

> I've also found this:
> https://lore.kernel.org/all/7s6fbpwsynadnzybhdqg3jwhls4pq2sptyxuyghxpaufhissj5@iadb6ibzscjj/

Right. And notice that Shakeel's suggestion doesn't include '!' in
the condition. I assumed it was a typo, hence I added it as
"if (!gfpflags_allow_spinning(gfp_mask))".

> >
> > BTW after the done_restock tag in try_charge_memcg(), we will need another
> > gfpflags_allow_spinning() check to avoid schedule_work() and
> > mem_cgroup_handle_over_high(). Maybe simply return early for
> > gfpflags_allow_spinning() without checking high marks.
>
> looks like a small possible optimization that was forgotten?
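
If I read that suggestion right, it amounts to something like this rough,
untested sketch (a paraphrase of the done_restock area, not the exact code):

done_restock:
	if (batch > nr_pages)
		refill_stock(memcg, batch - nr_pages);

	/*
	 * Hypothetical early exit: when spinning isn't allowed we cannot
	 * schedule_work() or do mem_cgroup_handle_over_high() anyway, so
	 * skip the high-mark walk entirely.
	 */
	if (!gfpflags_allow_spinning(gfp_mask))
		return 0;

	/* existing high-mark checks continue below */
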
>  ----8<----
> From 29e7d18645577ce13d8a0140c0df050ce1ce0f95 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@xxxxxxx>
> Date: Mon, 10 Mar 2025 10:32:14 +0100
> Subject: [PATCH] memcg: Avoid stock refill only if stock_lock can't be
>  acquired
>
> Since commit 01d37228d331 ("memcg: Use trylock to access memcg
> stock_lock.") consume_stock() can fail if it can't obtain
> memcg_stock.stock_lock. In that case try_charge_memcg() also avoids
> refilling or flushing the stock when gfp flags indicate we are in the
> context where obtaining the lock could fail.
>
> However consume_stock() can also fail because the stock was depleted, or
> belonged to a different memcg. Avoiding the stock refill then reduces
> the caching efficiency, as the refill could still succeed with memory
> available. This has caused various regressions to be reported by the
> kernel test robot.
>
> To fix this, make the decision to avoid stock refill more precise by
> making consume_stock() return -EBUSY when it fails to obtain stock_lock,
> and using that for the no-refill decision.
>
> Fixes: 01d37228d331 ("memcg: Use trylock to access memcg stock_lock.")
> Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> Closes: https://lore.kernel.org/oe-lkp/202503101254.cfd454df-lkp@xxxxxxxxx
>
> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> ---
>  mm/memcontrol.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 092cab99dec7..a8371a22c7f4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1772,22 +1772,23 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>   * stock, and at least @nr_pages are available in that stock.  Failure to
>   * service an allocation will refill the stock.
>   *
> - * returns true if successful, false otherwise.
> + * returns 0 if successful, -EBUSY if lock cannot be acquired, or -ENOMEM
> + * if the memcg does not match or there are not enough pages
>   */
> -static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> +static int consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
>                           gfp_t gfp_mask)
>  {
>         struct memcg_stock_pcp *stock;
>         unsigned int stock_pages;
>         unsigned long flags;
> -       bool ret = false;
> +       int ret = -ENOMEM;
>
>         if (nr_pages > MEMCG_CHARGE_BATCH)
>                 return ret;
>
>         if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
>                 if (!gfpflags_allow_spinning(gfp_mask))
> -                       return ret;
> +                       return -EBUSY;
>                 localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
>         }
>
> @@ -1795,7 +1796,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
>         stock_pages = READ_ONCE(stock->nr_pages);
>         if (memcg == READ_ONCE(stock->cached) && stock_pages >= nr_pages) {
>                 WRITE_ONCE(stock->nr_pages, stock_pages - nr_pages);
> -               ret = true;
> +               ret = 0;
>         }
>
>         localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
> @@ -2228,13 +2229,18 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>         bool drained = false;
>         bool raised_max_event = false;
>         unsigned long pflags;
> +       int consume_ret;
>
>  retry:
> -       if (consume_stock(memcg, nr_pages, gfp_mask))
> +       consume_ret = consume_stock(memcg, nr_pages, gfp_mask);
> +       if (!consume_ret)
>                 return 0;
>
> -       if (!gfpflags_allow_spinning(gfp_mask))
> -               /* Avoid the refill and flush of the older stock */
> +       /*
> +        * Avoid the refill and flush of the older stock if we failed to acquire
> +        * the stock_lock
> +        */
> +       if (consume_ret == -EBUSY)
>                 batch = nr_pages;

Sure. I think it's a good optimization, but I don't think it will make
any difference here.
The Fixes tag is not appropriate, and the commit log is off too.

I have a strong suspicion that this bot report is bogus.

I'll try to repro anyway.




