On Mon, Mar 10, 2025 at 10:55 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 3/10/25 06:50, kernel test robot wrote:
> >
> > Hello,
> >
> > kernel test robot noticed a 37.9% regression of netperf.Throughput_Mbps on:
>
> I assume this is some network receive context where gfpflags do not allow
> blocking.

gfpflags_allow_spinning() should be true for all current callers
including networking.
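
For reference, as I read it in include/linux/gfp.h it boils down to roughly
this (paraphrased, not a verbatim quote):

	static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
	{
		/*
		 * All GFP_* contexts in use today, including GFP_NOWAIT and
		 * GFP_ATOMIC, set __GFP_KSWAPD_RECLAIM and/or
		 * __GFP_DIRECT_RECLAIM. Only try_alloc_pages()-style
		 * allocations clear both bits, so only those are told not
		 * to spin on locks.
		 */
		return !!(gfp_flags & __GFP_RECLAIM);
	}

so the GFP_ATOMIC/GFP_NOWAIT allocations on the networking side still take
the spinning path.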
> > commit: 01d37228d331047a0bbbd1026cec2ccabef6d88d ("memcg: Use trylock to access memcg stock_lock.")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
> > [test failed on linux-next/master 7ec162622e66a4ff886f8f28712ea1b13069e1aa]
> >
> > testcase: netperf
> > config: x86_64-rhel-9.4
> > compiler: gcc-12
> > test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> > parameters:
> >
> >       ip: ipv4
> >       runtime: 300s
> >       nr_threads: 50%
> >       cluster: cs-localhost
> >       test: TCP_MAERTS
> >       cpufreq_governor: performance
> >
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | stress-ng: stress-ng.mmapfork.ops_per_sec 63.5% regression |
>
> Hm interesting, this one at least from the name would be a GFP_KERNEL context?

weird indeed.

> > | test machine     | 256 threads 4 sockets INTEL(R) XEON(R) PLATINUM 8592+ (Emerald Rapids) with 256G memory |
> > | test parameters  | cpufreq_governor=performance |
> > |                  | nr_threads=100% |
> > |                  | test=mmapfork |
> > |                  | testtime=60s |
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | hackbench: hackbench.throughput 26.6% regression |
> > | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory |
> > | test parameters  | cpufreq_governor=performance |
> > |                  | ipc=socket |
> > |                  | iterations=4 |
> > |                  | mode=threads |
> > |                  | nr_threads=100% |
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 33.0% regression |
> > | test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory |
> > | test parameters  | cpufreq_governor=performance |
> > |                  | mode=development |
> > |                  | nr_threads=100% |
> > |                  | test=TCP |
> > |                  | test_memory_size=50% |
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | vm-scalability: vm-scalability.throughput 86.8% regression |
> > | test machine     | 256 threads 4 sockets INTEL(R) XEON(R) PLATINUM 8592+ (Emerald Rapids) with 256G memory |
> > | test parameters  | cpufreq_governor=performance |
> > |                  | runtime=300s |
> > |                  | size=1T |
> > |                  | test=lru-shm |
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | netperf: netperf.Throughput_Mbps 39.9% improvement |
>
> An improvement? Weird.

Even more weird and makes no sense to me so far.

> > | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory |
> > | test parameters  | cluster=cs-localhost |
> > |                  | cpufreq_governor=performance |
> > |                  | ip=ipv4 |
> > |                  | nr_threads=200% |
> > |                  | runtime=300s |
> > |                  | test=TCP_MAERTS |
> > +------------------+----------------------------------------------------------------------------------------------------+
> > | testcase: change | will-it-scale: will-it-scale.per_thread_ops 68.8% regression |
> > | test machine     | 104 threads 2 sockets (Skylake) with 192G memory |
> > | test parameters  | cpufreq_governor=performance |
> > |                  | mode=thread |
> > |                  | nr_task=100% |
> > |                  | test=fallocate1 |
> > +------------------+----------------------------------------------------------------------------------------------------+
>
> Some of those as well.
>
> Anyway we should not be expecting the localtry_trylock_irqsave() itself to be
> failing and resulting in a slow path, as that would require an allocation
> attempt from an NMI. So what else does the commit do?
>
> >      0.10 ± 4%     +11.3      11.43 ± 3%  perf-profile.self.cycles-pp.try_charge_memcg
> >      0.00          +13.7      13.72 ± 2%  perf-profile.self.cycles-pp.page_counter_try_charge
>
> This does suggest more time spent in try_charge_memcg() because consume_stock() has failed.
>
> And I suspect this:
>
> +       if (!gfpflags_allow_spinning(gfp_mask))
> +               /* Avoid the refill and flush of the older stock */
> +               batch = nr_pages;
>
> because this will affect the refill even if consume_stock() fails not due to
> a trylock failure (which should not be happening), but also just because the
> stock was of a wrong memcg or depleted. So in the nowait context we deny the
> refill even if we have the memory. Attached patch could be used to see if it
> fixes things. I'm not sure about the testcases where it doesn't look like
> nowait context would be used though, let's see.

Not quite.
GFP_NOWAIT includes __GFP_KSWAPD_RECLAIM, so
gfpflags_allow_spinning() will return true.
So 'batch' won't change.

> I've also found this:
> https://lore.kernel.org/all/7s6fbpwsynadnzybhdqg3jwhls4pq2sptyxuyghxpaufhissj5@iadb6ibzscjj/

Right. And notice Shakeel's suggestion doesn't include '!' in the condition.
I assumed it's a typo.
Hence added it as "if (!gfpflags_allow_spinning(gfp_mask))"

> > > BTW after the done_restock tag in try_charge_memcg(), we will need another
> > > gfpflags_allow_spinning() check to avoid schedule_work() and
> > > mem_cgroup_handle_over_high(). Maybe simply return early for
> > > gfpflags_allow_spinning() without checking high marks.
>
> looks like a small possible optimization that was forgotten?
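
Right, something along these lines after the done_restock label, I think
(an untested sketch just to illustrate the idea; the exact context lines
may differ):

	done_restock:
		if (batch > nr_pages)
			refill_stock(memcg, batch - nr_pages);

		/*
		 * A context that cannot even spin on stock_lock should not
		 * schedule_work() or call mem_cgroup_handle_over_high(), so
		 * skip the high-mark processing and return right away.
		 */
		if (!gfpflags_allow_spinning(gfp_mask))
			return 0;

		/* ... existing high-mark handling continues here ... */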
> ----8<----
> From 29e7d18645577ce13d8a0140c0df050ce1ce0f95 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@xxxxxxx>
> Date: Mon, 10 Mar 2025 10:32:14 +0100
> Subject: [PATCH] memcg: Avoid stock refill only if stock_lock can't be acquired
>
> Since commit 01d37228d331 ("memcg: Use trylock to access memcg
> stock_lock.") consume_stock() can fail if it can't obtain
> memcg_stock.stock_lock. In that case try_charge_memcg() also avoids
> refilling or flushing the stock when gfp flags indicate we are in a
> context where obtaining the lock could fail.
>
> However consume_stock() can also fail because the stock was depleted, or
> belonged to a different memcg. Avoiding the stock refill then reduces
> the caching efficiency, as the refill could still succeed with memory
> available. This has caused various regressions to be reported by the
> kernel test robot.
>
> To fix this, make the decision to avoid stock refill more precise by
> making consume_stock() return -EBUSY when it fails to obtain stock_lock,
> and using that for the no-refill decision.
>
> Fixes: 01d37228d331 ("memcg: Use trylock to access memcg stock_lock.")
> Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> Closes: https://lore.kernel.org/oe-lkp/202503101254.cfd454df-lkp@xxxxxxxxx
> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> ---
>  mm/memcontrol.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 092cab99dec7..a8371a22c7f4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1772,22 +1772,23 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>   * stock, and at least @nr_pages are available in that stock. Failure to
>   * service an allocation will refill the stock.
>   *
> - * returns true if successful, false otherwise.
> + * returns 0 if successful, -EBUSY if lock cannot be acquired, or -ENOMEM
> + * if the memcg does not match or there are not enough pages
>   */
> -static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> +static int consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
>                           gfp_t gfp_mask)
>  {
>         struct memcg_stock_pcp *stock;
>         unsigned int stock_pages;
>         unsigned long flags;
> -       bool ret = false;
> +       int ret = -ENOMEM;
>
>         if (nr_pages > MEMCG_CHARGE_BATCH)
>                 return ret;
>
>         if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
>                 if (!gfpflags_allow_spinning(gfp_mask))
> -                       return ret;
> +                       return -EBUSY;
>                 localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
>         }
>
> @@ -1795,7 +1796,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
>         stock_pages = READ_ONCE(stock->nr_pages);
>         if (memcg == READ_ONCE(stock->cached) && stock_pages >= nr_pages) {
>                 WRITE_ONCE(stock->nr_pages, stock_pages - nr_pages);
> -               ret = true;
> +               ret = 0;
>         }
>
>         localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
> @@ -2228,13 +2229,18 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>         bool drained = false;
>         bool raised_max_event = false;
>         unsigned long pflags;
> +       int consume_ret;
>
>  retry:
> -       if (consume_stock(memcg, nr_pages, gfp_mask))
> +       consume_ret = consume_stock(memcg, nr_pages, gfp_mask);
> +       if (!consume_ret)
>                 return 0;
>
> -       if (!gfpflags_allow_spinning(gfp_mask))
> -               /* Avoid the refill and flush of the older stock */
> +       /*
> +        * Avoid the refill and flush of the older stock if we failed to
> +        * acquire the stock_lock
> +        */
> +       if (consume_ret == -EBUSY)
>                 batch = nr_pages;

Sure. I think it's a good optimization,
but I don't think it will make any difference here.
The Fixes tag is not appropriate and the commit log is off too.

I have a strong suspicion that this bot report is bogus.
I'll try to repro anyway.