Re: [linux-next:master] [memcg] 01d37228d3: netperf.Throughput_Mbps 37.9% regression

On 3/10/25 06:50, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed a 37.9% regression of netperf.Throughput_Mbps on:

I assume this is some network receive context where gfpflags do not allow
blocking.
 
> commit: 01d37228d331047a0bbbd1026cec2ccabef6d88d ("memcg: Use trylock to access memcg stock_lock.")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> [test failed on linux-next/master 7ec162622e66a4ff886f8f28712ea1b13069e1aa]
> 
> testcase: netperf
> config: x86_64-rhel-9.4
> compiler: gcc-12
> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> parameters:
> 
> 	ip: ipv4
> 	runtime: 300s
> 	nr_threads: 50%
> 	cluster: cs-localhost
> 	test: TCP_MAERTS
> 	cpufreq_governor: performance
> 
> 
> In addition to that, the commit also has significant impact on the following tests:
> 
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | stress-ng: stress-ng.mmapfork.ops_per_sec  63.5% regression                                        |

Hm, interesting; judging by the name, this one at least would be a GFP_KERNEL context?

> | test machine     | 256 threads 4 sockets INTEL(R) XEON(R) PLATINUM 8592+ (Emerald Rapids) with 256G memory            |
> | test parameters  | cpufreq_governor=performance                                                                       |
> |                  | nr_threads=100%                                                                                    |
> |                  | test=mmapfork                                                                                      |
> |                  | testtime=60s                                                                                       |
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | hackbench: hackbench.throughput  26.6% regression                                                  |
> | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory         |
> | test parameters  | cpufreq_governor=performance                                                                       |
> |                  | ipc=socket                                                                                         |
> |                  | iterations=4                                                                                       |
> |                  | mode=threads                                                                                       |
> |                  | nr_threads=100%                                                                                    |
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec  33.0% regression                               |
> | test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory        |
> | test parameters  | cpufreq_governor=performance                                                                       |
> |                  | mode=development                                                                                   |
> |                  | nr_threads=100%                                                                                    |
> |                  | test=TCP                                                                                           |
> |                  | test_memory_size=50%                                                                               |
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | vm-scalability: vm-scalability.throughput  86.8% regression                                        |
> | test machine     | 256 threads 4 sockets INTEL(R) XEON(R) PLATINUM 8592+ (Emerald Rapids) with 256G memory            |
> | test parameters  | cpufreq_governor=performance                                                                       |
> |                  | runtime=300s                                                                                       |
> |                  | size=1T                                                                                            |
> |                  | test=lru-shm                                                                                       |
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | netperf: netperf.Throughput_Mbps 39.9% improvement                                                 |

An improvement? Weird.

> | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory         |
> | test parameters  | cluster=cs-localhost                                                                               |
> |                  | cpufreq_governor=performance                                                                       |
> |                  | ip=ipv4                                                                                            |
> |                  | nr_threads=200%                                                                                    |
> |                  | runtime=300s                                                                                       |
> |                  | test=TCP_MAERTS                                                                                    |
> +------------------+----------------------------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops  68.8% regression                                      |
> | test machine     | 104 threads 2 sockets (Skylake) with 192G memory                                                   |
> | test parameters  | cpufreq_governor=performance                                                                       |
> |                  | mode=thread                                                                                        |
> |                  | nr_task=100%                                                                                       |
> |                  | test=fallocate1                                                                                    |
> +------------------+----------------------------------------------------------------------------------------------------+

Some of the other testcases look like GFP_KERNEL contexts as well.

Anyway, we should not expect localtry_trylock_irqsave() itself to fail and
force the slow path, as that would require an allocation attempt from an NMI.
So what else does the commit do?
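
For context, the consume_stock() change from the commit is roughly the
following (a simplified sketch from memory, not the verbatim upstream code);
the trylock there should only fail when a charge attempt interrupts the lock
holder on the same CPU:

/* Simplified sketch of the commit's consume_stock(), not verbatim. */
static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
			  gfp_t gfp_mask)
{
	struct memcg_stock_pcp *stock;
	unsigned long flags;
	bool ret = false;

	if (nr_pages > MEMCG_CHARGE_BATCH)
		return ret;

	if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
		/*
		 * The trylock can only fail when the lock holder on this CPU
		 * was interrupted, e.g. by a charge attempt from an NMI-like
		 * (e.g. BPF) context. Only then is the stock bypassed;
		 * otherwise fall back to the regular spinning acquisition.
		 */
		if (!gfpflags_allow_spinning(gfp_mask))
			return ret;
		localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
	}

	stock = this_cpu_ptr(&memcg_stock);
	if (memcg == READ_ONCE(stock->cached) && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		ret = true;
	}

	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
	return ret;
}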

>       0.10 ±  4%     +11.3       11.43 ±  3%  perf-profile.self.cycles-pp.try_charge_memcg
>       0.00           +13.7       13.72 ±  2%  perf-profile.self.cycles-pp.page_counter_try_charge

This does suggest more time spent in try_charge_memcg() because consume_stock() has failed.

And I suspect this:

+       if (!gfpflags_allow_spinning(gfp_mask))
+               /* Avoid the refill and flush of the older stock */
+               batch = nr_pages;

because this will affect the refill even when consume_stock() fails not due to
a trylock failure (which should not be happening), but simply because the
cached stock belonged to a different memcg or was depleted. So in the nowait
context we deny the refill even when we have the memory. The attached patch
could be used to see if it fixes things. I'm not sure about the testcases
where it doesn't look like a nowait context would be used, though; let's see.
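
For illustration, one way to express that idea would be to clamp the batch
only when consume_stock() actually failed to take the stock_lock, e.g. via a
hypothetical stock_lock_failed flag it reports back (a sketch of the concept
only, not necessarily what the attached patch does):

-	if (!gfpflags_allow_spinning(gfp_mask))
-		/* Avoid the refill and flush of the older stock */
-		batch = nr_pages;
+	/*
+	 * Hypothetical sketch: stock_lock_failed would be set by
+	 * consume_stock() only when localtry_trylock_irqsave() failed,
+	 * so a merely depleted or foreign stock still gets refilled.
+	 */
+	if (stock_lock_failed)
+		/* Avoid the refill and flush of the older stock */
+		batch = nr_pages;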

I've also found this:
https://lore.kernel.org/all/7s6fbpwsynadnzybhdqg3jwhls4pq2sptyxuyghxpaufhissj5@iadb6ibzscjj/

> 
> BTW after the done_restock tag in try_charge_memcg(), we will need another
> gfpflags_allow_spinning() check to avoid schedule_work() and
> mem_cgroup_handle_over_high(). Maybe simply return early for
> gfpflags_allow_spinning() without checking high marks.

looks like a small possible optimization that was forgotten?
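
If so, the suggestion would translate to roughly the following at the
done_restock label (again only a hypothetical sketch):

 done_restock:
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
+
+	/*
+	 * Hypothetical sketch of the quoted suggestion: a context that cannot
+	 * spin also cannot schedule_work() or run
+	 * mem_cgroup_handle_over_high(), so return early without the
+	 * high-mark checks.
+	 */
+	if (!gfpflags_allow_spinning(gfp_mask))
+		return 0;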
 ----8<----
