Hi Andrew,

I couldn't find the following patch in the mm-tree yet. Did this patch
somehow fall through the cracks?

Shakeel

On Tue, May 28, 2024 at 12:00:57PM GMT, Andrew Morton wrote:
>
> The patch titled
>      Subject: memcg: rearrange fields of mem_cgroup_per_node
> has been added to the -mm mm-unstable branch.  Its filename is
>      memcg-rearrange-fields-of-mem_cgroup_per_node.patch
>
> This patch will shortly appear at
>      https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/memcg-rearrange-fields-of-mem_cgroup_per_node.patch
>
> This patch will later appear in the mm-unstable branch at
>      git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
>
> Before you just go and hit "reply", please:
>    a) Consider who else should be cc'ed
>    b) Prefer to cc a suitable mailing list as well
>    c) Ideally: find the original patch on the mailing list and do a
>       reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
>
> The -mm tree is included into linux-next via the mm-everything
> branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> and is updated there every 2-3 working days
>
> ------------------------------------------------------
> From: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> Subject: memcg: rearrange fields of mem_cgroup_per_node
> Date: Tue, 28 May 2024 09:40:50 -0700
>
> The kernel test robot reported [1] a performance regression in the
> will-it-scale test suite's page_fault2 test case for commit 70a64b7919cb
> ("memcg: dynamically allocate lruvec_stats").  After inspection it seems
> like the commit has unintentionally introduced false cache sharing.
>
> After the commit, the fields of mem_cgroup_per_node which get read on the
> performance-critical path share a cacheline with the fields which get
> updated often on LRU page allocations or deallocations.  This has caused
> contention on that cacheline, and workloads which manipulate a lot of LRU
> pages regressed, as reported in the test report.
>
> The solution is to rearrange the fields of mem_cgroup_per_node such that
> the false sharing is eliminated.  Let's move all the read-only pointers
> to the start of the struct, followed by the memcg-v1-only fields, and
> keep the fields which get updated often at the end.
>
> Experiment setup: Ran fallocate1, fallocate2, page_fault1, page_fault2
> and page_fault3 from the will-it-scale test suite inside a three-level
> memcg with /tmp mounted as tmpfs on two different machines, one with a
> single NUMA node and the other with two NUMA nodes.
>
> $ ./[testcase]_processes -t $NR_CPUS -s 50
>
> Results for single node, 52 CPU machine:
>
> Testcase        base         with-patch
>
> fallocate1      1031081      1431291   (38.80 %)
> fallocate2      1029993      1421421   (38.00 %)
> page_fault1     2269440      3405788   (50.07 %)
> page_fault2     2375799      3572868   (50.30 %)
> page_fault3     28641143     28673950  ( 0.11 %)
>
> Results for dual node, 80 CPU machine:
>
> Testcase        base         with-patch
>
> fallocate1      2976288      3641185   (22.33 %)
> fallocate2      2979366      3638181   (22.11 %)
> page_fault1     6221790      7748245   (24.53 %)
> page_fault2     6482854      7847698   (21.05 %)
> page_fault3     28804324     28991870  ( 0.65 %)
>
> Link: https://lkml.kernel.org/r/20240528164050.2625718-1-shakeel.butt@xxxxxxxxx
> Fixes: 70a64b7919cb ("memcg: dynamically allocate lruvec_stats")
> Signed-off-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> Reviewed-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Reviewed-by: Roman Gushchin <roman.gushchin@xxxxxxxxx>
> Cc: Feng Tang <feng.tang@xxxxxxxxx>
> Cc: "Huang, Ying" <ying.huang@xxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Cc: Muchun Song <muchun.song@xxxxxxxxx>
> Cc: Yin Fengwei <fengwei.yin@xxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
>
>  include/linux/memcontrol.h |   22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
>
> --- a/include/linux/memcontrol.h~memcg-rearrange-fields-of-mem_cgroup_per_node
> +++ a/include/linux/memcontrol.h
> @@ -96,23 +96,29 @@ struct mem_cgroup_reclaim_iter {
>   * per-node information in memory controller.
>   */
>  struct mem_cgroup_per_node {
> -	struct lruvec		lruvec;
> +	/* Keep the read-only fields at the start */
> +	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
> +						/* use container_of	*/
>
>  	struct lruvec_stats_percpu __percpu	*lruvec_stats_percpu;
>  	struct lruvec_stats			*lruvec_stats;
> -
> -	unsigned long		lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
> -
> -	struct mem_cgroup_reclaim_iter	iter;
> -
>  	struct shrinker_info __rcu	*shrinker_info;
>
> +	/*
> +	 * Memcg-v1 only stuff in middle as buffer between read mostly fields
> +	 * and update often fields to avoid false sharing. Once v1 stuff is
> +	 * moved in a separate struct, an explicit padding is needed.
> +	 */
> +
>  	struct rb_node		tree_node;	/* RB tree node */
>  	unsigned long		usage_in_excess;/* Set to the value by which */
>  						/* the soft limit is exceeded*/
>  	bool			on_tree;
> -	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
> -						/* use container_of	*/
> +
> +	/* Fields which get updated often at the end. */
> +	struct lruvec		lruvec;
> +	unsigned long		lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
> +	struct mem_cgroup_reclaim_iter	iter;
>  };
>
>  struct mem_cgroup_threshold {
> _
>
> Patches currently in -mm which might be from shakeel.butt@xxxxxxxxx are
>
> memcg-rearrange-fields-of-mem_cgroup_per_node.patch
>
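For illustration only, here is a minimal userspace sketch of the false-sharing
idea behind the rearrangement above. The struct and field names
(node_info_shared, node_info_split, read_mostly_ptr, hot_counter,
CACHELINE_SIZE) are invented for this sketch and are not part of the kernel
patch; in the actual mem_cgroup_per_node the memcg-v1-only fields act as the
buffer instead of explicit padding.

/*
 * Illustrative sketch, not kernel code: show how field placement decides
 * whether a read-mostly pointer shares a cacheline with a hot counter.
 */
#include <stdio.h>
#include <stddef.h>

#define CACHELINE_SIZE 64	/* assumed cacheline size for this sketch */

/* "Bad" layout: the hot counter shares a cacheline with the pointer. */
struct node_info_shared {
	void *read_mostly_ptr;		/* read on the fault path */
	unsigned long hot_counter;	/* bumped on every LRU add/remove */
};

/* "Good" layout: padding pushes the hot field onto the next cacheline. */
struct node_info_split {
	void *read_mostly_ptr;
	char pad[CACHELINE_SIZE - sizeof(void *)];
	unsigned long hot_counter;
};

int main(void)
{
	/* offsetof() shows whether both fields land in the first 64 bytes. */
	printf("shared: read_mostly_ptr@%zu hot_counter@%zu (same cacheline)\n",
	       offsetof(struct node_info_shared, read_mostly_ptr),
	       offsetof(struct node_info_shared, hot_counter));
	printf("split:  read_mostly_ptr@%zu hot_counter@%zu (next cacheline)\n",
	       offsetof(struct node_info_split, read_mostly_ptr),
	       offsetof(struct node_info_split, hot_counter));
	return 0;
}

The real layout of the struct in a built kernel can be inspected with pahole
(e.g. "pahole -C mem_cgroup_per_node vmlinux"), assuming the kernel was built
with debug info.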