Re: [PATCH v2 3/5] mm: memcg: merge multiple page_counters into a single structure

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jul 25, 2024 at 05:42:27PM -0400, Johannes Weiner wrote:
> On Wed, Jul 24, 2024 at 08:21:01PM +0000, Roman Gushchin wrote:
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -5,14 +5,71 @@
> >  #include <linux/atomic.h>
> >  #include <linux/cache.h>
> >  #include <linux/limits.h>
> > +#include <linux/mm_types.h>
> >  #include <asm/page.h>
> >  
> > +/*
> > + * Page counters are used by memory and hugetlb cgroups.
> > + * Memory cgroups are using up to 4 separate counters:
> > + * memory, swap (memory + swap on cgroup v1), kmem and tcpmem.
> > + * Hugetlb cgroups are using 2 arrays of counters with HUGE_MAX_HSTATE
> > + * in each to track the usage and reservations of each supported
> > + * hugepage size.
> > + *
> > + * Protection (min/low) is supported only for the first counter
> > + * with idx 0 and only if the counter was initialized with the protection
> > + * support.
> > + */
> > +
> > +enum mem_counter_type {
> > +#ifdef CONFIG_MEMCG
> > +	/* Unified memory counter */
> > +	MCT_MEM,
> > +
> > +	/* Swap */
> > +	MCT_SWAP,
> > +
> > +	/* Memory + swap */
> > +	MCT_MEMSW = MCT_SWAP,
> > +
> > +#ifdef CONFIG_MEMCG_V1
> > +	/* Kernel memory */
> > +	MCT_KMEM,
> > +
> > +	/* Tcp memory */
> > +	MCT_TCPMEM,
> > +#endif /* CONFIG_MEMCG_V1 */
> > +#endif /* CONFIG_MEMCG */
> > +
> > +	/* Maximum number of memcg counters */
> > +	MCT_NR_MEMCG_ITEMS,
> > +};
> > +
> > +#ifdef CONFIG_CGROUP_HUGETLB
> > +#ifdef HUGE_MAX_HSTATE
> > +#define MCT_NR_HUGETLB_ITEMS HUGE_MAX_HSTATE
> > +#else
> > +#define MCT_NR_HUGETLB_ITEMS 1
> > +#endif
> > +
> > +/*
> > + * max() can't be used here: even though __builtin_choose_expr() evaluates
> > + * to true, the false clause generates a compiler error:
> > + * error: braced-group within expression allowed only inside a function .
> > + */
> > +#define MCT_NR_ITEMS (__builtin_choose_expr(MCT_NR_MEMCG_ITEMS > MCT_NR_HUGETLB_ITEMS, \
> > +					    MCT_NR_MEMCG_ITEMS, MCT_NR_HUGETLB_ITEMS))
> > +
> > +#else /* CONFIG_CGROUP_HUGETLB */
> > +#define MCT_NR_ITEMS MCT_NR_MEMCG_ITEMS
> > +#endif /* CONFIG_CGROUP_HUGETLB */
> > +
> >  struct page_counter {
> >  	/*
> >  	 * Make sure 'usage' does not share cacheline with any other field. The
> >  	 * memcg->memory.usage is a hot member of struct mem_cgroup.
> >  	 */
> > -	atomic_long_t usage;
> > +	atomic_long_t usage[MCT_NR_ITEMS];
> >  	CACHELINE_PADDING(_pad1_);
> >  
> >  	/* effective memory.min and memory.min usage tracking */
> > @@ -25,9 +82,9 @@ struct page_counter {
> >  	atomic_long_t low_usage;
> >  	atomic_long_t children_low_usage;
> >  
> > -	unsigned long watermark;
> > -	unsigned long local_watermark; /* track min of fd-local resets */
> > -	unsigned long failcnt;
> > +	unsigned long watermark[MCT_NR_ITEMS];
> > +	unsigned long local_watermark[MCT_NR_ITEMS]; /* track min of fd-local resets */
> > +	unsigned long failcnt[MCT_NR_ITEMS];
> >  
> >  	/* Keep all the read most fields in a separete cacheline. */
> >  	CACHELINE_PADDING(_pad2_);
> > @@ -35,8 +92,9 @@ struct page_counter {
> >  	bool protection_support;
> >  	unsigned long min;
> >  	unsigned long low;
> > -	unsigned long high;
> > -	unsigned long max;
> > +	unsigned long high[MCT_NR_ITEMS];
> > +	unsigned long max[MCT_NR_ITEMS];
> > +
> >  	struct page_counter *parent;
> >  } ____cacheline_internodealigned_in_smp;
> 
> This hardcodes way too much user specifics into what should be a
> self-contained piece of abstraction. Should anybody else try to use
> the 'struct page_counter' for a single resource elsewhere, they'd end
> up with up to 5 counters, watermarks, failcnts, highs and maxs etc.
> 
> It's clear that the hierarchical aspect of the page_counter no longer
> works. It was okay-ish when all we had was a simple hard limit. Now we
> carry to much stuff in it that is only useful to a small set of users.
> Instead of doubling down and repeating the mistakes we made for struct
> page, it would be better to move user specific stuff out of it
> entirely. I think there are two patterns that would be more natural:
> a) pass a callback function and have a priv pointer in page_counter
> for user-specifics; or b) remove the hierarchy aspect from the page
> counter entirely, let the *caller* walk the tree, call the page
> counter code to trycharge (and log watermark) only that level, and
> handle everything caller specific on the caller side.
>
> All users already have parent linkage in the css, so this would
> eliminate the parent pointer in page_counter altogether.
> 
> The protection stuff could be moved to memcg proper, eliminating yet
> more fields that aren't interesting to any other user.

Hm, Idk, I do agree with what you're saying about the self-contained
piece of abstraction (and I had very similar thoughts in the process),
but there are also some complications.

First, funny enough, the protection calculation code was just moved to
mm/page_counter.c by a8585ac68621 ("mm/page_counter: move calculating protection
values to page_counter"). The commit log says that it's gonna be used by the drm
cgroup controller, but the code is not (yet?) upstream apparently. I don't have
any insights here. If there will be the second user for the protection
functionality, moving it back to memcontrol.c is not feasible.

Second, I agree that it would be nice to get rid of the parent pointer in
struct page_cgroup entirely and use one in css. But Idk how to do it
without making the code way more messy or duplicate a lot of tree walks.
In C++ (or another language with generics) we could make struct page_counter
taking the number of counters and the set of features as template parameters.

I feel like with memcg1 code being factored out the benefit of this reorg
lowered, so how about me resending 2 first patches separately and spending
more time on thinking on the best long-term strategy? I have some ideas
here and I thought about this work as a preparatory step, but maybe you're
right and it makes sense to approach it more comprehensively.

Thanks!




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux