Re: [PATCH v3 5/9] mm/workingset: use the node counter if memcg is the root memcg

Joonsoo Kim <js1304@xxxxxxxxx> · Thu, 19 Mar 2020 15:20:54 +0900

2020년 3월 19일 (목) 오전 4:18, Johannes Weiner <hannes@xxxxxxxxxxx>님이 작성:
>
> On Tue, Mar 17, 2020 at 02:41:53PM +0900, js1304@xxxxxxxxx wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> >
> > In the following patch, workingset detection is implemented for the
> > swap cache. Swap cache's node is usually allocated by kswapd and it
> > isn't charged by kmemcg since it is from the kernel thread. So the swap
> > cache's shadow node is managed by the node list of the list_lru rather
> > than the memcg specific one.
> >
> > If counting the shadow node on the root memcg happens to reclaim the slab
> > object, the shadow node count returns the number of the shadow node on
> > the node list of the list_lru since root memcg has the kmem_cache_id, -1.
> >
> > However, the size of pages on the LRU is calculated by using the specific
> > memcg, so mismatch happens. This causes the number of shadow node not to
> > be increased to the enough size and, therefore, workingset detection
> > cannot work correctly. This patch fixes this bug by checking if the memcg
> > is the root memcg or not. If it is the root memcg, instead of using
> > the memcg-specific LRU, the system-wide LRU is used to calculate proper
> > size of the shadow node so that the number of the shadow node can grow
> > as expected.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> > ---
> >  mm/workingset.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 5fb8f85..a9f474a 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> >        * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
> >        */
> >  #ifdef CONFIG_MEMCG
> > -     if (sc->memcg) {
> > +     /*
> > +      * Kernel allocation on root memcg isn't regarded as allocation of
> > +      * specific memcg. So, if sc->memcg is the root memcg, we need to
> > +      * use the count for the node rather than one for the specific
> > +      * memcg.
> > +      */
> > +     if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
>
> This is no good, unfortunately.
>
> It allows the root cgroup's shadows to grow way too large. Consider a
> large memory system where several workloads run in containers and only
> some host software runs in the root, yet that tiny root group will
> grow shadow entries in proportion to the entire RAM.

Okay.

> IMO, we have some choices here:
>
> 1. We say the swapcache is a shared system facility and its memory is
> not accounted to anyone. In that case, we should either
>    1a. Reclaim them to a fixed threshold, regardless of cgroup, or
>    1b. Not reclaim them at all. Or
> 2. We account all nodes to the groups for which they are allocated.
>    Essentially like this:
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8e7ce9a9bc5e..d0d0dcc357fb 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -125,6 +125,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
>         page_ref_add(page, nr);
>         SetPageSwapCache(page);
>
> +       memalloc_use_memcg(page_memcg(page));
>         do {
>                 xas_lock_irq(&xas);
>                 xas_create_range(&xas);
> @@ -142,6 +143,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
>  unlock:
>                 xas_unlock_irq(&xas);
>         } while (xas_nomem(&xas, gfp));
> +       memalloc_unuse_memcg();
>
>         if (!xas_error(&xas))
>                 return 0;
> @@ -605,7 +607,8 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
>                 return -ENOMEM;
>         for (i = 0; i < nr; i++) {
>                 space = spaces + i;
> -               xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
> +               xa_init_flags(&space->i_pages,
> +                             XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
>                 atomic_set(&space->i_mmap_writable, 0);
>                 space->a_ops = &swap_aops;
>                 /* swap cache doesn't use writeback related tags */
>
> (A reclaimer has PF_MEMALLOC set, so we'll bypass the limit when
> recursing into charging the node.)
>
> I'm leaning more toward 1b, actually. The shadow shrinker was written
> because the combined address space of files on the filesystem can
> easily be in the terabytes, and practically unbounded with sparse
> files. The shadow shrinker is there to keep users from DoSing the
> system with shadow entries for files.
>
> However, the swap address space is bounded by a privileged user. And
> the size is usually in the GB range. On my system, radix_tree_node is
> ~583 bytes, so a for a 16G swapfile, the swapcache xarray should max
> out below 40M (36M worth of leaf nodes, plus some intermediate nodes).
>
> It doesn't seem worth messing with the shrinker at all for these.

40M / 16G, 0.25% of the amount of the used swap looks okay to me.
I will rework the patch on that way.

Thanks.