On Thu, Mar 19, 2020 at 4:18 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, Mar 17, 2020 at 02:41:53PM +0900, js1304@xxxxxxxxx wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> >
> > In the following patch, workingset detection is implemented for the
> > swap cache. The swap cache's node is usually allocated by kswapd and it
> > isn't charged to a kmemcg since it comes from a kernel thread. So the
> > swap cache's shadow node is managed by the node list of the list_lru
> > rather than the memcg-specific one.
> >
> > If counting the shadow nodes on the root memcg happens to reclaim the
> > slab object, the shadow node count returns the number of shadow nodes
> > on the node list of the list_lru, since the root memcg has the
> > kmem_cache_id of -1.
> >
> > However, the size of pages on the LRU is calculated using the specific
> > memcg, so a mismatch happens. This causes the number of shadow nodes
> > not to grow to a sufficient size and, therefore, workingset detection
> > cannot work correctly. This patch fixes this bug by checking whether
> > the memcg is the root memcg. If it is, instead of using the
> > memcg-specific LRU, the system-wide LRU is used to calculate the proper
> > size of the shadow nodes so that the number of shadow nodes can grow
> > as expected.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> > ---
> >  mm/workingset.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 5fb8f85..a9f474a 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> >  	 * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
> >  	 */
> >  #ifdef CONFIG_MEMCG
> > -	if (sc->memcg) {
> > +	/*
> > +	 * Kernel allocation on the root memcg isn't regarded as an
> > +	 * allocation of a specific memcg. So, if sc->memcg is the root
> > +	 * memcg, we need to use the count for the node rather than the
> > +	 * one for the specific memcg.
> > +	 */
> > +	if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
>
> This is no good, unfortunately.
>
> It allows the root cgroup's shadows to grow way too large. Consider a
> large memory system where several workloads run in containers and only
> some host software runs in the root, yet that tiny root group will
> grow shadow entries in proportion to the entire RAM.

Okay.

> IMO, we have some choices here:
>
> 1. We say the swapcache is a shared system facility and its memory is
>    not accounted to anyone. In that case, we should either
>    1a. Reclaim them to a fixed threshold, regardless of cgroup, or
>    1b. Not reclaim them at all. Or
> 2. We account all nodes to the groups for which they are allocated.
> Essentially like this:
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8e7ce9a9bc5e..d0d0dcc357fb 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -125,6 +125,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
>  	page_ref_add(page, nr);
>  	SetPageSwapCache(page);
>
> +	memalloc_use_memcg(page_memcg(page));
>  	do {
>  		xas_lock_irq(&xas);
>  		xas_create_range(&xas);
> @@ -142,6 +143,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
>  unlock:
>  	xas_unlock_irq(&xas);
>  	} while (xas_nomem(&xas, gfp));
> +	memalloc_unuse_memcg();
>
>  	if (!xas_error(&xas))
>  		return 0;
> @@ -605,7 +607,8 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
>  		return -ENOMEM;
>  	for (i = 0; i < nr; i++) {
>  		space = spaces + i;
> -		xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
> +		xa_init_flags(&space->i_pages,
> +			      XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
>  		atomic_set(&space->i_mmap_writable, 0);
>  		space->a_ops = &swap_aops;
>  		/* swap cache doesn't use writeback related tags */
>
> (A reclaimer has PF_MEMALLOC set, so we'll bypass the limit when
> recursing into charging the node.)
>
> I'm leaning more toward 1b, actually. The shadow shrinker was written
> because the combined address space of files on the filesystem can
> easily be in the terabytes, and practically unbounded with sparse
> files. The shadow shrinker is there to keep users from DoSing the
> system with shadow entries for files.
>
> However, the swap address space is bounded by a privileged user, and
> its size is usually in the GB range. On my system, radix_tree_node is
> ~583 bytes, so for a 16G swapfile, the swapcache xarray should max
> out below 40M (36M worth of leaf nodes, plus some intermediate nodes).
>
> It doesn't seem worth messing with the shrinker at all for these.

40M / 16G, i.e. 0.25% of the used swap, looks okay to me.
I will rework the patch in that way.

Thanks.
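
For reference, a rough back-of-the-envelope check of the 36M / 40M figures
above, assuming 4K pages and the usual 64-slot xarray nodes (XA_CHUNK_SIZE),
with the ~583-byte node size quoted in the mail:

    16G of swap / 4K per page               = 4,194,304 swap cache entries
    4,194,304 entries / 64 entries per node =    65,536 leaf xa_nodes
    65,536 nodes * ~583 bytes per node      =    ~36.4M for the leaf level
    + interior nodes (~1/64 of the leaves)  =      ~37M total, i.e. below 40M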