On Tue, Mar 17, 2020 at 02:41:53PM +0900, js1304@xxxxxxxxx wrote: > From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx> > > In the following patch, workingset detection is implemented for the > swap cache. Swap cache's node is usually allocated by kswapd and it > isn't charged by kmemcg since it is from the kernel thread. So the swap > cache's shadow node is managed by the node list of the list_lru rather > than the memcg specific one. > > If counting the shadow node on the root memcg happens to reclaim the slab > object, the shadow node count returns the number of the shadow node on > the node list of the list_lru since root memcg has the kmem_cache_id, -1. > > However, the size of pages on the LRU is calculated by using the specific > memcg, so mismatch happens. This causes the number of shadow node not to > be increased to the enough size and, therefore, workingset detection > cannot work correctly. This patch fixes this bug by checking if the memcg > is the root memcg or not. If it is the root memcg, instead of using > the memcg-specific LRU, the system-wide LRU is used to calculate proper > size of the shadow node so that the number of the shadow node can grow > as expected. > > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx> > --- > mm/workingset.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/mm/workingset.c b/mm/workingset.c > index 5fb8f85..a9f474a 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, > * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE > */ > #ifdef CONFIG_MEMCG > - if (sc->memcg) { > + /* > + * Kernel allocation on root memcg isn't regarded as allocation of > + * specific memcg. So, if sc->memcg is the root memcg, we need to > + * use the count for the node rather than one for the specific > + * memcg. > + */ > + if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) { This is no good, unfortunately. It allows the root cgroup's shadows to grow way too large. Consider a large memory system where several workloads run in containers and only some host software runs in the root, yet that tiny root group will grow shadow entries in proportion to the entire RAM. IMO, we have some choices here: 1. We say the swapcache is a shared system facility and its memory is not accounted to anyone. In that case, we should either 1a. Reclaim them to a fixed threshold, regardless of cgroup, or 1b. Not reclaim them at all. Or 2. We account all nodes to the groups for which they are allocated. Essentially like this: diff --git a/mm/swap_state.c b/mm/swap_state.c index 8e7ce9a9bc5e..d0d0dcc357fb 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -125,6 +125,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp) page_ref_add(page, nr); SetPageSwapCache(page); + memalloc_use_memcg(page_memcg(page)); do { xas_lock_irq(&xas); xas_create_range(&xas); @@ -142,6 +143,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp) unlock: xas_unlock_irq(&xas); } while (xas_nomem(&xas, gfp)); + memalloc_unuse_memcg(); if (!xas_error(&xas)) return 0; @@ -605,7 +607,8 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages) return -ENOMEM; for (i = 0; i < nr; i++) { space = spaces + i; - xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ); + xa_init_flags(&space->i_pages, + XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT); atomic_set(&space->i_mmap_writable, 0); space->a_ops = &swap_aops; /* swap cache doesn't use writeback related tags */ (A reclaimer has PF_MEMALLOC set, so we'll bypass the limit when recursing into charging the node.) I'm leaning more toward 1b, actually. The shadow shrinker was written because the combined address space of files on the filesystem can easily be in the terabytes, and practically unbounded with sparse files. The shadow shrinker is there to keep users from DoSing the system with shadow entries for files. However, the swap address space is bounded by a privileged user. And the size is usually in the GB range. On my system, radix_tree_node is ~583 bytes, so a for a 16G swapfile, the swapcache xarray should max out below 40M (36M worth of leaf nodes, plus some intermediate nodes). It doesn't seem worth messing with the shrinker at all for these.