Hi, Greg, good finding. See comments below.

On 01.06.2020 06:22, Greg Thelen wrote:
> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
> called when the per-memcg per-node shrinker_map indicates that the
> shrinker may have objects to release to the memcg and node.
>
> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
> shrinker which advertises per memcg and numa awareness. The shmem
> shrinker releases memory by splitting hugepages that extend beyond
> i_size.
>
> Shmem does not currently set bits in shrinker_map. So, starting with
> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
> pressure. This leads to undeserved memcg OOM kills.
> Example that reliably sees memcg OOM kill in unpatched kernel:
>   FS=/tmp/fs
>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>   mkdir -p $FS
>   mount -t tmpfs -o huge=always nodev $FS
>   # Create 1000 MB container, which shouldn't suffer OOM.
>   mkdir $CONTAINER
>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>   echo $BASHPID >> $CONTAINER/cgroup.procs
>   # Create 4000 files. Ideally each file uses 4k data page + a little
>   # metadata. Assume 8k total per-file, 32MB (4000*8k) should easily
>   # fit within container's 1000 MB. But if data pages use 2MB
>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>   # which hits memcg 1000 MB limit.
>   for i in {1..4000}; do
>     echo . > $FS/$i
>   done
>
> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
> aware") maintains the per-node per-memcg shrinker bitmap for THP
> shrinker. But there's no such logic in shmem. Make shmem set the
> per-memcg per-node shrinker bits when it modifies inodes to have
> shrinkable pages.
>
> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
> Cc: <stable@xxxxxxxxxxxxxxx> # 4.19+
> Signed-off-by: Greg Thelen <gthelen@xxxxxxxxxx>
> ---
>  mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 35 insertions(+), 26 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index bd8840082c94..e11090f78cb5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>  	return 0;
>  }
>
> +/*
> + * Expose inode and optional page to shrinker as having a possibly splittable
> + * hugepage that reaches beyond i_size.
> + */
> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
> +			       struct inode *inode, struct page *page)
> +{
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +	spin_lock(&sbinfo->shrinklist_lock);
> +	/*
> +	 * _careful to defend against unlocked access to ->shrink_list in
> +	 * shmem_unused_huge_shrink()
> +	 */
> +	if (list_empty_careful(&info->shrinklist)) {
> +		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
> +		sbinfo->shrinklist_len++;
> +	}
> +	spin_unlock(&sbinfo->shrinklist_lock);
> +
> +#ifdef CONFIG_MEMCG
> +	if (page && PageTransHuge(page))
> +		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
> +				       inode->i_sb->s_shrink.id);
> +#endif
> +}
> +
>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	struct inode *inode = d_inode(dentry);
> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  			 * to shrink under memory pressure.
>  			 */
>  			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> -				spin_lock(&sbinfo->shrinklist_lock);
> -				/*
> -				 * _careful to defend against unlocked access to
> -				 * ->shrink_list in shmem_unused_huge_shrink()
> -				 */
> -				if (list_empty_careful(&info->shrinklist)) {
> -					list_add_tail(&info->shrinklist,
> -							&sbinfo->shrinklist);
> -					sbinfo->shrinklist_len++;
> -				}
> -				spin_unlock(&sbinfo->shrinklist_lock);
> +				struct page *page;
> +
> +				page = find_get_page(inode->i_mapping,
> +					(newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
> +				shmem_shrinker_add(sbinfo, inode, page);
> +				if (page)
> +					put_page(page);

1) I'd move the PageTransHuge() check from shmem_shrinker_add() to here.
In case the page is not trans huge, it looks strange and completely
useless to add the inode to the shrinklist and then skip
memcg_set_shrinker_bit(). Nothing should be added to the shrinklist in
this case (see the sketch at the end of this mail).

2) In general I think this "last inode page splitter" does not fit the
SHRINKER_MEMCG_AWARE concept, and shmem_unused_huge_shrink() should be
reworked as a new separate !memcg-aware shrinker instead of the
.nr_cached_objects callback of the generic fs shrinker (also sketched
at the end of this mail).

CC: Kirill Shutemov

Kirill, are there any fundamental reasons to keep this shrinking logic
in the generic fs shrinker? Is there anything that rules out converting
it into a separate !memcg-aware shrinker?

>  			}
>  		}
>  	}
> @@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  	if (PageTransHuge(page) &&
>  	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>  			hindex + HPAGE_PMD_NR - 1) {
> -		/*
> -		 * Part of the huge page is beyond i_size: subject
> -		 * to shrink under memory pressure.
> -		 */
> -		spin_lock(&sbinfo->shrinklist_lock);
> -		/*
> -		 * _careful to defend against unlocked access to
> -		 * ->shrink_list in shmem_unused_huge_shrink()
> -		 */
> -		if (list_empty_careful(&info->shrinklist)) {
> -			list_add_tail(&info->shrinklist,
> -					&sbinfo->shrinklist);
> -			sbinfo->shrinklist_len++;
> -		}
> -		spin_unlock(&sbinfo->shrinklist_lock);
> +		shmem_shrinker_add(sbinfo, inode, page);
>  	}
>
>  	/*

Thanks, Kirill
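
P.S. To make 1) a bit more concrete, here is a rough, untested sketch of
how the shmem_setattr() hunk could look on top of your patch, assuming
shmem_shrinker_add() then drops its own PageTransHuge() test and expects
a huge page from the caller:

	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
		struct page *page;

		page = find_get_page(inode->i_mapping,
				(newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
		if (page) {
			/*
			 * Only a huge page straddling i_size is worth
			 * exposing to the shrinker; otherwise the inode
			 * never reaches the shrinklist at all.
			 */
			if (PageTransHuge(page))
				shmem_shrinker_add(sbinfo, inode, page);
			put_page(page);
		}
	}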
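
And for 2), roughly what I mean by a separate !memcg-aware shrinker.
Again only an untested sketch: the "huge_shrinker" field assumed to be
embedded in struct shmem_sb_info and the helper names below are made up,
and passing sc->nr_to_scan as the nr_to_split argument is a guess:

static unsigned long shmem_huge_shrink_count(struct shrinker *shrink,
					     struct shrink_control *sc)
{
	struct shmem_sb_info *sbinfo =
		container_of(shrink, struct shmem_sb_info, huge_shrinker);

	return READ_ONCE(sbinfo->shrinklist_len);
}

static unsigned long shmem_huge_shrink_scan(struct shrinker *shrink,
					    struct shrink_control *sc)
{
	struct shmem_sb_info *sbinfo =
		container_of(shrink, struct shmem_sb_info, huge_shrinker);

	/* Split huge pages hanging beyond i_size of shrinklist inodes. */
	return shmem_unused_huge_shrink(sbinfo, sc, sc->nr_to_scan);
}

/* Called from shmem_fill_super(); paired with unregister_shrinker(). */
static int shmem_register_huge_shrinker(struct shmem_sb_info *sbinfo)
{
	sbinfo->huge_shrinker.count_objects = shmem_huge_shrink_count;
	sbinfo->huge_shrinker.scan_objects  = shmem_huge_shrink_scan;
	sbinfo->huge_shrinker.seeks = DEFAULT_SEEKS;
	/* No SHRINKER_MEMCG_AWARE: driven from the global shrinker list. */
	return register_shrinker(&sbinfo->huge_shrinker);
}

The .nr_cached_objects/.free_cached_objects hooks and
shmem_unused_huge_count()/shmem_unused_huge_scan() could then presumably
go away.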