On Sun, Jan 06, 2019 at 10:08:52AM +0800, Fam Zheng wrote:
> 
> > On Jan 6, 2019, at 05:09, Roman Gushchin <guro@xxxxxx> wrote:
> > 
> > On Fri, Jan 04, 2019 at 12:43:40PM +0800, Fam Zheng wrote:
> >> Hi,
> >> 
> >> In our servers, which frequently spawn containers, we find that if a
> >> process used page cache inside a memory cgroup, then after the process
> >> exits and the memory cgroup is offlined, the cgroup cannot be destroyed
> >> until the page cache is dropped, because the page cache is still charged
> >> to it. This builds up huge memory pressure over time: we have found more
> >> than a hundred thousand such offlined memory cgroups in a system, holding
> >> too much memory (~100G). This memory cannot be released immediately even
> >> after all the associated page caches are released, because those memory
> >> cgroups are destroyed asynchronously by a kworker. In some cases this can
> >> cause an OOM, since a synchronous memory allocation fails.
> >> 
> >> We think a fix is to create a kworker that scans all page caches, dentry
> >> caches, etc. in the background; if a referenced memory cgroup is offline,
> >> it tries to drop the cache or move the charge to the parent cgroup. This
> >> kworker can wake up periodically, on memory cgroup offline events, or
> >> both.
> >> 
> >> There is a similar problem with inodes. After digging into the ext4 code,
> >> we found that the inode cache is created with SLAB_ACCOUNT, so inodes are
> >> allocated from a slab charged to the current memory cgroup. After this
> >> memory cgroup goes offline, the inode may still be held by a dentry
> >> cache. If another process uses the same file, the inode is held by that
> >> process, preventing the previous memory cgroup from being destroyed until
> >> the other process closes the file and the dentry cache is dropped.
> >> 
> >> We still don't have a reasonable way to fix this.
> >> 
> >> Ideas?
> > 
> > Hi, Fam!
> 
> Hi!
> 
> > Which kernel version are you on?
> 
> We've seen the issue in a range of versions from 4.4 to 4.19.

There are several commits.
It looks like only one is included in 4.19, so you need to take 4.20 or
backport them.

> > I made some changes recently to fix a memcg "leak", or better to say, make
> > memcg reclaim possible under normal conditions. Before that we were
> > accumulating a big number of dying cgroups, which matches your description.
> 
> Is there a commit id?

a76cf1a474d7 mm: don't reclaim inodes with many attached pages
68600f623d69 mm: don't miss the last page because of round-off error
591edfb10a94 mm: drain memcg stocks on css offlining
172b06c32b94 mm: slowly shrink slabs with a relatively small number of objects
9b6f7e163cd0 mm: rework memcg kernel stack accounting

The last one contained a bug, so there are a fix, and a fix of the fix, in
the current mm tree:

5eed6f1dff87 fork,memcg: fix crash in free_thread_stack on memcg charge fail
bb5ac5dfdd3c fork, memcg: fix cached_stacks case

Thanks!
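The SLAB_ACCOUNT behaviour mentioned for the ext4 inode cache comes from the
flags passed at cache creation time. A kernel-style sketch, not verbatim from
fs/ext4/super.c (the exact helper and flag set vary by kernel version):

```c
/*
 * Illustrative fragment: ext4 creates its inode cache with SLAB_ACCOUNT,
 * so each inode is charged to the memcg of the task that allocated it.
 * A later user of the same inode therefore keeps pinning the original
 * (possibly already offlined) memory cgroup.
 */
ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
                                      sizeof(struct ext4_inode_info),
                                      0,
                                      SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD |
                                      SLAB_ACCOUNT,
                                      init_once);
```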
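The buildup of dying cgroups described above can be observed on a cgroup v2
hierarchy: cgroup.stat exposes nr_dying_descendants, the number of cgroups
that have been offlined but not yet freed. A minimal sketch, assuming cgroup
v2 is mounted at /sys/fs/cgroup; the sample numbers at the end are made up so
the demo runs even without cgroup v2:

```shell
# Path assumes a cgroup v2 mount at /sys/fs/cgroup.
stat_file=/sys/fs/cgroup/cgroup.stat

count_dying() {
    # $1: path to a cgroup.stat file; print the nr_dying_descendants value.
    awk '$1 == "nr_dying_descendants" { print $2 }' "$1"
}

# On a real system (guarded so the script still works without cgroup v2):
if [ -r "$stat_file" ]; then
    count_dying "$stat_file"
fi

# Demo on canned cgroup.stat contents (values are hypothetical):
sample=$(mktemp)
printf 'nr_descendants 153\nnr_dying_descendants 104823\n' > "$sample"
count_dying "$sample"   # prints 104823
rm -f "$sample"
```

A large, steadily growing nr_dying_descendants matches the symptom reported
in this thread.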