On Sat, Jan 4, 2020 at 11:55 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Tue, Dec 24, 2019 at 02:53:26AM -0500, Yafang Shao wrote: > > On my server there're some running MEMCGs protected by memory.{min, low}, > > but I found the usage of these MEMCGs abruptly became very small, which > > were far less than the protect limit. It confused me and finally I > > found that was because of inode stealing. > > Once an inode is freed, all its belonging page caches will be dropped as > > well, no matter how may page caches it has. So if we intend to protect the > > page caches in a memcg, we must protect their host (the inode) first. > > Otherwise the memcg protection can be easily bypassed with freeing inode, > > especially if there're big files in this memcg. > > > > Supposes we have a memcg, and the stat of this memcg is, > > memory.current = 1024M > > memory.min = 512M > > And in this memcg there's a inode with 800M page caches. > > Once this memcg is scanned by kswapd or other regular reclaimers, > > kswapd <<<< It can be either of the regular reclaimers. > > shrink_node_memcgs > > switch (mem_cgroup_protected()) <<<< Not protected > > case MEMCG_PROT_NONE: <<<< Will scan this memcg > > beak; > > shrink_lruvec() <<<< Reclaim the page caches > > shrink_slab() <<<< It may free this inode and drop all its > > page caches(800M). > > So we must protect the inode first if we want to protect page caches. > > > > The inherent mismatch between memcg and inode is a trouble. One inode can > > be shared by different MEMCGs, but it is a very rare case. If an inode is > > shared, its belonging page caches may be charged to different MEMCGs. > > Currently there's no perfect solution to fix this kind of issue, but the > > inode majority-writer ownership switching can help it more or less. > > There's multiple changes in this patch set. Yes, there's some inode > cache futzing to deal with memcgs, but it also adds some weird > undocumented "in low reclaim" heuristic that does something magical > with "protection" that you don't describe or document at all. PLease > separate that out into a new patch and document it clearly (both in > the commit message and the code, please). > Sure, I will separate it and document it in next version. > > diff --git a/fs/inode.c b/fs/inode.c > > index fef457a..4f4b2f3 100644 > > --- a/fs/inode.c > > +++ b/fs/inode.c > > @@ -54,6 +54,13 @@ > > * inode_hash_lock > > */ > > > > +struct inode_head { > > not an "inode head" structure. They are inode isolation arguments. > i.e. struct inode_isolation_args {...}; > Agree with you that inode_isolation_args is better. While I have a different idea, what about using inode_isolation_control ? scan_control : to scan all MEMCGs v shrink_control: to shrink all slabs in one memcg v inode_isolation_control: to shrink one slab (the inode) And in the future we may introduce dentry_isolation_control and some others. Anyway that's a minor issue. > > + struct list_head *freeable; > > +#ifdef CONFIG_MEMCG_KMEM > > + struct mem_cgroup *memcg; > > +#endif > > +}; > > These defines are awful, too, and completely unnecesarily because > the struct shrink_control unconditionally defines sc->memcg and > so we can just code it throughout without caring whether memcgs are > enabled or not. > Sure, will change it in next version. > > + > > static unsigned int i_hash_mask __read_mostly; > > static unsigned int i_hash_shift __read_mostly; > > static struct hlist_head *inode_hashtable __read_mostly; > > @@ -724,8 +731,10 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) > > static enum lru_status inode_lru_isolate(struct list_head *item, > > struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) > > { > > - struct list_head *freeable = arg; > > + struct inode_head *ihead = (struct inode_head *)arg; > > No need for a cast of a void *. > OK. > > + struct list_head *freeable = ihead->freeable; > > struct inode *inode = container_of(item, struct inode, i_lru); > > + struct mem_cgroup *memcg = NULL; > > struct inode_isolation_args *iargs = arg; > struct list_head *freeable = iargs->freeable; > struct mem_cgroup *memcg = iargs->memcg; > struct inode *inode = container_of(item, struct inode, i_lru); > Thanks. That looks better. > > @@ -734,6 +743,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item, > > if (!spin_trylock(&inode->i_lock)) > > return LRU_SKIP; > > > > +#ifdef CONFIG_MEMCG_KMEM > > + memcg = ihead->memcg; > > +#endif > > This is no longer necessary. > Thanks. > > + if (memcg && inode->i_data.nrpages && > > + !(memcg_can_reclaim_inode(memcg, inode))) { > > + spin_unlock(&inode->i_lock); > > + return LRU_ROTATE; > > + } > > I'd argue that both the memcg and the inode->i_data.nrpages check > should be inside memcg_can_reclaim_inode(), that way this entire > chunk of code disappears when CONFIG_MEMCG_KMEM=N. > I will think about it and make the code better when CONFIG_MEMCG_KMEM=N. > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index f36ada9..d1d4175 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -247,6 +247,9 @@ struct mem_cgroup { > > unsigned int tcpmem_active : 1; > > unsigned int tcpmem_pressure : 1; > > > > + /* Soft protection will be ignored if it's true */ > > + unsigned int in_low_reclaim : 1; > > This is the stuff that has nothing to do with the changes to the > inode reclaim shrinker... > In next version I will drop this in_low_reclaim and using the memcg_low_reclaim passed from struct scan_control. Thanks Yafang