On Wed, Dec 18, 2019 at 10:21 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Dec 17, 2019 at 06:29:19AM -0500, Yafang Shao wrote:
> > On my server there are some running memcgs protected by
> > memory.{min, low}, but I found the usage of these memcgs abruptly
> > became very small, far below the protection limit. It confused me,
> > and finally I found the cause was inode stealing. Once an inode is
> > freed, all the page cache pages belonging to it are dropped as
> > well, no matter how many pages it has. So if we intend to protect
> > the page cache in a memcg, we must protect its host (the inode)
> > first. Otherwise the memcg protection can be easily bypassed by
> > freeing the inode, especially if there are big files in this memcg.
> >
> > The inherent mismatch between memcg and inode is troublesome. One
> > inode can be shared by different memcgs, but that is a very rare
> > case. If an inode is shared, the page cache pages belonging to it
> > may be charged to different memcgs. Currently there is no perfect
> > solution for this kind of issue, but the inode majority-writer
> > ownership switching can help more or less.
> >
> > Cc: Roman Gushchin <guro@xxxxxx>
> > Cc: Chris Down <chris@xxxxxxxxxxxxxx>
> > Cc: Dave Chinner <dchinner@xxxxxxxxxx>
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > ---
> >  fs/inode.c                 |  9 +++++++++
> >  include/linux/memcontrol.h | 15 +++++++++++++++
> >  mm/memcontrol.c            | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> >  mm/vmscan.c                |  4 ++++
> >  4 files changed, 74 insertions(+)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index fef457a..b022447 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> >  	if (!spin_trylock(&inode->i_lock))
> >  		return LRU_SKIP;
> >
> > +
> > +	/* Page protection only works in reclaimer */
> > +	if (inode->i_data.nrpages && current->reclaim_state) {
> > +		if (mem_cgroup_inode_protected(inode)) {
> > +			spin_unlock(&inode->i_lock);
> > +			return LRU_ROTATE;
>
> Urk, so after having plumbed the memcg all the way down to the
> list_lru walk code so that we only walk inodes in that memcg, we now
> have to do a lookup from the inode back to the owner memcg to
> determine if we should reclaim it? IOWs, I think the layering here
> is all wrong - if memcg info is needed in the shrinker, it should
> come from the shrink_control->memcg pointer, not be looked up from
> the object being isolated...
>

Agreed, the layering here is not good. I had tried to use the
shrink_control->memcg pointer as an argument or something similar,
but I found that would change a lot of code. I didn't want to change
too much code, so I implemented it this way, although it looks a
little strange.

> i.e. this code should read something like this:
>
> 	if (memcg && inode->i_data.nrpages &&
> 	    !memcg_can_reclaim_inode(memcg, inode)) {
> 		spin_unlock(&inode->i_lock);
> 		return LRU_ROTATE;
> 	}
>
> This code does not need comments because it is obvious what it does,
> and it provides a generic hook into inode reclaim for the memcg code
> to decide whether the shrinker should reclaim the inode or not.
>
> This is how the memcg code should interact with other shrinkers, too
> (e.g. the dentry cache isolation function), so you need to look at
> how to make the memcg visible to the lru walker isolation
> functions....
>

Thanks for your suggestion. I will rethink it along these lines.
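
To make sure I understand you correctly, here is a rough sketch of
how I could make the memcg visible to the isolate function: bundle it
into the opaque argument that prune_icache_sb() already passes down
to the LRU walker. memcg_can_reclaim_inode() is just the placeholder
name from your example, not an existing function, and this is only a
sketch, not tested code:

	/*
	 * Hypothetical plumbing: hand the memcg that the LRU walk is
	 * already scoped to down to the isolate callback, so it never
	 * has to look the memcg up from the inode being isolated.
	 */
	struct inode_isolate_arg {
		struct list_head	*freeable;
		struct mem_cgroup	*memcg;	/* from shrink_control->memcg */
	};

	static enum lru_status inode_lru_isolate(struct list_head *item,
			struct list_lru_one *lru, spinlock_t *lru_lock,
			void *arg)
	{
		struct inode_isolate_arg *iarg = arg;
		struct inode *inode = container_of(item, struct inode, i_lru);

		if (!spin_trylock(&inode->i_lock))
			return LRU_SKIP;

		/* the generic hook you describe above */
		if (iarg->memcg && inode->i_data.nrpages &&
		    !memcg_can_reclaim_inode(iarg->memcg, inode)) {
			spin_unlock(&inode->i_lock);
			return LRU_ROTATE;
		}

		/*
		 * ... rest of the existing isolation logic, using
		 * iarg->freeable where it used arg before ...
		 */
	}

	long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
	{
		LIST_HEAD(freeable);
		struct inode_isolate_arg iarg = {
			.freeable	= &freeable,
			.memcg		= sc->memcg, /* NULL for global reclaim */
		};
		long freed;

		freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
					     inode_lru_isolate, &iarg);
		dispose_list(&freeable);
		return freed;
	}

The same wrapper-struct pattern should extend naturally to the dentry
cache isolation function you mentioned.

Thanks
Yafang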