Re: [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode

Yafang Shao <laoar.shao@xxxxxxxxx> · Sat, 4 Jan 2020 15:42:21 +0800

On Sat, Jan 4, 2020 at 11:55 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Dec 24, 2019 at 02:53:26AM -0500, Yafang Shao wrote:
> > On my server there're some running MEMCGs protected by memory.{min, low},
> > but I found the usage of these MEMCGs abruptly became very small, which
> > were far less than the protect limit. It confused me and finally I
> > found that was because of inode stealing.
> > Once an inode is freed, all its belonging page caches will be dropped as
> > well, no matter how may page caches it has. So if we intend to protect the
> > page caches in a memcg, we must protect their host (the inode) first.
> > Otherwise the memcg protection can be easily bypassed with freeing inode,
> > especially if there're big files in this memcg.
> >
> > Supposes we have a memcg, and the stat of this memcg is,
> >       memory.current = 1024M
> >       memory.min = 512M
> > And in this memcg there's a inode with 800M page caches.
> > Once this memcg is scanned by kswapd or other regular reclaimers,
> >     kswapd <<<< It can be either of the regular reclaimers.
> >         shrink_node_memcgs
> >           switch (mem_cgroup_protected()) <<<< Not protected
> >               case MEMCG_PROT_NONE:  <<<< Will scan this memcg
> >                       beak;
> >             shrink_lruvec() <<<< Reclaim the page caches
> >             shrink_slab()   <<<< It may free this inode and drop all its
> >                                  page caches(800M).
> > So we must protect the inode first if we want to protect page caches.
> >
> > The inherent mismatch between memcg and inode is a trouble. One inode can
> > be shared by different MEMCGs, but it is a very rare case. If an inode is
> > shared, its belonging page caches may be charged to different MEMCGs.
> > Currently there's no perfect solution to fix this kind of issue, but the
> > inode majority-writer ownership switching can help it more or less.
>
> There's multiple changes in this patch set. Yes, there's some inode
> cache futzing to deal with memcgs, but it also adds some weird
> undocumented "in low reclaim" heuristic that does something magical
> with "protection" that you don't describe or document at all. PLease
> separate that out into a new patch and document it clearly (both in
> the commit message and the code, please).
>

Sure, I will separate it and document it in next version.

> > diff --git a/fs/inode.c b/fs/inode.c
> > index fef457a..4f4b2f3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -54,6 +54,13 @@
> >   *   inode_hash_lock
> >   */
> >
> > +struct inode_head {
>
> not an "inode head" structure. They are inode isolation arguments.
> i.e. struct inode_isolation_args {...};
>

Agree with you that inode_isolation_args is better.
While I have a different idea, what about using inode_isolation_control ?
scan_control : to scan all MEMCGs
        v
shrink_control: to shrink all slabs in one memcg
        v
inode_isolation_control: to shrink one slab (the inode)

And in the future we may introduce dentry_isolation_control and some others.
Anyway that's a minor issue.

> > +     struct list_head *freeable;
> > +#ifdef CONFIG_MEMCG_KMEM
> > +     struct mem_cgroup *memcg;
> > +#endif
> > +};
>
> These defines are awful, too, and completely unnecesarily because
> the struct shrink_control unconditionally defines sc->memcg and
> so we can just code it throughout without caring whether memcgs are
> enabled or not.
>

Sure, will change it in next version.

> > +
> >  static unsigned int i_hash_mask __read_mostly;
> >  static unsigned int i_hash_shift __read_mostly;
> >  static struct hlist_head *inode_hashtable __read_mostly;
> > @@ -724,8 +731,10 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> >  static enum lru_status inode_lru_isolate(struct list_head *item,
> >               struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
> >  {
> > -     struct list_head *freeable = arg;
> > +     struct inode_head *ihead = (struct inode_head *)arg;
>
> No need for a cast of a void *.
>

OK.

> > +     struct list_head *freeable = ihead->freeable;
> >       struct inode    *inode = container_of(item, struct inode, i_lru);
> > +     struct mem_cgroup *memcg = NULL;
>
>         struct inode_isolation_args *iargs = arg;
>         struct list_head *freeable = iargs->freeable;
>         struct mem_cgroup *memcg = iargs->memcg;
>         struct inode    *inode = container_of(item, struct inode, i_lru);
>

Thanks. That looks better.

> > @@ -734,6 +743,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> >       if (!spin_trylock(&inode->i_lock))
> >               return LRU_SKIP;
> >
> > +#ifdef CONFIG_MEMCG_KMEM
> > +     memcg = ihead->memcg;
> > +#endif
>
> This is no longer necessary.
>

Thanks.

> > +     if (memcg && inode->i_data.nrpages &&
> > +         !(memcg_can_reclaim_inode(memcg, inode))) {
> > +             spin_unlock(&inode->i_lock);
> > +             return LRU_ROTATE;
> > +     }
>
> I'd argue that both the memcg and the inode->i_data.nrpages check
> should be inside memcg_can_reclaim_inode(), that way this entire
> chunk of code disappears when CONFIG_MEMCG_KMEM=N.
>

I will think about it and make the code better when CONFIG_MEMCG_KMEM=N.

> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index f36ada9..d1d4175 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -247,6 +247,9 @@ struct mem_cgroup {
> >       unsigned int tcpmem_active : 1;
> >       unsigned int tcpmem_pressure : 1;
> >
> > +     /* Soft protection will be ignored if it's true */
> > +     unsigned int in_low_reclaim : 1;
>
> This is the stuff that has nothing to do with the changes to the
> inode reclaim shrinker...
>

In next version I will drop this in_low_reclaim and using the
memcg_low_reclaim passed from struct scan_control.

Thanks
Yafang