On Fri 17-04-20 09:23:07, Chengguang Xu wrote:
> ---- On Thu, 2020-04-16 22:33:49 Jan Kara <jack@xxxxxxx> wrote ----
> > On Thu 16-04-20 21:52:27, Chengguang Xu wrote:
> > > ---- On Thu, 2020-04-16 19:14:24 Jan Kara <jack@xxxxxxx> wrote ----
> > > > On Thu 16-04-20 14:08:59, Chengguang Xu wrote:
> > > > > ---- On Thu, 2020-04-16 03:19:50 Miklos Szeredi <miklos@xxxxxxxxxx> wrote ----
> > > > > > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@xxxxxxxxxxxx> wrote:
> > > > > > > +void ovl_evict_inode(struct inode *inode)
> > > > > > > +{
> > > > > > > +	struct ovl_inode *oi = OVL_I(inode);
> > > > > > > +	struct ovl_write_inode_work ovl_wiw;
> > > > > > > +	DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
> > > > > > > +	wait_queue_head_t *wqh;
> > > > > > > +
> > > > > > > +	if (ovl_inode_upper(inode)) {
> > > > > > > +		if (current->flags & PF_MEMALLOC) {
> > > > > > > +			spin_lock(&inode->i_lock);
> > > > > > > +			ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
> > > > > > > +			wqh = bit_waitqueue(&oi->flags,
> > > > > > > +					    OVL_WRITE_INODE_PENDING);
> > > > > > > +			prepare_to_wait(wqh, &wait.wq_entry,
> > > > > > > +					TASK_UNINTERRUPTIBLE);
> > > > > > > +			spin_unlock(&inode->i_lock);
> > > > > > > +
> > > > > > > +			ovl_wiw.inode = inode;
> > > > > > > +			INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
> > > > > > > +			schedule_work(&ovl_wiw.work);
> > > > > > > +
> > > > > > > +			schedule();
> > > > > > > +			finish_wait(wqh, &wait.wq_entry);
> > > > > >
> > > > > > What is the reason to do this in another thread if this is a
> > > > > > PF_MEMALLOC task?
> > > > >
> > > > > Some underlying filesystems (for example ext4) check the flag in
> > > > > ->write_inode() and treat it as an abnormal case (warn and return):
> > > > >
> > > > > ext4_write_inode():
> > > > >	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
> > > > >	    sb_rdonly(inode->i_sb))
> > > > >		return 0;
> > > > >
> > > > > overlayfs inodes always stay clean even after writing/modifying the
> > > > > upper file, so they are a natural target for kswapd, but from the
> > > > > lower layer's point of view, ext4 just thinks kswapd is choosing a
> > > > > wrong dirty inode from which to reclaim memory.
> > > >
> > > > In ext4, it isn't a big problem if ext4_write_inode() is called from
> > > > kswapd. But if ext4_write_inode() is called from direct reclaim (which
> > > > also sets PF_MEMALLOC) we can deadlock, because we may wait for a
> > > > transaction commit, and the transaction commit may require locks (such
> > > > as a page lock, or waiting for page writeback to complete) which are
> > > > held by the task currently in direct reclaim. Your push to a workqueue
> > > > will silence the warning but does not solve the possible deadlock.
> > > >
> > > > I'm actually not sure why you need to write back the upper inode when
> > > > reclaiming the overlayfs inode. Why not leave it to the flush worker on
> > > > the upper fs?
> > >
> > > Because it is the last chance we can sync the dirty upper inode. I mean,
> > > after evicting the overlayfs inode we cannot find the associated dirty
> > > upper inode from any list, and that dirty upper inode will be skipped by
> > > syncfs().
> >
> > I see. But this flushing of dirty inodes on reclaim really isn't a great
> > idea. It can also stall reclaim (due to it being stuck waiting for IO) and
> > thus lead to bad behavior in low memory situations. It's better to just
> > skip reclaiming such inodes - but then I agree it's a difficult question
> > when to reclaim them.
> > Ideally you'd need to hook into inode_lru_isolate()
> > but that's just too ugly (but maybe it could be done in some clean manner).
>
> How about preparing another list to organize this kind of inode: when
> evicting the overlay inode we grab a reference to the dirty upper inode,
> put an entry (a field pointing to the upper inode) into this new list and
> return; after that, we periodically check the list entries to release
> (iput) clean upper inodes.
>
> In overlayfs's syncfs(), we also need to iterate this new list after
> iterating oi->upper_inodes_list.

Sure, that would work.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
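[Editor's note: the proposed bookkeeping above (grab a reference at evict time, queue the upper inode on a new list, periodically iput the ones that have become clean) can be sketched outside the kernel. The following is a minimal userspace C analog, not the actual patch: `upper_inode`, `queue_dirty_upper`, `scan_upper_dirty_list`, and `iput_sim` are hypothetical names, the refcount stands in for igrab()/iput(), and no locking is shown.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the upper inode; in the kernel this would be a
 * struct inode pinned with igrab() when the overlay inode is evicted. */
struct upper_inode {
	int refcount;
	bool dirty;
};

/* One entry on the proposed "dirty upper inodes" list. */
struct upper_entry {
	struct upper_inode *inode;
	struct upper_entry *next;
};

static struct upper_entry *upper_dirty_list;

/* iput() analog: drop the reference taken when the entry was queued. */
static void iput_sim(struct upper_inode *inode)
{
	if (--inode->refcount == 0)
		free(inode);
}

/* Called from evict: take a reference and queue the upper inode so a
 * later syncfs()/periodic scan can still reach it after the overlay
 * inode itself is gone. */
static void queue_dirty_upper(struct upper_inode *inode)
{
	struct upper_entry *e = malloc(sizeof(*e));

	inode->refcount++;	/* igrab() analog */
	e->inode = inode;
	e->next = upper_dirty_list;
	upper_dirty_list = e;
}

/* Periodic scan: release (iput) upper inodes that have become clean,
 * keep the still-dirty ones queued. */
static void scan_upper_dirty_list(void)
{
	struct upper_entry **pp = &upper_dirty_list;

	while (*pp) {
		struct upper_entry *e = *pp;

		if (!e->inode->dirty) {
			*pp = e->next;	/* unlink clean inode's entry */
			iput_sim(e->inode);
			free(e);
		} else {
			pp = &e->next;
		}
	}
}
```

A syncfs() implementation would then walk both oi->upper_inodes_list and this list, which is exactly the extra iteration mentioned above.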