Re: [PATCH v11] ovl: Improving syncfs efficiency

Chengguang Xu <cgxu519@xxxxxxxxxxxx> · Fri, 17 Apr 2020 09:23:07 +0800

 ---- 在 星期四, 2020-04-16 22:33:49 Jan Kara <jack@xxxxxxx> 撰写 ----
 > On Thu 16-04-20 21:52:27, Chengguang Xu wrote:
 > >  ---- 在 星期四, 2020-04-16 19:14:24 Jan Kara <jack@xxxxxxx> 撰写 ----
 > >  > On Thu 16-04-20 14:08:59, Chengguang Xu wrote:
 > >  > >  ---- 在 星期四, 2020-04-16 03:19:50 Miklos Szeredi <miklos@xxxxxxxxxx> 撰写 ----
 > >  > >  > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@xxxxxxxxxxxx> wrote:
 > >  > >  > > +void ovl_evict_inode(struct inode *inode)
 > >  > >  > > +{
 > >  > >  > > +       struct ovl_inode *oi = OVL_I(inode);
 > >  > >  > > +       struct ovl_write_inode_work ovl_wiw;
 > >  > >  > > +       DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
 > >  > >  > > +       wait_queue_head_t *wqh;
 > >  > >  > > +
 > >  > >  > > +       if (ovl_inode_upper(inode)) {
 > >  > >  > > +               if (current->flags & PF_MEMALLOC) {
 > >  > >  > > +                       spin_lock(&inode->i_lock);
 > >  > >  > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
 > >  > >  > > +                       wqh = bit_waitqueue(&oi->flags,
 > >  > >  > > +                                       OVL_WRITE_INODE_PENDING);
 > >  > >  > > +                       prepare_to_wait(wqh, &wait.wq_entry,
 > >  > >  > > +                                       TASK_UNINTERRUPTIBLE);
 > >  > >  > > +                       spin_unlock(&inode->i_lock);
 > >  > >  > > +
 > >  > >  > > +                       ovl_wiw.inode = inode;
 > >  > >  > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
 > >  > >  > > +                       schedule_work(&ovl_wiw.work);
 > >  > >  > > +
 > >  > >  > > +                       schedule();
 > >  > >  > > +                       finish_wait(wqh, &wait.wq_entry);
 > >  > >  > 
 > >  > >  > What is the reason to do this in another thread if this is a PF_MEMALLOC task?
 > >  > > 
 > >  > > Some underlying filesystems(for example ext4) check the flag in
 > >  > > ->write_inode() and treate it as an abnormal case.(warn and return)
 > >  > > 
 > >  > > ext4_write_inode():
 > >  > >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
 > >  > >                 sb_rdonly(inode->i_sb))
 > >  > >                         return 0;
 > >  > > 
 > >  > > overlayfs inodes are always keeping clean even after wring/modifying
 > >  > > upperfile , so they are right target of kswapd  but in the point of lower
 > >  > > layer, ext4 just thinks kswapd is choosing a wrong dirty inode to reclam
 > >  > > memory.
 > >  > 
 > >  > In ext4, it isn't a big problem if ext4_write_inode() is called from
 > >  > kswapd. But if ext4_write_inode() is called from direct reclaim (which also
 > >  > sets PF_MEMALLOC) we can deadlock because we may wait for transaction
 > >  > commit and transaction commit may require locks (such as page lock or
 > >  > waiting for page writeback to complete) which are held by the task
 > >  > currently in direct reclaim. Your push to workqueue will silence the
 > >  > warning but does not solve the possible deadlock.
 > >  > 
 > >  > I'm actually not sure why you need to writeback the upper inode when
 > >  > reclaiming overlayfs inode. Why not leave it on flush worker on upper fs?
 > >  > 
 > > 
 > > Because it is the last chance we can sync dirty upper inode, I mean after
 > > evicting overlayfs inode we can not find the associated dirty upper inode
 > > from any list and that dirty upper inode will be skipped from the target
 > > of syncfs().
 > 
 > I see. But this flushing of dirty inodes on reclaim really isn't a great
 > idea. It can also stall reclaim (due to it being stuck waiting for IO) and
 > thus lead to bad behavior in low memory situations. It's better to just
 > skip reclaiming such inodes - but then I agree it's a difficult question
 > when to reclaim them. Ideally you'd need to hook into inode_lru_isolate()
 > but that's just too ugly (but maybe it could be done in some clean manner).
 > 

How about prepare another list to organize  this kind of inode, when evicting overlay
inode we grab dirty upper inode reference, put a entry(a field point to upper inode) 
into this new list and retrun, after that periodically checking list entry to release(iput)
clean upper inode.

In overlayfs's syncfs(), we also need to iterate this new list after iterating oi->upper_inodes_list.

Thanks,
cgxu