Re: [PATCH 7/7] writeback: Avoid iput() from flusher thread

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed 21-03-12 10:50:05, Dave Chinner wrote:
> On Tue, Mar 20, 2012 at 11:56:31PM +0100, Jan Kara wrote:
> > Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
> > because iput() can do a lot of work - for example truncate the inode if it's
> > the last iput on unlinked file. Some filesystems depend on flusher thread
> > progressing (e.g. because they need to flush delay allocated blocks to reduce
> > allocation uncertainty) and so flusher thread doing truncate creates
> > interesting dependencies and possibilities for deadlocks.
> > 
> > We get rid of iput() in flusher thread by using the fact that I_SYNC inode
> > flag effectively pins the inode in memory. So if we take care to either hold
> > i_lock or have I_SYNC set, we can get away without taking inode reference
> > in writeback_sb_inodes().
> > 
> > To make things work we have to move waiting for I_SYNC from end_writeback() to
> > evict() just before calling of ->evict_inode. This is because several
> > filesystems call end_writeback() after they have deleted the inode (btrfs,
> > gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when
> > waiting for I_SYNC because they call end_writeback() from within a transaction.
> > Both were not really a problem previously because flusher thread and
> > ->evict_inode() could not run in parallel but now these two could race.
> > So moving of I_SYNC wait prevents possible races..
> > 
> > As a side effect of these changes, we also fix possible use-after-free in
> > wb_writeback() because inode_wait_for_writeback() call could try to reacquire
> > i_lock on the inode that was already free.
> .....
> 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index d3ebdbe..3869714 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
> >  	BUG_ON(!list_empty(&inode->i_data.private_list));
> >  	BUG_ON(!(inode->i_state & I_FREEING));
> >  	BUG_ON(inode->i_state & I_CLEAR);
> > -	inode_sync_wait(inode);
> >  	/* don't need i_lock here, no concurrent mods to i_state */
> >  	inode->i_state = I_FREEING | I_CLEAR;
> >  }
> > @@ -541,6 +540,18 @@ static void evict(struct inode *inode)
> >  
> >  	inode_sb_list_del(inode);
> >  
> > +	/*
> > +	 * Wait for flusher thread to be done with the inode so that filesystem
> > +	 * does not start destroying it while writeback is still running. Since
> > +	 * the inode has I_FREEING set, flusher thread won't start new work on
> > +	 * the inode.  We just have to wait for running writeback to finish. We
> > +	 * must use i_lock here because flusher thread might be working with
> > +	 * the inode without I_SYNC set but under i_lock.
> > +	 */
> > +	spin_lock(&inode->i_lock);
> > +	inode_wait_for_writeback(inode);
> > +	spin_unlock(&inode->i_lock);
> > +
> 
> Why move this wait from end_writeback() to here?  The whole point
> of end_writeback() is to provide a barrier that guarantees that
> there is no async writeback running when it returns, so it seems
> strange to move the barrier out of the function that is supposed to
> provide the barrier....
  I agree that end_writeback() will be misnamed after this change. The
thing is (as I tried to explain in the changelog) that a lot of filesystems
get it wrong and call end_writeback() from places where
  a) it is too late and writeback could be scribbling over a freed inode
  b) they cannot really handle waiting for writeback to finish
And nobody really noticed because writeback couldn't be racing with
->evict_inode().

It is not easy to fix this e.g. in GFS2 because end_writeback() does
actually two things: It checks that inode is properly teared down (has no
pages etc.) and waits for writeback. And while waiting for writeback should
happen outside of a running transaction, checking of inode has to happen
between truncate_inode_pages() and deleting inode which, in case of GFS2,
has to be inside a transaction.

So what I decided to do is to split off waiting for writeback from checking
the inode and move waiting for writeback before ->evict_inode() is called.
That makes life easier for the filesystems and AFAICS it is a safe thing to
do.

Maybe we could rename end_writeback() to something like clear_inode() to
reflect this change?

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux