On Wed, Mar 21, 2012 at 11:25:50AM +0100, Jan Kara wrote: > On Wed 21-03-12 10:50:05, Dave Chinner wrote: > > On Tue, Mar 20, 2012 at 11:56:31PM +0100, Jan Kara wrote: > > > Doing iput() from flusher thread (writeback_sb_inodes()) can create problems > > > because iput() can do a lot of work - for example truncate the inode if it's > > > the last iput on unlinked file. Some filesystems depend on flusher thread > > > progressing (e.g. because they need to flush delay allocated blocks to reduce > > > allocation uncertainty) and so flusher thread doing truncate creates > > > interesting dependencies and possibilities for deadlocks. > > > > > > We get rid of iput() in flusher thread by using the fact that I_SYNC inode > > > flag effectively pins the inode in memory. So if we take care to either hold > > > i_lock or have I_SYNC set, we can get away without taking inode reference > > > in writeback_sb_inodes(). > > > > > > To make things work we have to move waiting for I_SYNC from end_writeback() to > > > evict() just before calling of ->evict_inode. This is because several > > > filesystems call end_writeback() after they have deleted the inode (btrfs, > > > gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when > > > waiting for I_SYNC because they call end_writeback() from within a transaction. > > > Both were not really a problem previously because flusher thread and > > > ->evict_inode() could not run in parallel but now these two could race. > > > So moving of I_SYNC wait prevents possible races.. > > > > > > As a side effect of these changes, we also fix possible use-after-free in > > > wb_writeback() because inode_wait_for_writeback() call could try to reacquire > > > i_lock on the inode that was already free. > > ..... > > > > > diff --git a/fs/inode.c b/fs/inode.c > > > index d3ebdbe..3869714 100644 > > > --- a/fs/inode.c > > > +++ b/fs/inode.c > > > @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode) > > > BUG_ON(!list_empty(&inode->i_data.private_list)); > > > BUG_ON(!(inode->i_state & I_FREEING)); > > > BUG_ON(inode->i_state & I_CLEAR); > > > - inode_sync_wait(inode); > > > /* don't need i_lock here, no concurrent mods to i_state */ > > > inode->i_state = I_FREEING | I_CLEAR; > > > } > > > @@ -541,6 +540,18 @@ static void evict(struct inode *inode) > > > > > > inode_sb_list_del(inode); > > > > > > + /* > > > + * Wait for flusher thread to be done with the inode so that filesystem > > > + * does not start destroying it while writeback is still running. Since > > > + * the inode has I_FREEING set, flusher thread won't start new work on > > > + * the inode. We just have to wait for running writeback to finish. We > > > + * must use i_lock here because flusher thread might be working with > > > + * the inode without I_SYNC set but under i_lock. > > > + */ > > > + spin_lock(&inode->i_lock); > > > + inode_wait_for_writeback(inode); > > > + spin_unlock(&inode->i_lock); > > > + > > > > Why move this wait from end_writeback() to here? The whole point > > of end_writeback() is to provide a barrier that guarantees that > > there is no async writeback running when it returns, so it seems > > strange to move the barrier out of the function that is supposed to > > provide the barrier.... > I agree that end_writeback() will be misnamed after this change. The > thing is (as I tried to explain in the changelog) that a lot of filesystems > get it wrong and call end_writeback() from places where > a) it is too late and writeback could be scribbling over a freed inode > b) they cannot really handle waiting for writeback to finish > And nobody really noticed because writeback couldn't be racing with > ->evict_inode(). > > It is not easy to fix this e.g. in GFS2 because end_writeback() does > actually two things: It checks that inode is properly teared down (has no > pages etc.) and waits for writeback. And while waiting for writeback should > happen outside of a running transaction, checking of inode has to happen > between truncate_inode_pages() and deleting inode which, in case of GFS2, > has to be inside a transaction. > > So what I decided to do is to split off waiting for writeback from checking > the inode and move waiting for writeback before ->evict_inode() is called. > That makes life easier for the filesystems and AFAICS it is a safe thing to > do. > > Maybe we could rename end_writeback() to something like clear_inode() to > reflect this change? end_writeback() does not imply "sleeping", but you'll at least remove that explicit might_sleep() line inside it ;) Thanks, Fengguang -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html