On Thu 22-03-12 17:27:56, Dave Chinner wrote: > On Wed, Mar 21, 2012 at 11:25:50AM +0100, Jan Kara wrote: > > On Wed 21-03-12 10:50:05, Dave Chinner wrote: > > > On Tue, Mar 20, 2012 at 11:56:31PM +0100, Jan Kara wrote: > > > > Doing iput() from flusher thread (writeback_sb_inodes()) can create problems > > > > because iput() can do a lot of work - for example truncate the inode if it's > > > > the last iput on unlinked file. Some filesystems depend on flusher thread > > > > progressing (e.g. because they need to flush delay allocated blocks to reduce > > > > allocation uncertainty) and so flusher thread doing truncate creates > > > > interesting dependencies and possibilities for deadlocks. > > > > > > > > We get rid of iput() in flusher thread by using the fact that I_SYNC inode > > > > flag effectively pins the inode in memory. So if we take care to either hold > > > > i_lock or have I_SYNC set, we can get away without taking inode reference > > > > in writeback_sb_inodes(). > > > > > > > > To make things work we have to move waiting for I_SYNC from end_writeback() to > > > > evict() just before calling of ->evict_inode. This is because several > > > > filesystems call end_writeback() after they have deleted the inode (btrfs, > > > > gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when > > > > waiting for I_SYNC because they call end_writeback() from within a transaction. > > > > Both were not really a problem previously because flusher thread and > > > > ->evict_inode() could not run in parallel but now these two could race. > > > > So moving of I_SYNC wait prevents possible races.. > > > > > > > > As a side effect of these changes, we also fix possible use-after-free in > > > > wb_writeback() because inode_wait_for_writeback() call could try to reacquire > > > > i_lock on the inode that was already free. > > > ..... > > > > > > > diff --git a/fs/inode.c b/fs/inode.c > > > > index d3ebdbe..3869714 100644 > > > > --- a/fs/inode.c > > > > +++ b/fs/inode.c > > > > @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode) > > > > BUG_ON(!list_empty(&inode->i_data.private_list)); > > > > BUG_ON(!(inode->i_state & I_FREEING)); > > > > BUG_ON(inode->i_state & I_CLEAR); > > > > - inode_sync_wait(inode); > > > > /* don't need i_lock here, no concurrent mods to i_state */ > > > > inode->i_state = I_FREEING | I_CLEAR; > > > > } > > > > @@ -541,6 +540,18 @@ static void evict(struct inode *inode) > > > > > > > > inode_sb_list_del(inode); > > > > > > > > + /* > > > > + * Wait for flusher thread to be done with the inode so that filesystem > > > > + * does not start destroying it while writeback is still running. Since > > > > + * the inode has I_FREEING set, flusher thread won't start new work on > > > > + * the inode. We just have to wait for running writeback to finish. We > > > > + * must use i_lock here because flusher thread might be working with > > > > + * the inode without I_SYNC set but under i_lock. > > > > + */ > > > > + spin_lock(&inode->i_lock); > > > > + inode_wait_for_writeback(inode); > > > > + spin_unlock(&inode->i_lock); > > > > + > > > > > > Why move this wait from end_writeback() to here? The whole point > > > of end_writeback() is to provide a barrier that guarantees that > > > there is no async writeback running when it returns, so it seems > > > strange to move the barrier out of the function that is supposed to > > > provide the barrier.... > > I agree that end_writeback() will be misnamed after this change. The > > thing is (as I tried to explain in the changelog) that a lot of filesystems > > get it wrong and call end_writeback() from places where > > a) it is too late and writeback could be scribbling over a freed inode > > b) they cannot really handle waiting for writeback to finish > > And nobody really noticed because writeback couldn't be racing with > > ->evict_inode(). > > > > It is not easy to fix this e.g. in GFS2 because end_writeback() does > > actually two things: It checks that inode is properly teared down (has no > > pages etc.) and waits for writeback. And while waiting for writeback should > > happen outside of a running transaction, checking of inode has to happen > > between truncate_inode_pages() and deleting inode which, in case of GFS2, > > has to be inside a transaction. > > I don't see any problem there. Are you concerned simply because it > is called from the flusher thread? So one of my concern is that if inode_sync_wait() in end_writeback() ever did anything, then lots of filesystems would start to have problems and some (e.g. GFS2 as I wrote above) would be unfixable by just moving end_writeback() elsewhere. IMO it would be much saner to put the writeback barrier in VFS before ->evict_inode() is called. Currently, this is implicitely true because flusher thread holds inode reference and I'd like to formulate it as a rule. > We currently have a sane inode lifecycle model via the reference > counting of all active users of inodes, including flusher threads. > IIUC, your concern is that some filesystems have added dependencies > that mean it is unsafe for them to drop the final reference to an > inode in certain contexts? Isn't that a problem with the filesystem > implementation rather than a flaw in the inode reference counting > model? Well, there are contexts where calling iput_final() is fine, there are contexts where it certainly is not, and I believe flusher threads are somewhat on the boundary. The reason for this whole iput() from flusher thread exercise is that mm guys would like to get a reasonable guarantee of a forward progress of an allocation in a specific node / zone (so that they can get rid of writepage() call from kswapd). That needs targetted cleaning by flusher thread and some guarantee that flusher thread can make progress. It is easier to make such arguments when you know iput_final() won't happen from the flusher thread because then you know only ->writepages() and ->write_inode() can happen from flusher thread. > Indeed, XFS is probably the filesystem that has done transactions in > the context iput_final() for the longest - it has done this from the > initial port to linux (because that's how it worked on Irix). XFS > has always done truncation of extents beyond EOF on inodes that need > it, and truncation and final unlink of files with a zero link count > when the final reference to an inode is dropped. > > Also, XFS now relies on the flusher threads to write all the file > data in the page cache, so if it stalled the flusher thread because > truncation had dependencies on data flushing, then it would have > been broken years ago. XFS avoids dependencies between iput_final > and data flushing by ensuring that it doesn't do transactions until > after waiting for the writeback to end. That is, after > end_writeback() returns in .evict, the inode is guaranteed to be > clean, not visible to writeback, and not dependent on any other > inode (data or metadata) to be truncated. And because of that, we > know that iput_final() will not be stalled in any context - we have > a guarantee of forward progress. Sure I believe XFS is perfect and doesn't have problems with iput_final() from flusher thread ;). However there are also other filesystems... > IMO, ensuring data/metadata dependencies don't get tangled are a > filesystem design issue. Trying to hack around them by special > casing certain types of inode usage in the VFS to move it outside > the reference counting model is not the right solution. I really > think that is a slippery slope - the next deadlock will be solved > with some new lifecycle hack, and soon we'll end up with a lifecycle > model that nobody understands or can debug.... I agree we trade somewhat simpler rules for flusher thread / filesystems for some complexity in inode lifecycle. My opinon is that it's worth it. Honza -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html