Re: [patch 6/6] mm: fsync livelock avoidance

Nick Piggin <npiggin@xxxxxxx> · Thu, 11 Dec 2008 23:32:13 +0100

On Thu, Dec 11, 2008 at 01:51:11PM -0800, Andrew Morton wrote:
> On Wed, 10 Dec 2008 08:42:09 +0100
> Nick Piggin <npiggin@xxxxxxx> wrote:
> > 
> > This lock also solves a real data integrity problem that I only noticed as
> > I was writing the livelock avoidance code. If we consider the lock as the
> > solution to this bug, this makes the livelock avoidance code much more
> > attractive because then it does not introduce the new lock.
> > 
> > The bug is that fsync errors do not get propogated back up to the caller
> > properly in some cases. Consider where we write a page in the writeout path,
> > then it encounters an IO error and finishes writeback, in the meantime, another
> > process (eg. via sys_sync, or another fsync) clears the mapping error bits.
> > Then our fsync will have appeared to finish successfully, but actually should
> > have returned error.
> 
> Has *anybody* *ever* complained about this behaviour?  I think maybe
> one person after sixish years?

The livelock behaviour? (or the error propagation).

I first heard about it from Mikulas, where some dm tool locks up because
it does direct IO on the block device of mounted filesystem (or something
like that). That case is actually mostly solved by my first ptach in the
series. 

> Why fix it?

Good question. My earlier patches already in your tree removed some starvation
avoidance code because they were breaking data integrity semantics. So in
theory, your tree today is more susceptible to this sync/fsync starvation
than mainline. I care most about the correctness, and it would be great if
nobody cares about this starvation problem so we don't need the extra
complexity.

Thanks for the review.

> > +void mapping_fsync_lock(struct address_space *mapping)
> > +{
> > +	wait_on_bit_lock(&mapping->flags, AS_FSYNC_LOCK, sleep_on_fsync,
> > +							TASK_UNINTERRUPTIBLE);
> > +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> > +}
> > +
> > +void mapping_fsync_unlock(struct address_space *mapping)
> > +{
> > +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> > +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> > +	clear_bit_unlock(AS_FSYNC_LOCK, &mapping->flags);
> > +	smp_mb__after_clear_bit();
> 
> hm, I wonder why clear_bit_unlock() didn't already do that.

Strictly unlock semantics. So it has the mb before the clear bit,
but none after.

> The clear_bit_unlock() documentation is rather crappy.
> 
> > +	wake_up_bit(&mapping->flags, AS_FSYNC_LOCK);
> > +}
> 
> It wouldn't hurt to document this interface a little bit.

True.

> > +int wait_on_page_writeback_range_fsync(struct address_space *mapping,
> > +				pgoff_t start, pgoff_t end)
> 
> We already have a wait_on_page_writeback_range().  The reader of your
> code will be wondering why this variant exists, and how it differs. 
> Sigh.

Ah OK, wait_on_page_writeback_range_fsync can be used when the caller
has set up the fsync tags and holds the mapping bit lock. So unconverted
filesystems hopefully can use the old code without blowing up (they just
would be more prone to starvation).

> >  	if (!mapping_cap_writeback_dirty(mapping) || !count)
> >  		return 0;
> > +	mutex_lock(&inode->i_mutex);
> 
> I am unable to determine why i_mutex is being taken here.

Oh, good point (which I probably didn't mention). mapping_fsync_lock
nests inside i_mutex. Will add that to documentation and lock order
graphs.

> > @@ -897,13 +899,40 @@ int write_cache_pages(struct address_spa
> >  			range_whole = 1;
> >  		cycled = 1; /* ignore range_cyclic tests */
> >  	}
> > +
> > +	if (sync) {
> > +		WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> 
> hm.  Is fsync the only caller of write_cache_pages(!WB_SYNC_NONE)? 
> Surprised.

No some filesystems and things also call eg. filemap_write_and_wait
which should come here too. But they also need to take the lock.

> > +		if (!radix_tree_gang_tag_set_if_tagged(&mapping->page_tree,
> > +							index, end,
> > +				(1UL << PAGECACHE_TAG_DIRTY) |
> > +				(1UL << PAGECACHE_TAG_WRITEBACK),
> > +				(1UL << PAGECACHE_TAG_FSYNC))) {
> 
> ooh, so that's what that thing does.

Maybe an example is a better documentation than my waffling.

> > +			/* nothing tagged */
> > +			spin_unlock_irq(&mapping->tree_lock);
> > +			return 0;
> 
> Can we please avoid the deeply-nested-return hand grenade?

Hmm, we could 

   goto out;
...
out:
 return ret;

But is that less hand grenadie than the plain return?

> > ===================================================================
> > --- linux-2.6.orig/drivers/usb/gadget/file_storage.c
> > +++ linux-2.6/drivers/usb/gadget/file_storage.c
> > @@ -1873,13 +1873,15 @@ static int fsync_sub(struct lun *curlun)
> >  
> >  	inode = filp->f_path.dentry->d_inode;
> >  	mutex_lock(&inode->i_mutex);
> > +	mapping_fsync_lock(mapping);
> 
> Dood. Do `make allmodconfig ; make'

OK.

> >  	rc = filemap_fdatawrite(inode->i_mapping);
> >  	err = filp->f_op->fsync(filp, filp->f_path.dentry, 1);
> >  	if (!rc)
> >  		rc = err;
> > -	err = filemap_fdatawait(inode->i_mapping);
> > +	err = filemap_fdatawait_fsync(inode->i_mapping);
> >  	if (!rc)
> >  		rc = err;
> > +	mapping_fsync_unlock(mapping);
> >
> > ...
> >
> 
> 
> I won't apply this because of the build breakage.

OK.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html