On Wed, 10 Dec 2008 08:42:09 +0100 Nick Piggin <npiggin@xxxxxxx> wrote:

> OK, there has not been any further discussion on this approach since I last
> posted it, so I am going to go out on a limb and suggest that we take this
> approach, if any, rather than Mikulas' one.
>
> The advantages of my approach, I think:
> - nothing added to non-sync fastpaths
> - close to the theoretically fewest number of pages will be written/waited
>   for in the fsync case
> - works nicely even in unusual cases (eg. a file with 1GB of dirty data,
>   where the fsync_range only needs to sync a few pages, will not get stuck
>   behind a concurrent dirtier)
>
> And some comments:
> - adds 8 bytes to the radix tree node, but this doesn't really change its
>   fitting into slab or cachelines, so the effective impact is basically
>   zero for this addition.
> - adds an extra lock, but as per the comments, this lock seems to be
>   required in order to fix a bug anyway. And we already tend to hold
>   i_mutex over at least some of the fsync operation. Although if anybody
>   thinks this will be a problem, I'd like to hear it.
>
> Disadvantages:
> - more complex. Although in a way I consider Mikulas' change to have
>   more complex heuristics, which I don't like. I think Mikulas' version
>   would be more complex to analyse at runtime. Also, much of the complexity
>   comes from the extra lock, which as I said fixes a bug.
>
> Any additions or disputes? :)
>
> --
> This patch fixes fsync starvation problems in the presence of concurrent
> dirtiers.
>
> To take an extreme example: if thread A calls fsync on a file with one
> dirty page, at index 1 000 000; at the same time, thread B starts dirtying
> the file from offset 0 onwards.
>
> Thread B perhaps will be allowed to dirty 1 000 pages before hitting its
> dirty threshold, then it will start throttling. Thread A will start writing
> out B's pages. They'll proceed more or less in lockstep until thread B
> finishes writing.
>
> While these semantics are correct, we'd prefer a more timely notification
> that pages dirty at the time fsync was called are safely on disk. In the
> above scenario, we may have to wait until many times the machine's RAM
> capacity has been written to disk before the fsync returns success.
> Ideally, thread A would write the single page at index 1 000 000, then
> return.
>
> This patch introduces a new pagecache tag, PAGECACHE_TAG_FSYNC. Data
> integrity syncs then start by looking through the pagecache for pages which
> are DIRTY and/or WRITEBACK within the requested range, and tagging all
> those as FSYNC.
>
> Subsequent writeout and wait phases then need only look up those pages in
> the pagecache which are tagged with PAGECACHE_TAG_FSYNC.
>
> After the sync operation has completed, the FSYNC tags are removed from the
> radix tree. This design requires exclusive usage of the FSYNC tags for the
> duration of a sync operation, so a lock on the address space is required.
>
> For simplicity, I have removed the "don't wait for writeout if we hit -EIO"
> logic from a couple of places. I don't know if this is really worth the
> added complexity (EIO will still get reported, it will just take a bit
> longer; an app can't rely on specific behaviour or timeliness here).
>
> This lock also solves a real data integrity problem that I only noticed as
> I was writing the livelock avoidance code. If we consider the lock as the
> solution to this bug, this makes the livelock avoidance code much more
> attractive, because then it does not introduce a new lock.
>
> The bug is that fsync errors do not get propagated back up to the caller
> properly in some cases. Consider where we write a page in the writeout
> path, then it encounters an IO error and finishes writeback; in the
> meantime, another process (eg. via sys_sync, or another fsync) clears the
> mapping error bits. Then our fsync will have appeared to finish
> successfully, but actually should have returned an error.
Has *anybody* *ever* complained about this behaviour?  I think maybe one
person after sixish years?  Why fix it?

> ...
>
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -147,6 +147,28 @@ void remove_from_page_cache(struct page
>  	spin_unlock_irq(&mapping->tree_lock);
>  }
>
> +static int sleep_on_fsync(void *word)
> +{
> +	io_schedule();
> +	return 0;
> +}
> +
> +void mapping_fsync_lock(struct address_space *mapping)
> +{
> +	wait_on_bit_lock(&mapping->flags, AS_FSYNC_LOCK, sleep_on_fsync,
> +						TASK_UNINTERRUPTIBLE);
> +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> +}
> +
> +void mapping_fsync_unlock(struct address_space *mapping)
> +{
> +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> +	clear_bit_unlock(AS_FSYNC_LOCK, &mapping->flags);
> +	smp_mb__after_clear_bit();

hm, I wonder why clear_bit_unlock() didn't already do that.  The
clear_bit_unlock() documentation is rather crappy.

> +	wake_up_bit(&mapping->flags, AS_FSYNC_LOCK);
> +}

It wouldn't hurt to document this interface a little bit.

>  static int sync_page(void *word)
>  {
>  	struct address_space *mapping;
> @@ -287,7 +309,64 @@ int wait_on_page_writeback_range(struct
>
>  			/* until radix tree lookup accepts end_index */
>  			if (page->index > end)
> -				continue;
> +				break;
> +
> +			wait_on_page_writeback(page);
> +			if (PageError(page))
> +				ret = -EIO;
> +		}
> +		pagevec_release(&pvec);
> +		cond_resched();
> +	}
> +
> +	/* Check for outstanding write errors */
> +	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
> +		ret = -ENOSPC;
> +	if (test_and_clear_bit(AS_EIO, &mapping->flags))
> +		ret = -EIO;
> +
> +	return ret;
> +}
> +
> +int wait_on_page_writeback_range_fsync(struct address_space *mapping,
> +				pgoff_t start, pgoff_t end)

We already have a wait_on_page_writeback_range().  The reader of your code
will be wondering why this variant exists, and how it differs.  Sigh.
> +{
> +	struct pagevec pvec;
> +	int nr_pages;
> +	int ret = 0;
> +	pgoff_t index;
> +
> +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> +
> +	if (end < start)
> +		goto out;
> +
> +	pagevec_init(&pvec, 0);
> +	index = start;
> +	while ((index <= end) &&
> +		(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> +			PAGECACHE_TAG_FSYNC,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
> +		unsigned i;
> +
> +		spin_lock_irq(&mapping->tree_lock);
> +		for (i = 0; i < nr_pages; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/* until radix tree lookup accepts end_index */
> +			if (page->index > end)
> +				break;
> +
> +			radix_tree_tag_clear(&mapping->page_tree, page->index, PAGECACHE_TAG_FSYNC);

Don't mucky up the nice kernel code please.

> +		}
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		for (i = 0; i < nr_pages; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/* until radix tree lookup accepts end_index */
> +			if (page->index > end)
> +				break;
>
>  			wait_on_page_writeback(page);
>  			if (PageError(page))
> @@ -303,6 +382,7 @@ int wait_on_page_writeback_range(struct
>  	if (test_and_clear_bit(AS_EIO, &mapping->flags))
>  		ret = -EIO;
>
> +out:
>  	return ret;
>  }
>
> @@ -325,18 +405,20 @@ int sync_page_range(struct inode *inode,
>  {
>  	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
>  	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
> -	int ret;
> +	int ret, ret2;
>
>  	if (!mapping_cap_writeback_dirty(mapping) || !count)
>  		return 0;
> +	mutex_lock(&inode->i_mutex);

I am unable to determine why i_mutex is being taken here.

> +	mapping_fsync_lock(mapping);
>  	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
> -	if (ret == 0) {
> -		mutex_lock(&inode->i_mutex);
> +	if (ret == 0)
>  		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> -		mutex_unlock(&inode->i_mutex);
> -	}
> +	mutex_unlock(&inode->i_mutex);

Nor why it was dropped at this particular place.
> +	ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
>  	if (ret == 0)
> -		ret = wait_on_page_writeback_range(mapping, start, end);
> +		ret = ret2;
> +	mapping_fsync_unlock(mapping);
>  	return ret;
>  }
>  EXPORT_SYMBOL(sync_page_range);
> @@ -357,15 +439,18 @@ int sync_page_range_nolock(struct inode
>  {
>  	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
>  	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
> -	int ret;
> +	int ret, ret2;
>
>  	if (!mapping_cap_writeback_dirty(mapping) || !count)
>  		return 0;
> +	mapping_fsync_lock(mapping);
>  	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
>  	if (ret == 0)
>  		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> +	ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
>  	if (ret == 0)
> -		ret = wait_on_page_writeback_range(mapping, start, end);
> +		ret = ret2;
> +	mapping_fsync_unlock(mapping);
>  	return ret;
>  }
>  EXPORT_SYMBOL(sync_page_range_nolock);
> @@ -389,23 +474,30 @@ int filemap_fdatawait(struct address_spa
>  }
>  EXPORT_SYMBOL(filemap_fdatawait);
>
> +int filemap_fdatawait_fsync(struct address_space *mapping)
> +{
> +	loff_t i_size = i_size_read(mapping->host);
> +
> +	if (i_size == 0)
> +		return 0;
> +
> +	return wait_on_page_writeback_range_fsync(mapping, 0,
> +				(i_size - 1) >> PAGE_CACHE_SHIFT);
> +}

filemap_fdatawait() is documented...

>  int filemap_write_and_wait(struct address_space *mapping)
>  {
>  	int err = 0;
>
>  	if (mapping->nrpages) {
> +		int err2;
> +
> +		mapping_fsync_lock(mapping);
>  		err = filemap_fdatawrite(mapping);
> -		/*
> -		 * Even if the above returned error, the pages may be
> -		 * written partially (e.g. -ENOSPC), so we wait for it.
> -		 * But the -EIO is special case, it may indicate the worst
> -		 * thing (e.g. bug) happened, so we avoid waiting for it.
> -		 */
> -		if (err != -EIO) {
> -			int err2 = filemap_fdatawait(mapping);
> -			if (!err)
> -				err = err2;
> -		}
> +		err2 = filemap_fdatawait_fsync(mapping);
> +		if (!err)
> +			err = err2;
> +		mapping_fsync_unlock(mapping);
>  	}
>  	return err;
>  }
> @@ -428,16 +520,16 @@ int filemap_write_and_wait_range(struct
>  	int err = 0;
>
>  	if (mapping->nrpages) {
> -		err = __filemap_fdatawrite_range(mapping, lstart, lend,
> -						 WB_SYNC_ALL);
> -		/* See comment of filemap_write_and_wait() */
> -		if (err != -EIO) {
> -			int err2 = wait_on_page_writeback_range(mapping,
> +		int err2;
> +
> +		mapping_fsync_lock(mapping);
> +		err = filemap_fdatawrite_range(mapping, lstart, lend);
> +		err2 = wait_on_page_writeback_range_fsync(mapping,
>  					lstart >> PAGE_CACHE_SHIFT,
>  					lend >> PAGE_CACHE_SHIFT);
> -			if (!err)
> -				err = err2;
> -		}
> +		if (!err)
> +			err = err2;
> +		mapping_fsync_unlock(mapping);
>  	}
>  	return err;
>  }
>
> ...
>
> --- linux-2.6.orig/mm/page-writeback.c
> +++ linux-2.6/mm/page-writeback.c
> @@ -872,9 +872,11 @@ int write_cache_pages(struct address_spa
>  	pgoff_t index;
>  	pgoff_t end;		/* Inclusive */
>  	pgoff_t done_index;
> +	unsigned int tag = PAGECACHE_TAG_DIRTY;
>  	int cycled;
>  	int range_whole = 0;
>  	long nr_to_write = wbc->nr_to_write;
> +	int sync = wbc->sync_mode != WB_SYNC_NONE;
>
>  	if (wbc->nonblocking && bdi_write_congested(bdi)) {
>  		wbc->encountered_congestion = 1;
> @@ -897,13 +899,40 @@ int write_cache_pages(struct address_spa
>  		range_whole = 1;
>  		cycled = 1; /* ignore range_cyclic tests */
>  	}
> +
> +	if (sync) {
> +		WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));

hm.  Is fsync the only caller of write_cache_pages(!WB_SYNC_NONE)?
Surprised.

> +		/*
> +		 * If any pages are writeback or dirty, mark them fsync now.
> +		 * These are the pages we need to wait in in order to meet our

s/in/on/

> +		 * data integrity contract.
> +		 *
> +		 * Writeback pages need to be tagged, so we'll wait for them
> +		 * at the end of the writeout phase.
> +		 * However, the lookup below
> +		 * could just look up pages which are _DIRTY AND _FSYNC,
> +		 * because we don't care about them for the writeout phase.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (!radix_tree_gang_tag_set_if_tagged(&mapping->page_tree,
> +				index, end,
> +				(1UL << PAGECACHE_TAG_DIRTY) |
> +				(1UL << PAGECACHE_TAG_WRITEBACK),
> +				(1UL << PAGECACHE_TAG_FSYNC))) {

ooh, so that's what that thing does.

> +			/* nothing tagged */
> +			spin_unlock_irq(&mapping->tree_lock);
> +			return 0;

Can we please avoid the deeply-nested-return hand grenade?

> +		}
> +		spin_unlock_irq(&mapping->tree_lock);
> +		tag = PAGECACHE_TAG_FSYNC;
> +	}
> +
>  retry:
>  	done_index = index;
>  	while (!done && (index <= end)) {
>  		int i;
>
>  		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> -			      PAGECACHE_TAG_DIRTY,
> +			      tag,
>  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
>  		if (nr_pages == 0)
>  			break;
> @@ -951,7 +980,7 @@ continue_unlock:
>  			}
>
>  			if (PageWriteback(page)) {
> -				if (wbc->sync_mode != WB_SYNC_NONE)
> +				if (sync)
>  					wait_on_page_writeback(page);
>  				else
>  					goto continue_unlock;
> @@ -981,7 +1010,7 @@ continue_unlock:
>  			}
>  		}
>
> -		if (wbc->sync_mode == WB_SYNC_NONE) {
> +		if (!sync) {
>  			wbc->nr_to_write--;
>  			if (wbc->nr_to_write <= 0) {
>  				done = 1;
> Index: linux-2.6/drivers/usb/gadget/file_storage.c
> ===================================================================
> --- linux-2.6.orig/drivers/usb/gadget/file_storage.c
> +++ linux-2.6/drivers/usb/gadget/file_storage.c
> @@ -1873,13 +1873,15 @@ static int fsync_sub(struct lun *curlun)
>
>  	inode = filp->f_path.dentry->d_inode;
>  	mutex_lock(&inode->i_mutex);
> +	mapping_fsync_lock(mapping);

Dood.  Do `make allmodconfig ; make'

>  	rc = filemap_fdatawrite(inode->i_mapping);
>  	err = filp->f_op->fsync(filp, filp->f_path.dentry, 1);
>  	if (!rc)
>  		rc = err;
> -	err = filemap_fdatawait(inode->i_mapping);
> +	err = filemap_fdatawait_fsync(inode->i_mapping);
>  	if (!rc)
>  		rc = err;
> +	mapping_fsync_unlock(mapping);
>
> ...
I won't apply this because of the build breakage.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html