Re: [PATCH] xfs: improve sync behaviour in face of aggressive dirtying

On Tue, Jun 21, 2011 at 05:29:20AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 21, 2011 at 10:33:43AM +1000, Dave Chinner wrote:
> > > The minor one is that we always flush all work items and not just those
> > > on the filesystem to be flushed.  This might become an issue for larger
> > > systems, or when we apply a similar scheme to fsync, which has the same
> > > underlying issue.
> > 
> > For sync, I don't think it matters if we flush a few extra IO
> > completions on a busy system. For fsync, it could increase fsync
> > completion latency significantly, though I'd suggest we should
> > address that problem when we apply the scheme to fsync.
> 
> > > The major one is that flush_workqueue only flushes work items that were
> > > queued before it was called, but we can requeue completions when we fail
> > > to get the ilock in xfs_setfilesize, which can lead to losing i_size
> > > updates when it happens.
> > 
> > Yes, I can see that will cause problems....
> > 
> > > I see two ways to fix this:  either we implement our own workqueue
> > > look-alike based on the old workqueue code.  This would allow flushing
> > > queues per-sb or even per-inode, and allow us to special case flushing
> > > requeues as well before returning.
> > 
> > No need for a look-alike.  With the CMWQ infrastructure, there is no
> > reason why we need global workqueues anymore. The log, data and
> > convert wqs were global to minimise the number of per-cpu threads
> > XFS required to operate. CMWQ prevents the explosion of mostly idle
> > kernel threads, so we could move all these workqueues to per-struct
> > xfs_mount without undue impact.
> 
> That helps to work around the impact of syncs on other filesystems, but
> it does not help with the EAGAIN requeueing.

Right. I wasn't suggesting that it would. I was just thinking out
loud about why we have the current structure and why the reasons for
it no longer apply. Hence changing the workqueue context is something
we should be able to do with little impact....
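
As a rough illustration of what I mean - field and function names are
made up, not from an actual patch - the mount-time setup would be
little more than:

	/*
	 * Sketch only: hypothetical per-mount completion workqueues.
	 * With CMWQ there's no per-workqueue kernel thread explosion
	 * anymore, so each xfs_mount can own its data and unwritten
	 * conversion completion queues.
	 */
	static int
	xfs_init_mount_workqueues(
		struct xfs_mount	*mp)
	{
		mp->m_data_workqueue = alloc_workqueue("xfs-data",
						WQ_MEM_RECLAIM, 0);
		if (!mp->m_data_workqueue)
			return ENOMEM;

		mp->m_unwritten_workqueue = alloc_workqueue("xfs-conv",
						WQ_MEM_RECLAIM, 0);
		if (!mp->m_unwritten_workqueue) {
			destroy_workqueue(mp->m_data_workqueue);
			return ENOMEM;
		}
		return 0;
	}

IO completion would then queue work to the per-mount queues instead
of the global ones, so a sync of one filesystem never has to wait on
another filesystem's completions.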

> > I don't think we want to go to per-inode work contexts. One
> > possibility is that for fsync related writeback (e.g. WB_SYNC_ALL)
> 
> WB_SYNC_ALL is also used for the second pass of sync.

Sure, but that is effectively the same case as fsync because most of
the dirty pages would already have been written by the WB_SYNC_NONE
pass. Regardless, all an extra fsync-completion queue would mean is
that sync would also need to flush that queue as well.....
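
i.e. the flush on the sync side would just become something like this
(same hypothetical per-mount queues as sketched above, helper name
made up):

	/*
	 * Sketch only: sync flushes all of this mount's IO completion
	 * queues, including a separate fsync-completion queue if we
	 * grow one. Names are illustrative, not from a real patch.
	 */
	static void
	xfs_flush_ioend_workqueues(
		struct xfs_mount	*mp)
	{
		flush_workqueue(mp->m_data_workqueue);
		flush_workqueue(mp->m_unwritten_workqueue);
		flush_workqueue(mp->m_fsync_workqueue);
	}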

> > we could have a separate "fsync-completion" wqs that we queue
> > completions to rather than the more widely used data workqueue. Then
> > for fsync we'd only need to flush the fsync-completion workqueue
> > rather than the mount wide data and convert wqs, and hence we
> > wouldn't stall on IO completion for IO outside of fsync scope...
> 
> One thing I've thought about is abusing current->journal_info to
> complete ioends in-process for fsync.

Oh, now that is devious. I've never considered abusing that field
in the task context before ;)

> That will need Josef's patch
> to move the filemap_write_and_wait call into ->fsync.  At that point
> we can do:
> 
> xfs_fsync() {
> 
> 	current->journal_info = &ioend_end_list;
> 
> 	filemap_write_and_wait();
> 
> 	list_for_each_entry_reverse(ioend_end_list) {
> 		process_ioend();
> 	}
> 
> 	current->journal_info = NULL;
> }
> 
> which means there's no context switch involved, and as an added benefit,
> processing the list in reverse means we minimize the number of i_size
> updates, which will be especially interesting with my changes to always
> log these directly and get rid of the unlogged metadata changes.

All good, except I think there's a small problem with this - we have
to process the ioends before pages will transition from WRITEBACK to
clean. i.e. it is not until xfs_ioend_destroy() that we call the
bh->b_end_io() function to update the page state. Hence it would
have to be:

xfs_fsync() {

	current->journal_info = &ioend_end_list;

	filemap_fdatawrite();

	list_for_each_entry_reverse(ioend_end_list) {
		/* process_ioend also waits for ioend completion */
		process_ioend();
	}

	current->journal_info = NULL;

	filemap_fdatawait();
}
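
FWIW, the hooks to make that work could be pretty small - the
fragments below are purely a sketch with made-up field names,
assuming the ioend records at submission time that it will be
completed in-process:

	/*
	 * Sketch only, field names invented for illustration.
	 *
	 * Submission side: we are still in the fsync() caller's
	 * context inside filemap_fdatawrite(), so we can put the
	 * ioend on the list the caller stashed in
	 * current->journal_info.
	 */
	if (current->journal_info) {
		ioend->io_inprocess = 1;
		list_add_tail(&ioend->io_list,
			      (struct list_head *)current->journal_info);
	}

	/*
	 * Completion side: skip the workqueue for in-process ioends;
	 * the fsync caller waits for the IO and runs the size update
	 * and unwritten conversion itself.
	 */
	if (!ioend->io_inprocess)
		queue_work(xfsdatad_workqueue, &ioend->io_work);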

> It does not cover ioends already pending before we called fsync, but
> I hope we can live with that.

That is not a problem - those pages will be marked WRITEBACK so the
filemap_fdatawait() will catch them. All that will happen is we'll
take a latency hit when there is a concurrent cleaner running.

Direct IO is another matter, but we've already got an
xfs_ioend_wait() in xfs_fsync() to deal with that. Perhaps that
could be moved over to your new DIO counter so we do block on all
pending IO?

> What we still need to sort out with the workqueue scheme is a way to
> deal with the EAGAIN returns from xfs_setfilesize.  From reading your
> changelog for that change it seems we can simply kill it once we move
> to per-mount completion queues, as the loop device would have its own
> workqueue.

Yes, that was the primary issue it was intended to solve. Per-mount
workqueues were rejected at the time because of the thread explosion
they would have caused. That's not a problem now ;)

I'm just trying to remember if there were other problems that the
requeue also solved. I think it was to do with blocking IO
completions when memory is low (i.e. something holding the ilock
but blocking on memory allocation), but that, once again, is solved
with CMWQ allowing multiple outstanding work items to be in
progress concurrently.
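
IOWs, once the completion queues are per-mount, the completion path
should be able to simply block on the ilock. Roughly along these
lines (sketch only, written from memory of the current code, so take
the details with a grain of salt):

	/*
	 * Sketch only: xfs_setfilesize() with the trylock/EAGAIN
	 * requeue replaced by a blocking xfs_ilock(). With per-mount
	 * completion workqueues and CMWQ running multiple work items
	 * concurrently, there's no longer a reason to avoid blocking
	 * here.
	 */
	static void
	xfs_setfilesize(
		xfs_ioend_t		*ioend)
	{
		xfs_inode_t		*ip = XFS_I(ioend->io_inode);
		xfs_fsize_t		isize;

		if (unlikely(ioend->io_error))
			return;

		xfs_ilock(ip, XFS_ILOCK_EXCL);
		isize = xfs_ioend_new_eof(ioend);
		if (isize) {
			ip->i_d.di_size = isize;
			xfs_mark_inode_dirty(ip);
		}
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
	}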

> If that sounds reasonable I'll respin a series to move to
> per-mount workqueues, remove the EAGAIN case, and use the workqueue
> flush in sync.  Fsync will be left for later, and I'll ping Josef to
> resend his fsync prototype change.

Yes, sounds like a plan.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs

