Re: [PATCH 5/5] xfs: log file size updates at I/O completion time

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Thu, 17 Nov 2011 03:03:04 -0500

On Thu, Nov 17, 2011 at 11:29:15AM +1100, Dave Chinner wrote:
> Actually, that is not true: we know that transaction reservation in
> Io completion can hang systems. That's why we specifically shifted
> unwritten extent conversion to it's own workqueue.

> The condition that I saw most frequently was transaction reservation
> blocking and preventing IO from completing and therefore marking
> memory clean and freeable, whilst the unwritten extent conversion
> transaction reservation was blocked because the tail of the log
> could not be moved due to memory allocation for inode buffers to
> write back the inodes pinning the tail of the log failing and hence
> looping endlessly in the buffer cache. If we do transaction
> reservations for every single extending IO completion during Io
> completion, we are effectively making the deadlock possible for most
> extending delayed allocation write IOs (i.e. the majority of write
> IO in a typical system).

That fact has changed:

commit 80641dc66a2d6dfb22af4413227a92b8ab84c7bb
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Oct 19 04:00:03 2009 +0000

    xfs: I/O completion handlers must use NOFS allocations

    When completing I/O requests we must not allow the memory allocator to
    recurse into the filesystem, as we might deadlock on waiting for the
    I/O completion otherwise.  The only thing currently allocating normal
    GFP_KERNEL memory is the allocation of the transaction structure for
    the unwritten extent conversion.  Add a memflags argument to
    _xfs_trans_alloc to allow controlling the allocator behaviour.

So unwritten extent conversion will never recurse back into the
filesystem any more.  If we still care about blocking on other things
(and I've hit this code fairly hard over the last months) we can
increase the per-cpu workqueue concurrency limit.  Currently xfs
limits each of the workqueues to one item, but the default (when
passing 0) would be 512.  With that we could probably even get away
with one per-filesystem workqueue if we want, but I'd really prefer
to get the size logging fix out to users ASAP instead of doing all
the fine-tuning now.

> > In addition to that implementing log reservations from ->writepage that
> > are only released by a different thread requires a lot of work,
> 
> That's trivial to do - there's nothing thread specific about a
> transaction. All you need to do is ensure that everything is logged
> before you pass the struct xfs_trans off to another thread. We
> can do that simply by adding it to the ioend.

The transaction set PF_FSTRANS in current->flags, which also includes
saving and restoring the current flags in (I can't figure out exactly
why we do the later).  Without this we trivially deadlock - I tried
to fix it, but it either didn't quite work or I ran into more deadlocks
through.

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs