Re: [PATCH] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 20 Aug 2019 11:04:55 +1000

On Mon, Aug 19, 2019 at 10:28:09AM -0400, Brian Foster wrote:
> On Mon, Aug 19, 2019 at 08:20:17PM +1000, Dave Chinner wrote:
> > On Sat, Aug 17, 2019 at 09:20:06AM -0400, Brian Foster wrote:
> > > On Sat, Aug 17, 2019 at 11:40:23AM +1000, Dave Chinner wrote:
> > > > I like this patch because it means we are starting to reach the
> > > > end-game of this architectural change.  This patch indicates that
> > > > people are starting to understand the end goal of this work: to
> > > > break up big transactions into atomic chains of smaller, simpler
> > > > linked transactions.  And they are doing so without needing to be
> > > > explicitly told "this is how we want complex modifications to be
> > > > done". This is _really good_. :)
> > > > 
> > > > And that leads me to start thinking about the next step after that,
> > > > which I'd always planned it to be, and that is async processing of
> > > > the "atomic multi-transaction operations". That, at the time, was
> > > > based on the observation that we had supercomputers with thousands
> > > > of CPUs banging on the one filesystem and we always had CPUs to
> > > > spare. That's even more true these days: lots of filesytem
> > > > operations still single threaded so we have huge amounts of idle CPU
> > > > to spare. We could be using that to speed up things like rsync,
> > > > tarball extraction, rm -rf, etc.
> > > > 
> > > 
> > > I haven't read back through the links yet, but on a skim the "async"
> > > part of this sounds like a gap in what is described in the sections
> > > referenced above (which sounds more like changing log formats to
> > > something more logical than physical). I'm pretty familiar with all of
> > > the dfops bits to this point, the async bit is what I'm asking about...
> > > 
> > > What exactly are you thinking about making async that isn't already? Are
> > > you talking about separating in-core changes from backend
> > > ordering/logging in general and across the board?
> > 
> > Yup, separating the work we have to do from the process context that
> > needs it to be done.
> > 
> > Think about a buffered write. All we need to do in process context
> > is reserve space and copy the data into the kernel. The rest of it
> > is done asynchornously in the background, and can be expedited by
> > fsync().
> > 
> 
> Yep, makes sense.
> 
> > Basically applying that to create, rename, etc. It's more complex
> > because we have to guarantee ordering of operations, but
> > fundamentally there is nothing stopping us from doing something liek
> > this on create:
> > 
> 
> Right, my next big question was going to be around maintaining/enforcing
> ordering (if necessary) around whatever operations end up async
> deferred. I didn't want to go there because I wasn't really clear on the
> goal.

Stuff like ordering of operations are implementation issues, not
really anything to do with the big picture. Big picture says "user
visible ordering should be maintained", how to achieve that with
async processing is something we'll work out how deal with when we
actually start prototyping async functionality. That's a long way
off yet, so you're not going to get specific answers to "how do we
implement this specific thing" - the best you'll get is "sequence
numbering of some sort"....

> > here's a synchronous create, but with async transaction processing:
> > 
> > 	DEFINE_WAIT(wait);
> > 
> > 	trans alloc
> > 	lock dir inode
> > 	log intent {
> > 		dir = dp
> > 		op = file create
> > 		name = <xfs_name>
> > 		mode = mode
> > 		wait = wait
> > 	}
> > 	xfs_defer_finish(intent, wait)
> > 		-> commits intent
> > 		-> punts rest of work to worker thread
> > 			-> when all is done, will wakeup(wait)
> > 		-> sleeps on wait
> > 	unlock dir
> > 
> > This could eventually become an async create by restructuring it
> > kinda like this:
> > 
> > 	ip = xfs_inode_alloc();
> > 
> > 	<initialise and set up inode, leave XFS_INEW/I_NEW set>
> > 
> > 	grab dir sequence number
> > 	trans alloc
> > 	log intent {
> > 		dir = dp
> > 		seq = dir_seq
> > 		op = file create
> > 		name = <xfs_name>
> > 		mode = mode
> > 		ip = ip
> > 	}
> > 	xfs_defer_finish(intent)
> > 		-> commits intent
> > 		-> punts rest of creation work to worker thread
> > 			when complete, will clear XFS_INEW/I_NEW
> > 
> > 	return instantiated inode to caller
> > 
> 
> So for this example, is this async to userspace or to some intermediate
> level in XFS to facilitate creation batching or some such?

What needs to be done async is determined by the initial intent
setup and then xfs_defer_finish() will run what needs to be done
sync to keep userspace happy and defer everything else.  e.g. we
might need to run dirent mods sync so lookup/readdir behave
appropriately, but then everything else can be deferred to an async
context. 

Don't ask for specifics, because there aren't any. General answers
is all you are going to get at this point because that's all the
high level architecture provides. The details of any specific answer
will change as we slowly change the code to make use of new in
memory tracking structures and fine-grained intent based operations.

> It sounds
> kind of like a delayed inode allocation mechanism, though that might be
> a challenge to expose to userspace if we don't do enough to at least
> acquire a physical inode number.

That's not a new problem - the VFS already solves that problem
on lookup with the I_NEW flag. i.e. new references will block until
the inode is fully instantiated and the I_NEW flag is cleared. We
also have XFS_INEW to do the same thing internally in XFS. IOWs, we
already have infrastructure to block async lookups if we need
information that is only instantiately during allocation....

> > Anyone one who looks this inode up after creation will block
> > on XFS_INEW/I_NEW flag bits. The caller that created the inode
> > will be able to operate on it straight away....
> > 
> > SO converting to async processing is really requires several steps.
> > 
> > 	1. convert everything to intent logging and defer
> > 	   operations
> 
> The approach makes sense, this all sounds like mostly mechanism.

I wasn't detailing the high level design - I was describing the
process we'd need to undertake to get to async processing to
indicate how far off it actually is. :)

> What's
> still lacking is a high level design description to justify large scale
> developmental decisions like "converting everything to intent logging."
> Particularly since whatever new intents we create will need to be
> supported indefinitely once they're released in the wild. Note that I'm
> sure you're working/thinking in design -> implement mode and can see
> some of these decisions more clearly, but the rest of us shouldn't have
> to piece together the design story from implementation details. ;)

What justification is needed? We went through all that years ago -
I've got historical records from XFS meetings in 2004 that document
all these problems and many of the potential solutions. e.g. delayed
logging is one of the things brought up in those documents. IOWs,
we've proven time and time again that the limitation of
metadata performance in XFS is physical logging, and using
fine-grained intent logging has long been seen as a solution to the
problem for almost as long.

That is reflected in the title of the document I pointed you at:
"Improving metadata performance by reducing journal overhead". That
whole section of the document at is based around one thing -
improving journalling efficiency by reducing the journalling
overhead.

The biggest issue we have is physical journalling is expensive in
terms of CPU, memory and IO bandwidth. Delayed logging knocked IO
bandwidth down by a factor of 10x, but it came at the cost of
substantial additional CPU and more than doubled the memory
footprint of the transaction subsystem. That's because it added
another layer of buffering (2 layers if you count shadow buffers)
to aggregate all the physical changes that have been made to
objects.

You can see how this affects performance by comparing 4kB block size
performance with 64kB block size performance. 4kB is far faster
because physical logging results in a 10x increase log IO bandwidth
for the same operations, a substantatial increase in memory
footprint and a huge increase in CPU usage as a result of larger
contiguous allocations and memcpy()s when logging changes. So
despite the fact we do less IOPS, the added overhead of everything
means 64kB block size only acheives 30-50% of the performance of 4kB
block size filesystems under heavy metadata modification workloads.

IOWs, the performance of the filesystem is still largely limited by
the physical logging of metadata in the journalling subsystem, and
the only way around that is to avoid physical logging of objects.

That's the architectural problem that intent logging solves. We move
away from the CPU, memory and bandwidth requirements of physical
logging to create atomic transactions and replace it with small,
logical changes that are chained together to create larger atomic
operations.

I mentioned icreate earlier as an example - perhaps adding numbers
to this will make it clear the gains we can realise via intent
logging. Creating an inode chunk on a v5 filesystem with physical
logging requires 64 inode core regions to be physically logged.
In terms of log item size for a single inode chunk:

	size = 64 * (2 * 128 + sizeof(ophdr))
	     = ~20kB per inode chunk

This requires a 20kB allocation, a heap of CPU processing to walk
the format bitmap multiple times, and then 20kB of log space to
write it.

Now do 100,000 inode creates/s - that's 15,000 inode clusters, so we
are talking 15,000 * 20kB = 300MB/s log bandwidth for the physical
inode clusters alone. And delayed logging can't mitigate this
because we don't relog inode buffers. So the icreate item replaced
this 20kB of physical log bandwidth with about 50 bytes of log
bandwidth and almost no CPU or log item memory overhead on during
transaction commit.

On my 32p test machine, I'm seeing inode creates rates of over
500,000/s, and that's using maybe 200MB/s of log bandwidthi with
icreate. Without the icreate transaction, we'd need 1.5-2GB/s of log
bandwidth to get this sort of performance.

It should be clear from this example just how much overhead we can
avoid by logging intents rather than physical changes. This is not
speculation, intent logging benefits are well proven theory....

> For example, even with the examples discussed here it's not clear to me
> we need to define an intent for unlinked list removal. We've already
> discussed methods to make removal further asynchronous than it already
> is at a higher level that doesn't depend on asynchronous transaction
> processing at all. Even if unlinked removal did end up deferred in
> certain use cases, that should be dictated by redesign of particular
> higher level operations (which may or may not require that granular of
> an intent).

The part of inode unlink Darrick is working on making async is the
part when userspace has dropped it's final reference to the unlinked
inode. i.e. after userspace is done with it. The unlinked inode is
in reclaim at that point, and we can do what we want with it. i.e.
it's an inode reclaim optimisation, not something that userspace
will ever see or notice.

That is, the actual part of unlinking the inode from the directory
and adding it to the AGI unlinked list is still synchonous, and
still blocks userspace. We want that part async, too, so we can get
all the entries removed from directories and the directory unlinked
before we start processing any of the inode unlinks...

And, FWIW, AGI unlinked list addition/removal should really be made
an intent rather than physically logging the iunlink field in the
inode buffer.

> > e.g. rm -rf becomes "load all the inodes into memory as we log
> > dirent removal, when the dir unlink is logged, truncate the dir
> > inode they are all gone. Sort all the inodes into same cluster/chunk
> > groups, free all the inodes in a single inobt/finobt record
> > update...."
> > 
> 
> This sounds like a great example to run through a design description for
> or (even better) an RFC. It clearly depends on some mechanical things
> like conversion of some operations to use intents and enhancements to
> deferred processing to provide an async execution context. It also may
> depend on some less mechanical things like ordering rules and a batching
> detection heuristic, but things like that can be easily fudged for an
> RFC.

Sure, that will all come in time. You're asking about things that
are still a few years away from being realised. There's still a lot
of work to do before we get anywhere near the "lets start making
things async and batching" point in time.

> It also presents an opportunity to demonstrate value. While this all
> sounds like a nice approach in theory, there are still interesting
> questions like how much this improves performance over the aggregation
> provided by delayed logging, what object locking looks like across the
> new separation between process transaction processing and deferred/async
> transaction processing, whether intent granularity needs to
> fundamentally change from the approach we've taken so far (that hasn't
> been with consideration for async processing, etc).

Intent based logging is easily an order of magnitude, maybe two
orders of magnitude more efficient than physical logging on for 64kB
block size filesystems. You can see from above the gains that are
realised, and there is plenty of other filesystems and research out
there that demonstrate the efficiency gains associated with logical
intent logging.  IOWs, the path we are already on (moving to intent
based logging) is well grounded in both theory and reality, and so
it really does not need any further justification.

The async processing side of things is the next step beyond - saying
that "we've got a lot of things to consider here" is stating the
bleeding obvious. It's the big picture goal I've had in mind for a
long time, but there's no point in getting into any sort of detailed
design while we are still several steps away from having a platform
we can start actually prototyping functionality on. 

Hence I haven't spend any time on trying to document it because we
haven't set all the foundations it needs in concrete yet. That we
are starting to talk about it is indicative that we're getting close
to needing a new set of "Ideas for XFS" that lay out this stuff in a
more concrete fashion, but we aren't quite there yet....

And, realistically, we have to be aware that async processing by
itself may not realise any gains at all. The gains I see being
significant come from batching multiple async modifications into
single operations, not from a direct conversion to async processing
But we can't do batching if we don't have infrastructure that allows
large numbers of operations to be performed in an asynchronous
manner.

That's where comparisons with delayed logging falls down. Delayed
logging is possible because it sits behind an interface that allows
async processing of operations. i.e. The CIL is the batch processing
mechanism built underneath an async interface, it's not an async
mechanism itself. We've got to add the equivalent of the async
transaction commit interface before we can even think about how
batching operations could be done. Async first, batching later.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx