Re: [PATCH] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag

Dave Chinner <david@xxxxxxxxxxxxx> · Sat, 17 Aug 2019 11:40:23 +1000

On Fri, Aug 16, 2019 at 10:53:10AM -0400, Brian Foster wrote:
> On Fri, Aug 16, 2019 at 04:09:39PM +0800, kaixuxia wrote:
> > 
> > 
> > On 2019/8/16 7:36, Dave Chinner wrote:
> > > On Tue, Aug 13, 2019 at 07:17:33PM +0800, kaixuxia wrote:
> > > > In this patch we make the unlinked list removal a deferred operation,
> > > > i.e. log an iunlink remove intent and then do it after the RENAME_WHITEOUT
> > > > transaction has committed, and the iunlink remove intention and done
> > > > log items are provided.
> > > 
> > > I really like the idea of doing this, not just for the inode unlink
> > > list removal, but for all the high level complex metadata
> > > modifications such as create, unlink, etc.
> > > 
> > > The reason I like this is that it moves us closer to being able to
> > > do operations almost completely asynchronously once the first intent
> > > has been logged.
> > > 
> > 
> > Thanks a lot for your comments.
> > Yeah, sometimes the complex metadata modifications correspond to the
> > long and complex transactions that hold more locks or other common
> > resources, so the deferred options may be better choices than just
> > changing the order in one transaction.
> > 
> 
> I can't speak for Dave (who can of course chime in again..) or others,
> but I don't think he's saying that this approach is preferred to the
> various alternative approaches discussed in the other subthread. Note
> that he also replied there with another potential solution that doesn't
> involve deferred operations.

Right, two separate things. One: fixing the bug doesn't require
deferred operations.

Two: async deferred operations is the direction we've been heading
in for a long, long time.

> Rather, I think he's viewing this in a much longer term context around
> changing more of the filesystem to be async in architecture.

Right, in terms of longer term context.

> Personally,
> I'd have a ton more questions around the context of what something like
> that looks like before I'd support starting to switch over less complex
> operations to be deferred operations based on the current dfops
> mechanism.
>
> The mechanism works and solves real problems, but it also has
> tradeoffs that IMO warrant the current model of selective use. Further,
> it's nearly impossible to determine what other fundamental
> incompatibilities might exist without context on bigger picture design.

The "bigger picture" takes up a lot of space in my head, and it has
for a long time. However, here you are worrying about implementation
details around the dfops mechanisms - that's not big picture
thinking.

Big picture thinking is about how all the pieces fit together, not
how a specific piece of the picture is implemented. The design and
implementation of the dfops mechanism is going to change over time,
but the architectural function it performs will not change.

The architectural problem the "intent and deferral" mechanism solves
is that XFS's original "huge complex transaction" model broke down
when we started trying to add more functionality to each individual
transaction. This first came to light in the late 90s, when HSMs
required attributes and attributes could not be added atomically in
creation operations. So all sorts of problems occurred on crashes,
which meant HSMs had to scan filesystems after a crash to find files
with inconsistent attributes (hence bulkstat!). The problem still
exists today with security attributes, default acls, etc.

And then we started wanting to add parent pointers. Which require
atomic manipulation of attributes in directory modification
transactions. Oh dear.  And then came the desire to add rmap, which
needed its own atomic additions to every transaction in the
filesystem that allocated or freed space. And then reflink, with
its own requirements.

The transaction model basically broke down - it couldn't do what we
needed.  You can see some of the ideas I had more than 10 years ago
about how we'd need to morph XFS to support the flexibility in
transactional modifications we needed here:

http://xfs.org/index.php/Improving_Metadata_Performance_By_Reducing_Journal_Overhead#Operation_Based_Logging
http://xfs.org/index.php/Improving_Metadata_Performance_By_Reducing_Journal_Overhead#Atomic_Multi-Transaction_Operations

The "operation based logging" mechanism is essentially how we are
using intents in deferred operations. Another example is the icreate
item, which just logs the location of the inode chunk we need
to initialise, rather than logging the physical initialisation
directly.
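
To make the distinction concrete, here is a minimal toy model in
Python, not XFS code. PhysicalInit, LogicalInit, init_inode and the
size constants are all invented for illustration; the real icreate
item and on-disk inode format differ. The sketch only shows that a
record describing where to initialise can replay to the same result
as a record carrying every initialised byte, while logging far less:

```python
from dataclasses import dataclass

INODE_SIZE = 512        # assumed inode size for this toy model
INODES_PER_CHUNK = 64   # assumed number of inodes per chunk


def init_inode(generation: int) -> bytes:
    """Build the initial image of one free inode (made-up format)."""
    header = b"IN" + generation.to_bytes(4, "big")
    return header + bytes(INODE_SIZE - len(header))


@dataclass(frozen=True)
class PhysicalInit:
    """Physical logging: the record carries every initialised byte."""
    start_block: int
    image: bytes

    def replay(self, disk: dict) -> None:
        disk[self.start_block] = self.image

    def logged_bytes(self) -> int:
        return 8 + len(self.image)   # block number + full image


@dataclass(frozen=True)
class LogicalInit:
    """Operation-based logging: the record only describes the work."""
    start_block: int
    count: int
    generation: int

    def replay(self, disk: dict) -> None:
        # Recovery redoes the initialisation from the description.
        disk[self.start_block] = init_inode(self.generation) * self.count

    def logged_bytes(self) -> int:
        return 8 + 4 + 4             # block number, count, generation
```

With these assumed sizes the physical record logs 32776 bytes for one
chunk and the logical record logs 16, and both replay to the same
image.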

The problem that Darrick solved with the deferred operations was the
"atomic multi-transaction operations" problem - i.e. how to link all
these smaller, individual atomic modifications into a much larger
fail-safe atomic operation without blowing out the log reservation
to cover every single possible change that could be made.
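
The chaining idea can be sketched as a toy model in Python. This is
not the dfops implementation: run_chain, replay, recover and the
record tuples are hypothetical names made up for this example. Each
small transaction retires the previous intent, applies one change,
and logs a new intent for whatever work remains, so no single
transaction has to reserve space for the whole operation, and
recovery can finish a chain that a crash interrupted:

```python
def run_chain(steps, log, pending=None, crash_after=None):
    """Commit one small transaction per step.

    steps:       list of (key, value) modifications
    pending:     id of an already-logged intent this chain must retire
    crash_after: stop after this many transactions (simulated crash)
    """
    for i, (key, value) in enumerate(steps):
        if i == crash_after:
            return                      # crash: nothing more is logged
        txn = []
        if pending is not None:
            txn.append(("done", pending))
        txn.append(("update", key, value))
        rest = tuple(steps[i + 1:])
        if rest:
            pending = len(log)          # unique: one intent per txn
            txn.append(("intent", pending, rest))
        log.append(txn)                 # the transaction commits atomically


def replay(log):
    """Apply committed updates and collect intents with no done record."""
    state, open_intents = {}, {}
    for txn in log:
        for rec in txn:
            if rec[0] == "update":
                state[rec[1]] = rec[2]
            elif rec[0] == "intent":
                open_intents[rec[1]] = rec[2]
            elif rec[0] == "done":
                open_intents.pop(rec[1], None)
    return state, open_intents


def recover(log):
    """Finish any interrupted chain, then return the recovered state."""
    _, open_intents = replay(log)
    for intent_id, rest in open_intents.items():
        run_chain(list(rest), log, pending=intent_id)
    state, leftover = replay(log)
    assert not leftover                 # every intent has been retired
    return state
```

In this model a crash before the first commit leaves nothing in the
log, and a crash after it always recovers to the fully applied
operation, while every transaction holds at most three records no
matter how long the chain is.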

Now, keep in mind that the potential mechanisms/implementations I
talk about in those links are way out of date. It's the concepts and
direction - the bigger picture - that I'm demonstrating here. So
don't get stuck on "but that mechanism won't work", rather see it
for what it actually is - ideas for how we go from complex, massive
transactions to flexible aggregated chains of small, individual
intent-based transactions.

IOWs, dfops is just infrastructure to provide the intent chaining
functionality required by the "aggregated chains" modification
architecture. If we have to modify the dfops infrastructure to solve
problems along the way, then that's just fine. It's just a mechanism
we used to implement a piece of the bigger picture - dfops is not a
feature in the bigger picture at all.....

In terms of the bigger picture, the work Allison is doing to
re-architect the attribute manipulations around deferred operations
for parent pointers is breaking the new ground here. It's slow going
because it's the first major conversion to the "new way", but it's
telling us about all the things the dfops mechanism doesn't provide.
Conversions of other operations will be simpler as the dfops
infrastructure will be more capable as a result of the attribute
conversion.

But keep in mind that it is the conversion of attribute modification
to chained intents that is the big picture work here - dfops is just
the mechanism it uses. i.e. It's the conversion to the "operation
based logging + atomic multi-transaction" architecture that allows
us to add attribute modifications into directory operations and
guarantee the dir and attr mods are atomic.

From that perspective, dfops is just the implementation mechanism that
makes the architectural big picture change possible. dfops will need to
change and morph as necessary to support these changes, but those
changes are not architectural or big picture items - they are just
implementation details....

I like this patch because it means we are starting to reach the
end-game of this architectural change.  This patch indicates that
people are starting to understand the end goal of this work: to
break up big transactions into atomic chains of smaller, simpler
linked transactions.  And they are doing so without needing to be
explicitly told "this is how we want complex modifications to be
done". This is _really good_. :)

And that leads me to start thinking about the next step after that,
the one I'd always planned for: async processing of
the "atomic multi-transaction operations". That, at the time, was
based on the observation that we had supercomputers with thousands
of CPUs banging on the one filesystem and we always had CPUs to
spare. That's even more true these days: lots of filesystem
operations are still single threaded, so we have huge amounts of idle CPU
to spare. We could be using that to speed up things like rsync,
tarball extraction, rm -rf, etc.

I mapped out 10-15 years worth of work for XFS back in 2008, and
we've been regularly ticking off boxes on the implementation
checklist ever since. We're actually tracking fairly well on the
"done in 15yrs" timeline at the moment. Async operation is at the
end of that checklist...

What I'm trying to say is that the bigger picture here has been out
there for a long time, but you have to look past the trees to see
it. Hopefully now that I've pointed out the forest, it will be
easier to see :)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx