Re: [PATCH RESEND v2 01/18] xfs: Fix multi-transaction larp replay

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 16 Aug 2022 10:54:38 +1000

On Thu, Aug 11, 2022 at 06:55:16PM -0700, Alli wrote:
> On Wed, 2022-08-10 at 16:12 +1000, Dave Chinner wrote:
> > On Tue, Aug 09, 2022 at 10:01:49PM -0700, Alli wrote:
> > > On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> > > > On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> > > > > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote:
> > > > > > Recent parent pointer testing has exposed a bug in the underlying
> > > > > > attr replay.  A multi transaction replay currently performs a
> > > > > > single step of the replay, then defers the rest if there is more
> > > > > > to do.
> > > > 
> > > > Yup.
> > > > 
> > > > > > This causes race conditions with other attr replays that
> > > > > > might be recovered before the remaining deferred work has had a
> > > > > > chance to finish.
> > > > 
> > > > What other attr replays are we racing against?  There can only be
> > > > one incomplete attr item intent/done chain per inode present in log
> > > > recovery, right?
> > > No, a rename queues up a set and remove before committing the
> > > transaction.  One for the new parent pointer, and another to remove
> > > the old one.
> > 
> > Ah. That really needs to be described in the commit message -
> > changing from "single intent chain per object" to "multiple
> > concurrent independent and unserialised intent chains per object" is
> > a pretty important design rule change...
> > 
> > The whole point of intents is to allow complex, multi-stage
> > operations on a single object to be sequenced in a tightly
> > controlled manner. They weren't intended to be run as concurrent
> > lines of modification on single items; if you need to do two
> > modifications on an object, the intent chain ties the two
> > modifications together into a single whole.
> > 
> > One of the reasons I rewrote the attr state machine for LARP was to
> > enable new multiple attr operation chains to be easily built from
> > the entry points the state machine provides. Parent attr rename
> > needs a new intent chain to be built, not run multiple independent
> > intent chains for each modification.
> > 
> > > It can't be an attr replace because technically the names are
> > > different.
> > 
> > I disagree - we have all the pieces we need in the state machine
> > already, we just need to define separate attr names for the
> > remove and insert steps in the attr intent.
> > 
> > That is, the "replace" operation we execute when an attr set
> > overwrites the value is "technically" a "replace value" operation,
> > but we actually implement it as a "replace entire attribute"
> > operation.
> > 
> > Without LARP, we do that overwrite in independent steps via an
> > intermediate INCOMPLETE state to allow two xattrs of the same name
> > to exist in the attr tree at the same time. IOWs, the attr value
> > overwrite is effectively a "set-swap-remove" operation on two
> > entirely independent xattrs, ensuring that if we crash we always
> > have either the old or new xattr visible.
> > 
> > With LARP, we can remove the original attr first, thereby avoiding
> > the need for two versions of the xattr to exist in the tree in the
> > first place. However, we have to do these two operations as a pair
> > of linked independent operations. The intent chain provides the
> > linking, and requires us to log the name and the value of the attr
> > that we are overwriting in the intent. Hence we can always recover
> > the modification to completion no matter where in the operation we
> > fail.
> > 
> > When it comes to a parent attr rename operation, we are effectively
> > doing two linked operations - remove the old attr, set the new attr
> > - on different attributes. Implementation wise, it is exactly the
> > same sequence as a "replace value" operation, except for the fact
> > that the new attr we add has a different name.
> > 
> > Hence the only real difference between the existing "attr replace"
> > and the intent chain we need for "parent attr rename" is that we
> > have to log two attr names instead of one. 
> 
> To be clear, this would imply expanding xfs_attri_log_format to have
> another alfi_new_name_len field and another iovec for the attr intent,
> right?  Does it cause issues to change the on disk log layout after
> the original has merged?  Or is that ok for things that are still
> experimental? Thanks!

I think we can get away with this quite easily without breaking the
existing experimental code.

struct xfs_attri_log_format {
        uint16_t        alfi_type;      /* attri log item type */
        uint16_t        alfi_size;      /* size of this item */
        uint32_t        __pad;          /* pad to 64 bit aligned */
        uint64_t        alfi_id;        /* attri identifier */
        uint64_t        alfi_ino;       /* the inode for this attr operation */
        uint32_t        alfi_op_flags;  /* marks the op as a set or remove */
        uint32_t        alfi_name_len;  /* attr name length */
        uint32_t        alfi_value_len; /* attr value length */
        uint32_t        alfi_attr_filter;/* attr filter flags */
};

We have a padding field in there that is currently all zeros. Let's
make that a count of the number of {name, value} tuples that are
appended to the format. i.e.

struct xfs_attri_log_name {
        uint32_t        alfi_op_flags;  /* marks the op as a set or remove */
        uint32_t        alfi_name_len;  /* attr name length */
        uint32_t        alfi_value_len; /* attr value length */
        uint32_t        alfi_attr_filter;/* attr filter flags */
};

struct xfs_attri_log_format {
        uint16_t        alfi_type;      /* attri log item type */
        uint16_t        alfi_size;      /* size of this item */
        uint8_t         alfi_attr_cnt;  /* count of name/val pairs */
        uint8_t         __pad1;         /* pad to 64 bit aligned */
        uint16_t        __pad2;         /* pad to 64 bit aligned */
        uint64_t        alfi_id;        /* attri identifier */
        uint64_t        alfi_ino;       /* the inode for this attr operation */
        struct xfs_attri_log_name alfi_attr[]; /* attrs to operate on */
};

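Basically, the size and shape of the structure has not changed: the
four fields that used to follow alfi_ino are now the first alfi_attr[]
entry. If alfi_attr_cnt == 0, we just treat it as if alfi_attr_cnt == 1,
which gives us backwards compat with intents written by the existing
code.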
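
As a quick userspace sanity check of that claim, here is a minimal
sketch, not kernel code. The _v1 suffix on the old structure and the
attri_attr_count() helper are names made up for this example; the
field layouts are the ones quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Existing format as quoted above; the _v1 suffix is only for this sketch. */
struct xfs_attri_log_format_v1 {
	uint16_t	alfi_type;
	uint16_t	alfi_size;
	uint32_t	__pad;
	uint64_t	alfi_id;
	uint64_t	alfi_ino;
	uint32_t	alfi_op_flags;
	uint32_t	alfi_name_len;
	uint32_t	alfi_value_len;
	uint32_t	alfi_attr_filter;
};

struct xfs_attri_log_name {
	uint32_t	alfi_op_flags;
	uint32_t	alfi_name_len;
	uint32_t	alfi_value_len;
	uint32_t	alfi_attr_filter;
};

struct xfs_attri_log_format {
	uint16_t	alfi_type;
	uint16_t	alfi_size;
	uint8_t		alfi_attr_cnt;
	uint8_t		__pad1;
	uint16_t	__pad2;
	uint64_t	alfi_id;
	uint64_t	alfi_ino;
	struct xfs_attri_log_name alfi_attr[];
};

/* The old format is the new header plus exactly one attr descriptor. */
static_assert(sizeof(struct xfs_attri_log_format_v1) ==
	      sizeof(struct xfs_attri_log_format) +
	      sizeof(struct xfs_attri_log_name),
	      "old format must equal new header plus one descriptor");

/* The first descriptor starts where alfi_op_flags used to live. */
static_assert(offsetof(struct xfs_attri_log_format_v1, alfi_op_flags) ==
	      offsetof(struct xfs_attri_log_format, alfi_attr),
	      "first descriptor must overlay the old per-attr fields");

/*
 * Hypothetical helper: old intents have zeros where alfi_attr_cnt now
 * sits, so a count of 0 means "one attr, written by the old code".
 */
static inline unsigned int
attri_attr_count(const struct xfs_attri_log_format *alfi)
{
	return alfi->alfi_attr_cnt ? alfi->alfi_attr_cnt : 1;
}
```

If either static_assert fires, the overlay assumption does not hold
for that compiler/ABI and the compat trick would need a rethink.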

And then we just have as many followup regions for name/val pairs
as are defined by the alfi_attr_cnt and alfi_attr[] parts of the
structure. Each attr can have a different operation performed on
it, and each can have a different filter applied so they can exist
in different namespaces, too.
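
To make that concrete, a rough userspace sketch of how a parent attr
rename could be described as one intent carrying two descriptors
(remove the old name, then set the new one). Everything prefixed ex_
or EX_ is made up for this example, the op values are placeholders
rather than the kernel's definitions, and logging a zero value length
for the remove is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder op values for illustration only. */
enum { EX_ATTRI_OP_SET = 1, EX_ATTRI_OP_REMOVE = 2 };

struct xfs_attri_log_name {
	uint32_t	alfi_op_flags;
	uint32_t	alfi_name_len;
	uint32_t	alfi_value_len;
	uint32_t	alfi_attr_filter;
};

struct xfs_attri_log_format {
	uint16_t	alfi_type;
	uint16_t	alfi_size;
	uint8_t		alfi_attr_cnt;
	uint8_t		__pad1;
	uint16_t	__pad2;
	uint64_t	alfi_id;
	uint64_t	alfi_ino;
	struct xfs_attri_log_name alfi_attr[];
};

/*
 * Hypothetical: build the format header for a rename as a single
 * intent with two descriptors. Returns NULL on allocation failure;
 * the caller frees the result.
 */
static struct xfs_attri_log_format *
ex_attri_rename(uint64_t ino, const char *old_name, const char *new_name,
		uint32_t new_value_len, uint32_t filter)
{
	struct xfs_attri_log_format *alfi;

	assert(old_name && new_name);

	/* Zeroed allocation keeps the pad fields zero. */
	alfi = calloc(1, sizeof(*alfi) + 2 * sizeof(alfi->alfi_attr[0]));
	if (!alfi)
		return NULL;

	alfi->alfi_ino = ino;
	alfi->alfi_attr_cnt = 2;

	/* Step 1: remove the old attr (no value assumed for a remove). */
	alfi->alfi_attr[0].alfi_op_flags = EX_ATTRI_OP_REMOVE;
	alfi->alfi_attr[0].alfi_name_len = (uint32_t)strlen(old_name);
	alfi->alfi_attr[0].alfi_attr_filter = filter;

	/* Step 2: set the new attr under its new name. */
	alfi->alfi_attr[1].alfi_op_flags = EX_ATTRI_OP_SET;
	alfi->alfi_attr[1].alfi_name_len = (uint32_t)strlen(new_name);
	alfi->alfi_attr[1].alfi_value_len = new_value_len;
	alfi->alfi_attr[1].alfi_attr_filter = filter;

	return alfi;
}
```

The point is only that both modifications travel in the same intent,
so recovery sees them as one chain rather than two independent ones.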

So I don't think we need a new on-disk feature bit for this
enhancement - it definitely comes under the heading of "this stuff
is experimental"; this is the sort of early structure revision that
EXPERIMENTAL is supposed to cover....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx