On Thu, Apr 25, 2019 at 07:16:50AM -0400, Brian Foster wrote: > On Wed, Apr 24, 2019 at 10:47:47PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > > > During testing of xfs/141 on a V4 filesystem, I observed some > > inconsistent behavior with regards to resources that are held (i.e. > > remain locked) across a defer roll. The transaction roll always gives > > the defer roll function a new transaction, even if committing the old > > transaction fails. However, the defer roll function only rejoins the > > held resources if the transaction commit succeedied. This means that > > callers of defer roll have to figure out whether the held resources are > > attached to the transaction being passed back. > > > > Worse yet, if the defer roll was part of a defer finish call, we have a > > third possibility: the defer finish could pass back a dirty transaction > > with dirty held resources and an error code. > > > > The only sane way to handle all of these scenarios is to require that > > the code that held the resource either cancel the transaction before > > unlocking and releasing the resources, or use functions that detach > > resources from a transaction properly (e.g. xfs_trans_brelse) if they > > need to drop the reference before committing or cancelling the > > transaction. > > > > In order to make this so, change the defer roll code to join held > > resources to the new transaction unconditionally and fix all the bhold > > callers to release the held buffers correctly. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > --- > > fs/xfs/libxfs/xfs_attr.c | 31 ++++++++++--------------------- > > fs/xfs/libxfs/xfs_attr.h | 2 +- > > fs/xfs/libxfs/xfs_defer.c | 14 +++++++++----- > > fs/xfs/xfs_dquot.c | 22 +++++++++++----------- > > 4 files changed, 31 insertions(+), 38 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c > > index 2dd9ee2a2e08..ad954138243e 100644 > > --- a/fs/xfs/libxfs/xfs_attr.c > > +++ b/fs/xfs/libxfs/xfs_attr.c > > @@ -224,10 +224,10 @@ xfs_attr_try_sf_addname( > > */ > > int > > xfs_attr_set_args( > > - struct xfs_da_args *args, > > - struct xfs_buf **leaf_bp) > > + struct xfs_da_args *args) > > { > > struct xfs_inode *dp = args->dp; > > + struct xfs_buf *leaf_bp = NULL; > > int error; > > > > /* > > @@ -255,7 +255,7 @@ xfs_attr_set_args( > > * It won't fit in the shortform, transform to a leaf block. > > * GROT: another possible req'mt for a double-split btree op. > > */ > > - error = xfs_attr_shortform_to_leaf(args, leaf_bp); > > + error = xfs_attr_shortform_to_leaf(args, &leaf_bp); > > if (error) > > return error; > > > > @@ -264,22 +264,15 @@ xfs_attr_set_args( > > * concurrent AIL push cannot grab the half-baked leaf > > * buffer and run into problems with the write verifier. > > */ > > - xfs_trans_bhold(args->trans, *leaf_bp); > > + xfs_trans_bhold(args->trans, leaf_bp); > > > > error = xfs_defer_finish(&args->trans); > > - if (error) > > + if (error) { > > + xfs_trans_brelse(args->trans, leaf_bp); > > return error; > > + } > > Shouldn't we just unconditionally release the bhold here? Then we cover > the scenario where the buffer might somehow be dirty. Doh. Yeah, I missed that; thanks for catching it. > It also seems more > logically consistent since this code is now responsible only for the > bhold (to keep the buf locked and deal with the AIL writeback verifier > race thing). Beyond that, the final transaction can release the buffer > on commit or abort. E.g., something like the following..? > > /* > * Hold the buffer locked to prevent ... Release the hold to > * transfer the buf to the final tx. > */ > xfs_trans_bhold(args->trans, leaf_bp); > error = xfs_defer_finish(&args->trans); > xfs_trans_bhold_release(args->trans, leaf_bp); > if (error) > return error; > > BTW out of curiosity, did you happen to identify what deferred operation > was actually queued here that triggered a tx roll in your xfs/141 test? > Was it an allocation/agfl fixup or something? I only looked after the fact but it looked like it was an agfl fixup. The fs was a V4 filesystem, fwiw. > > > > - /* > > - * Commit the leaf transformation. We'll need another > > - * (linked) transaction to add the new attribute to the > > - * leaf. > > - */ > > - error = xfs_trans_roll_inode(&args->trans, dp); > > - if (error) > > - return error; > > - xfs_trans_bjoin(args->trans, *leaf_bp); > > - *leaf_bp = NULL; > > + xfs_trans_bhold_release(args->trans, leaf_bp); > > } > > > > if (xfs_bmap_one_block(dp, XFS_ATTR_FORK)) > > @@ -322,7 +315,6 @@ xfs_attr_set( > > int flags) > > { > > struct xfs_mount *mp = dp->i_mount; > > - struct xfs_buf *leaf_bp = NULL; > > struct xfs_da_args args; > > struct xfs_trans_res tres; > > int rsvd = (flags & ATTR_ROOT) != 0; > > @@ -381,9 +373,9 @@ xfs_attr_set( > > goto out_trans_cancel; > > > > xfs_trans_ijoin(args.trans, dp, 0); > > - error = xfs_attr_set_args(&args, &leaf_bp); > > + error = xfs_attr_set_args(&args); > > if (error) > > - goto out_release_leaf; > > + goto out_trans_cancel; > > if (!args.trans) { > > /* shortform attribute has already been committed */ > > goto out_unlock; > > @@ -408,9 +400,6 @@ xfs_attr_set( > > xfs_iunlock(dp, XFS_ILOCK_EXCL); > > return error; > > > > -out_release_leaf: > > - if (leaf_bp) > > - xfs_trans_brelse(args.trans, leaf_bp); > > out_trans_cancel: > > if (args.trans) > > xfs_trans_cancel(args.trans); > > diff --git a/fs/xfs/libxfs/xfs_attr.h b/fs/xfs/libxfs/xfs_attr.h > > index 2297d8467666..3b0dce06e454 100644 > > --- a/fs/xfs/libxfs/xfs_attr.h > > +++ b/fs/xfs/libxfs/xfs_attr.h > > @@ -140,7 +140,7 @@ int xfs_attr_get(struct xfs_inode *ip, const unsigned char *name, > > unsigned char *value, int *valuelenp, int flags); > > int xfs_attr_set(struct xfs_inode *dp, const unsigned char *name, > > unsigned char *value, int valuelen, int flags); > > -int xfs_attr_set_args(struct xfs_da_args *args, struct xfs_buf **leaf_bp); > > +int xfs_attr_set_args(struct xfs_da_args *args); > > int xfs_attr_remove(struct xfs_inode *dp, const unsigned char *name, int flags); > > int xfs_attr_remove_args(struct xfs_da_args *args); > > int xfs_attr_list(struct xfs_inode *dp, char *buffer, int bufsize, > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c > > index 94f00427de98..1c6bf2105939 100644 > > --- a/fs/xfs/libxfs/xfs_defer.c > > +++ b/fs/xfs/libxfs/xfs_defer.c > > @@ -274,13 +274,15 @@ xfs_defer_trans_roll( > > > > trace_xfs_defer_trans_roll(tp, _RET_IP_); > > > > - /* Roll the transaction. */ > > + /* > > + * Roll the transaction. Rolling always given a new transaction (even > > + * if committing the old one fails!) to hand back to the caller, so we > > + * join the held resources to the new transaction so that we always > > + * return with the held resources joined to @tpp, no matter what > > + * happened. > > + */ > > error = xfs_trans_roll(tpp); > > tp = *tpp; > > - if (error) { > > - trace_xfs_defer_trans_roll_error(tp, error); > > - return error; > > - } > > > > /* Rejoin the joined inodes. */ > > for (i = 0; i < ipcount; i++) > > @@ -292,6 +294,8 @@ xfs_defer_trans_roll( > > xfs_trans_bhold(tp, bplist[i]); > > } > > > > + if (error) > > + trace_xfs_defer_trans_roll_error(tp, error); > > return error; > > } > > > > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c > > index 87e6dd5326d5..7f058120a921 100644 > > --- a/fs/xfs/xfs_dquot.c > > +++ b/fs/xfs/xfs_dquot.c > > @@ -359,9 +359,8 @@ xfs_dquot_disk_alloc( > > */ > > xfs_trans_bhold(tp, bp); > > error = xfs_defer_finish(tpp); > > - tp = *tpp; > > if (error) { > > - xfs_buf_relse(bp); > > + xfs_trans_brelse(*tpp, bp); > > return error; > > } > > Similar comment here wrt to the dirty buf case. This code means that if > the buffer remained dirty, we'd hold it to the tx, brelse will skip it > and we wouldn't have set *bpp before we return the error. The caller > cancels the tx, but still owns the held buf yet doesn't have a pointer > to it. > > The dirty buffer scenario may not play out in either of these contexts, It won't until a security researcher creates a specially crafted corrupt filesystem etc. :) > but I'd prefer to try and use/propagate error handling patterns that are > totally robust if at all possible. We'll eventually have that scenario > play out somewhere I'm sure and chances are this code will be > reused/copied/referenced for any new bhold+dfops situations. Thinking > about it, all it would take is some new pre-roll error handling down in > xfs_defer_finish() to quietly turn these error checks into a leak > vector. Sounds like a good place to stuff in another error injector... > This case looks like it actually expects the buffer to remain held on > return so perhaps we just need to set the *bpp pointer sooner. The > caller can then commit the final tx and set it's *bpp (if the commit is > successful) or release bp itself and cancel the tx if this function > returned an error. Hm. xfs_dquot_disk_alloc passes the dquot buffer pointer a couple of levels up (to xfs_qm_dqread) so that we can fill the incore dquot with whatever's written on disk, so I think we can stick to returning 0 with the buffer pointer set, or returning an error and no buffer at all. I see two advantages of that: first, both hold-defer-unhold cases have exactly the same code, and we can simplify the error handling in xfs_qm_dqread_alloc. > > *bpp = bp; > > @@ -521,7 +520,7 @@ xfs_qm_dqread_alloc( > > struct xfs_buf **bpp) > > { > > struct xfs_trans *tp; > > - struct xfs_buf *bp; > > + struct xfs_buf *bp = NULL; > > int error; > > > > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_dqalloc, > > @@ -529,24 +528,25 @@ xfs_qm_dqread_alloc( > > if (error) > > goto err; > > > > + /* > > + * This function returns with the buffer held to the transaction, so > > + * either we pass it back to our caller still locked or cancel the > > + * transaction and unlock it manually if we're not passing it back. > > + */ > > error = xfs_dquot_disk_alloc(&tp, dqp, &bp); > > if (error) > > goto err_cancel; > > > > error = xfs_trans_commit(tp); > > - if (error) { > > - /* > > - * Buffer was held to the transaction, so we have to unlock it > > - * manually here because we're not passing it back. > > - */ > > - xfs_buf_relse(bp); > > - goto err; > > - } > > + if (error) > > + goto err_cancel; > > A failed xfs_trans_commit() is equivalent to an xfs_trans_cancel() (it > frees the tx) so there's no need to call it here. Will fix. --D > Brian > > > *bpp = bp; > > return 0; > > > > err_cancel: > > xfs_trans_cancel(tp); > > + if (bp) > > + xfs_buf_relse(bp); > > err: > > return error; > > }