On Fri, Sep 30, 2016 at 12:40:40PM -0700, Darrick J. Wong wrote: > On Fri, Sep 30, 2016 at 12:21:03PM -0400, Brian Foster wrote: > > On Thu, Sep 29, 2016 at 08:07:05PM -0700, Darrick J. Wong wrote: > > > Plumb in the upper level interface to schedule and finish deferred > > > refcount operations via the deferred ops mechanism. > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > > --- > > > fs/xfs/libxfs/xfs_defer.h | 1 > > > fs/xfs/libxfs/xfs_refcount.c | 170 ++++++++++++++++++++++++++++++++++++++ > > > fs/xfs/libxfs/xfs_refcount.h | 12 +++ > > > fs/xfs/xfs_error.h | 4 + > > > fs/xfs/xfs_refcount_item.c | 73 ++++++++++++++++ > > > fs/xfs/xfs_super.c | 1 > > > fs/xfs/xfs_trace.h | 3 + > > > fs/xfs/xfs_trans.h | 8 +- > > > fs/xfs/xfs_trans_refcount.c | 186 ++++++++++++++++++++++++++++++++++++++++++ > > > 9 files changed, 452 insertions(+), 6 deletions(-) > > > > > > > > ... > > > diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c > > > index 599a8d2..e44007a 100644 > > > --- a/fs/xfs/xfs_refcount_item.c > > > +++ b/fs/xfs/xfs_refcount_item.c > > > @@ -396,9 +396,19 @@ xfs_cui_recover( > > > { > > > int i; > > > int error = 0; > > > + unsigned int refc_type; > > > struct xfs_phys_extent *refc; > > > xfs_fsblock_t startblock_fsb; > > > bool op_ok; > > > + struct xfs_cud_log_item *cudp; > > > + struct xfs_trans *tp; > > > + struct xfs_btree_cur *rcur = NULL; > > > + enum xfs_refcount_intent_type type; > > > + xfs_fsblock_t firstfsb; > > > + xfs_extlen_t adjusted; > > > + struct xfs_bmbt_irec irec; > > > + struct xfs_defer_ops dfops; > > > + bool requeue_only = false; > > > > > > ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags)); > > > > > > @@ -437,7 +447,68 @@ xfs_cui_recover( > > > } > > > } > > > > > > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); > > > + if (error) > > > + return error; > > > + cudp = xfs_trans_get_cud(tp, cuip); > > > + > > > + xfs_defer_init(&dfops, &firstfsb); > > > > A comment would be nice here to point out the approach. E.g., that > > refcount updates are initially deferred under normal runtime > > circumstances, they handle reservation usage internally/dynamically, and > > that since we're in recovery, we start the initial update directly and > > defer the rest that won't fit in the transaction (worded better and > > assuming I understand all that correctly ;P). > > Yep, your understanding is correct. I'll put that in as a comment. > > > (Sorry for the comment requests and whatnot, BTW. I'm catching up from a > > couple weeks of PTO, probably late to the game and not up to speed on > > the latest status of the patchset. Feel free to defer, drop, or > > conditionalize any of the aesthetic stuff to whenever is opportune if > > this stuff is otherwise close to merge). > > NP. I appreciate review whenever I can get it. :) > > (Plus, you found a bug! :) :)) > > > > + for (i = 0; i < cuip->cui_format.cui_nextents; i++) { > > > + refc = &cuip->cui_format.cui_extents[i]; > > > + refc_type = refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK; > > > + switch (refc_type) { > > > + case XFS_REFCOUNT_INCREASE: > > > + case XFS_REFCOUNT_DECREASE: > > > + case XFS_REFCOUNT_ALLOC_COW: > > > + case XFS_REFCOUNT_FREE_COW: > > > + type = refc_type; > > > + break; > > > + default: > > > + error = -EFSCORRUPTED; > > > + goto abort_error; > > > + } > > > + if (requeue_only) > > > + adjusted = 0; > > > + else > > > + error = xfs_trans_log_finish_refcount_update(tp, cudp, > > > + &dfops, type, refc->pe_startblock, refc->pe_len, > > > + &adjusted, &rcur); > > > + if (error) > > > + goto abort_error; > > > + > > > + /* Requeue what we didn't finish. */ > > > + if (adjusted < refc->pe_len) { > > > + irec.br_startblock = refc->pe_startblock + adjusted; > > > + irec.br_blockcount = refc->pe_len - adjusted; > > > > Hmm, so it appears we walk the range of blocks from beginning to end, > > but the refcount update code doesn't necessarily always work that way. > > It merges the boundaries and walks the middle range from start to end. > > So what happens if the call above ends up doing a right merge and then > > skips out on any other changes due to the transaction reservation? > > D'oh! You've found a bug! _refcount_adjust needs to communicate to > its caller how much work is left, which does by incrementing *adjusted > every time it finishes more work. The caller then moves the start of > the extent upwards by *adjusted. Unfortunately, as you point out, a > right merge actually does work at the upper end of the extent, and this > is not correctly accounted for. > > To fix this, I'll change _refcount_adjust to report the unfinished > extent directly to the caller, which will simplify both the function and > its callers' accounting considerably. > > Good catch! > Ok. Another option might be to perform the refcount update work in order, but whatever is easier/cleaner is probably fine (and it's probably easier to update the interface than the mechanism at this point). > > Brian > > > > P.S., Even if I'm missing something and this is not an issue, do we have > > any log recovery oriented reflink xfstests in the current test pile? If > > not, I'd suggest that something as simple as a "do a bunch of reflinks + > > xfs_io -c 'shutdown -f' + umount/mount" loop could go a long way towards > > shaking out any issues. Log recovery can be a pita and otherwise > > problems therein can go undetected for a surprising amount of time. > > xfs/{313,316,321,324,326} use the error injection mechanism to test log > recovery. > Great, thanks. What about basic reflink support for fsstress? I suppose if we had that, some of the existing fsstress->crash->recover tests would provide coverage as well. One of the things I actually do every now and then is run an infinite fsstress+remount loop in one thread and and a randomly timed (e.g., every 0-30s) fs shutdown trigger in another. That helps catch recovery issues, corruption issues, etc. Brian > --D > > > > > > + switch (type) { > > > + case XFS_REFCOUNT_INCREASE: > > > + error = xfs_refcount_increase_extent( > > > + tp->t_mountp, &dfops, &irec); > > > + break; > > > + case XFS_REFCOUNT_DECREASE: > > > + error = xfs_refcount_decrease_extent( > > > + tp->t_mountp, &dfops, &irec); > > > + break; > > > + default: > > > + ASSERT(0); > > > + } > > > + if (error) > > > + goto abort_error; > > > + requeue_only = true; > > > + } > > > + } > > > + > > > + xfs_refcount_finish_one_cleanup(tp, rcur, error); > > > + error = xfs_defer_finish(&tp, &dfops, NULL); > > > + if (error) > > > + goto abort_error; > > > set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags); > > > - xfs_cui_release(cuip); > > > + error = xfs_trans_commit(tp); > > > + return error; > > > + > > > +abort_error: > > > + xfs_refcount_finish_one_cleanup(tp, rcur, error); > > > + xfs_defer_cancel(&dfops); > > > + xfs_trans_cancel(tp); > > > return error; > > > } > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > > > index abe69c6..6234622 100644 > > > --- a/fs/xfs/xfs_super.c > > > +++ b/fs/xfs/xfs_super.c > > > @@ -1903,6 +1903,7 @@ init_xfs_fs(void) > > > > > > xfs_extent_free_init_defer_op(); > > > xfs_rmap_update_init_defer_op(); > > > + xfs_refcount_update_init_defer_op(); > > > > > > xfs_dir_startup(); > > > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > index fed1906..195a168 100644 > > > --- a/fs/xfs/xfs_trace.h > > > +++ b/fs/xfs/xfs_trace.h > > > @@ -2931,6 +2931,9 @@ DEFINE_AG_ERROR_EVENT(xfs_refcount_find_right_extent_error); > > > DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared); > > > DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result); > > > DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error); > > > +#define DEFINE_REFCOUNT_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT > > > +DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_defer); > > > +DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_deferred); > > > > > > TRACE_EVENT(xfs_refcount_finish_one_leftover, > > > TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, > > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h > > > index fe69e20..a7a87d2 100644 > > > --- a/fs/xfs/xfs_trans.h > > > +++ b/fs/xfs/xfs_trans.h > > > @@ -37,6 +37,8 @@ struct xfs_rud_log_item; > > > struct xfs_rui_log_item; > > > struct xfs_btree_cur; > > > struct xfs_cui_log_item; > > > +struct xfs_cud_log_item; > > > +struct xfs_defer_ops; > > > > > > typedef struct xfs_log_item { > > > struct list_head li_ail; /* AIL pointers */ > > > @@ -252,11 +254,13 @@ int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp, > > > /* refcount updates */ > > > enum xfs_refcount_intent_type; > > > > > > +void xfs_refcount_update_init_defer_op(void); > > > struct xfs_cud_log_item *xfs_trans_get_cud(struct xfs_trans *tp, > > > struct xfs_cui_log_item *cuip); > > > int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp, > > > - struct xfs_cud_log_item *cudp, > > > + struct xfs_cud_log_item *cudp, struct xfs_defer_ops *dfops, > > > enum xfs_refcount_intent_type type, xfs_fsblock_t startblock, > > > - xfs_extlen_t blockcount, struct xfs_btree_cur **pcur); > > > + xfs_extlen_t blockcount, xfs_extlen_t *adjusted, > > > + struct xfs_btree_cur **pcur); > > > > > > #endif /* __XFS_TRANS_H__ */ > > > diff --git a/fs/xfs/xfs_trans_refcount.c b/fs/xfs/xfs_trans_refcount.c > > > index b18d548..e3ac994 100644 > > > --- a/fs/xfs/xfs_trans_refcount.c > > > +++ b/fs/xfs/xfs_trans_refcount.c > > > @@ -56,15 +56,17 @@ int > > > xfs_trans_log_finish_refcount_update( > > > struct xfs_trans *tp, > > > struct xfs_cud_log_item *cudp, > > > + struct xfs_defer_ops *dop, > > > enum xfs_refcount_intent_type type, > > > xfs_fsblock_t startblock, > > > xfs_extlen_t blockcount, > > > + xfs_extlen_t *adjusted, > > > struct xfs_btree_cur **pcur) > > > { > > > int error; > > > > > > - /* XXX: leave this empty for now */ > > > - error = -EFSCORRUPTED; > > > + error = xfs_refcount_finish_one(tp, dop, type, startblock, > > > + blockcount, adjusted, pcur); > > > > > > /* > > > * Mark the transaction dirty, even on error. This ensures the > > > @@ -78,3 +80,183 @@ xfs_trans_log_finish_refcount_update( > > > > > > return error; > > > } > > > + > > > +/* Sort refcount intents by AG. */ > > > +static int > > > +xfs_refcount_update_diff_items( > > > + void *priv, > > > + struct list_head *a, > > > + struct list_head *b) > > > +{ > > > + struct xfs_mount *mp = priv; > > > + struct xfs_refcount_intent *ra; > > > + struct xfs_refcount_intent *rb; > > > + > > > + ra = container_of(a, struct xfs_refcount_intent, ri_list); > > > + rb = container_of(b, struct xfs_refcount_intent, ri_list); > > > + return XFS_FSB_TO_AGNO(mp, ra->ri_startblock) - > > > + XFS_FSB_TO_AGNO(mp, rb->ri_startblock); > > > +} > > > + > > > +/* Get an CUI. */ > > > +STATIC void * > > > +xfs_refcount_update_create_intent( > > > + struct xfs_trans *tp, > > > + unsigned int count) > > > +{ > > > + struct xfs_cui_log_item *cuip; > > > + > > > + ASSERT(tp != NULL); > > > + ASSERT(count > 0); > > > + > > > + cuip = xfs_cui_init(tp->t_mountp, count); > > > + ASSERT(cuip != NULL); > > > + > > > + /* > > > + * Get a log_item_desc to point at the new item. > > > + */ > > > + xfs_trans_add_item(tp, &cuip->cui_item); > > > + return cuip; > > > +} > > > + > > > +/* Set the phys extent flags for this reverse mapping. */ > > > +static void > > > +xfs_trans_set_refcount_flags( > > > + struct xfs_phys_extent *refc, > > > + enum xfs_refcount_intent_type type) > > > +{ > > > + refc->pe_flags = 0; > > > + switch (type) { > > > + case XFS_REFCOUNT_INCREASE: > > > + case XFS_REFCOUNT_DECREASE: > > > + case XFS_REFCOUNT_ALLOC_COW: > > > + case XFS_REFCOUNT_FREE_COW: > > > + refc->pe_flags |= type; > > > + break; > > > + default: > > > + ASSERT(0); > > > + } > > > +} > > > + > > > +/* Log refcount updates in the intent item. */ > > > +STATIC void > > > +xfs_refcount_update_log_item( > > > + struct xfs_trans *tp, > > > + void *intent, > > > + struct list_head *item) > > > +{ > > > + struct xfs_cui_log_item *cuip = intent; > > > + struct xfs_refcount_intent *refc; > > > + uint next_extent; > > > + struct xfs_phys_extent *ext; > > > + > > > + refc = container_of(item, struct xfs_refcount_intent, ri_list); > > > + > > > + tp->t_flags |= XFS_TRANS_DIRTY; > > > + cuip->cui_item.li_desc->lid_flags |= XFS_LID_DIRTY; > > > + > > > + /* > > > + * atomic_inc_return gives us the value after the increment; > > > + * we want to use it as an array index so we need to subtract 1 from > > > + * it. > > > + */ > > > + next_extent = atomic_inc_return(&cuip->cui_next_extent) - 1; > > > + ASSERT(next_extent < cuip->cui_format.cui_nextents); > > > + ext = &cuip->cui_format.cui_extents[next_extent]; > > > + ext->pe_startblock = refc->ri_startblock; > > > + ext->pe_len = refc->ri_blockcount; > > > + xfs_trans_set_refcount_flags(ext, refc->ri_type); > > > +} > > > + > > > +/* Get an CUD so we can process all the deferred refcount updates. */ > > > +STATIC void * > > > +xfs_refcount_update_create_done( > > > + struct xfs_trans *tp, > > > + void *intent, > > > + unsigned int count) > > > +{ > > > + return xfs_trans_get_cud(tp, intent); > > > +} > > > + > > > +/* Process a deferred refcount update. */ > > > +STATIC int > > > +xfs_refcount_update_finish_item( > > > + struct xfs_trans *tp, > > > + struct xfs_defer_ops *dop, > > > + struct list_head *item, > > > + void *done_item, > > > + void **state) > > > +{ > > > + struct xfs_refcount_intent *refc; > > > + xfs_extlen_t adjusted; > > > + int error; > > > + > > > + refc = container_of(item, struct xfs_refcount_intent, ri_list); > > > + error = xfs_trans_log_finish_refcount_update(tp, done_item, dop, > > > + refc->ri_type, > > > + refc->ri_startblock, > > > + refc->ri_blockcount, > > > + &adjusted, > > > + (struct xfs_btree_cur **)state); > > > + /* Did we run out of reservation? Requeue what we didn't finish. */ > > > + if (!error && adjusted < refc->ri_blockcount) { > > > + ASSERT(refc->ri_type == XFS_REFCOUNT_INCREASE || > > > + refc->ri_type == XFS_REFCOUNT_DECREASE); > > > + refc->ri_startblock += adjusted; > > > + refc->ri_blockcount -= adjusted; > > > + return -EAGAIN; > > > + } > > > + kmem_free(refc); > > > + return error; > > > +} > > > + > > > +/* Clean up after processing deferred refcounts. */ > > > +STATIC void > > > +xfs_refcount_update_finish_cleanup( > > > + struct xfs_trans *tp, > > > + void *state, > > > + int error) > > > +{ > > > + struct xfs_btree_cur *rcur = state; > > > + > > > + xfs_refcount_finish_one_cleanup(tp, rcur, error); > > > +} > > > + > > > +/* Abort all pending CUIs. */ > > > +STATIC void > > > +xfs_refcount_update_abort_intent( > > > + void *intent) > > > +{ > > > + xfs_cui_release(intent); > > > +} > > > + > > > +/* Cancel a deferred refcount update. */ > > > +STATIC void > > > +xfs_refcount_update_cancel_item( > > > + struct list_head *item) > > > +{ > > > + struct xfs_refcount_intent *refc; > > > + > > > + refc = container_of(item, struct xfs_refcount_intent, ri_list); > > > + kmem_free(refc); > > > +} > > > + > > > +static const struct xfs_defer_op_type xfs_refcount_update_defer_type = { > > > + .type = XFS_DEFER_OPS_TYPE_REFCOUNT, > > > + .max_items = XFS_CUI_MAX_FAST_EXTENTS, > > > + .diff_items = xfs_refcount_update_diff_items, > > > + .create_intent = xfs_refcount_update_create_intent, > > > + .abort_intent = xfs_refcount_update_abort_intent, > > > + .log_item = xfs_refcount_update_log_item, > > > + .create_done = xfs_refcount_update_create_done, > > > + .finish_item = xfs_refcount_update_finish_item, > > > + .finish_cleanup = xfs_refcount_update_finish_cleanup, > > > + .cancel_item = xfs_refcount_update_cancel_item, > > > +}; > > > + > > > +/* Register the deferred op type. */ > > > +void > > > +xfs_refcount_update_init_defer_op(void) > > > +{ > > > + xfs_defer_init_op_type(&xfs_refcount_update_defer_type); > > > +} > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html