Hi Xuenan, On Wed, Dec 28, 2022 at 09:32:04PM +0800, Guo Xuenan wrote: > Recently, I noticed an special problem on our products. The disk space > is sufficient, while encounter btree split failure. After looking inside > the disk, I found the specific AG space is about to be exhausted. > More seriously, under this special situation, btree split failure will > always be triggered and XFS filesystem is unavailable. > > After analysis the disk image and the AG, which seem same as Gao Xiang met > before [1], The slight difference is my problem is triggered by creating > new inode, I read through your discussion the mailing list[1], I think it's > probably the same root cause. > > As Dave pointed out, args->minleft has an *exact* meaning, both inode fork > allocations and inode chunk extent allocation pre-calculate args->minleft > to ensure inobt record insertion succeed in any circumstances. But, this > guarantee dosen't seem to be reliable, especially when it happens to meet > cnt&bno btree splitting. Gao Xiang proposed an solution by adding postalloc > to make current allocation reserve more space for inobt splitting, I think > it's ok to slove their own problem, but may not be sloved completely, since > inode chunk extent allocation may failed during inobt splitting too. > > Meanwhile, Gao Xiang also noticed strip align requirement may increase > probablility of the problem, which is totally true. I think the reason is > that align requirement may lead to one free extent divied into two, which > increase probablility of the problem. eg: we needs an extent length 4 and > align 4 and find a suitable free extent [3,10] ([start,length]), after this > allocation, the lefted extents are [3,1] and [9,5]. Therefore, alignment > allocation is more likely to increase the number of free extents, then may > lead cnt&bno btree splitting, which increases likelihood of the problem. > > In my opinion, XFS has avariety of btrees, in order to ensure the growth of > the btrees, XFS use args->minleft/agfl/perag reservation to achieve this, > which corresponds as follows: > > perag reservation: for reverse map & freeinode & refcount btree > args->minleft : for inode btree & inode/attr fork btree > agfl : for block btree (bnobt) & count btree (cntbt) > (rmapbt is exception, it has reservation but get free block from agfl, > since agfl blocks are considered as free when calculate available space, > and rmapbt allocates block from it's reservation, *rmapbt growth* don't > affect available space calculation, so don't care about it) > > Before each allocation need to calculate or prepare these reservation, > more precisely, call `xfs_alloc_space_available` to determine whether there > is enough space to complete current allocation, including those involved > tree growth. if xfs_alloc_space_available is true which means tree growth > can definitely success. > > I think the root cause of the current problem is when AG space is about to > exhausted and happened to encounter cnt&bno btree splitting, > `xfs_alloc_space_available` does't work well. > > Because, considering btree splitting during "space allocation", we will > meet block allocations many times for each "space allocation": > 1st. allocation for space required at the beginning, i.e extent A1. > 2nd. then need to *insert* or *update* free extent to cntbt & bnobt, which > *may* lead to btree splitting and need allocation (as explained above) > 3rd. extent A1 need to insert inode/attr fork btree or inobt etc.. which > *may* also lead to splitting and allocation > > So, during block allocations, which will calling xfs_alloc_space_available > at least 2 times (2nd don't call it, because bnt&cnt btree get block from > agfl). Since the 1st judgement of space available, it has guaranteed there > is enough space to complete 2nd and 3rd allocation, *BUT* after 2nd > allocation, if the height bno&cnt btree increase, min_freelist of agfl will > increase, more acurrate, xfs_alloc_min_freelist will increase, which may > lead to 3rd allocation failed, and 3rd allocation failure will make our xfs > filesystem unavailable. > > According to the above description, since every space allocation, we have > guaranteed agfl min free list is enough for bno&cnt btree growth by > calling `xfs_alloc_fix_freelist` to reserve enough agfl before we do 1st Personally I'm not sure if it's the right way since I don't think we should select _this AG_ at all in the 1st allocation (yet due to lack of necessary reservation for bnobt/cntbt splits, currently it could select it by mistake) rather than silently select _this AG_ which may could impact another reservation and cause even very rare corner cases to users. Anyway, I have a low-confident unfinished AGFL reservation patchset without verification (I don't have a real reproducer-- need to seek Zorro and I badly got COVID with fever since the last week..) >From 7aed129bbd23fa2cc67c1818f6e904d681b43858 Mon Sep 17 00:00:00 2001 From: Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> Date: Mon, 19 Dec 2022 08:03:52 +0800 Subject: [PATCH 1/2] xfs: add AGFL refilling reservation Signed-off-by: Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> --- fs/xfs/libxfs/xfs_ag.h | 4 +-- fs/xfs/libxfs/xfs_ag_resv.c | 52 +++++++++++++++++++++++++++---------- fs/xfs/libxfs/xfs_ag_resv.h | 3 ++- 3 files changed, 42 insertions(+), 17 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index 191b22b9a35b..1e46d3068afe 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -61,8 +61,8 @@ struct xfs_perag { /* Blocks reserved for all kinds of metadata. */ struct xfs_ag_resv pag_meta_resv; - /* Blocks reserved for the reverse mapping btree. */ - struct xfs_ag_resv pag_rmapbt_resv; + /* Blocks reserved for the reverse mapping btree and AGFL refilling. */ + struct xfs_ag_resv pag_agfl_resv; /* for rcu-safe freeing */ struct rcu_head rcu_head; diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c index 5af123d13a63..f9bc190dd718 100644 --- a/fs/xfs/libxfs/xfs_ag_resv.c +++ b/fs/xfs/libxfs/xfs_ag_resv.c @@ -75,13 +75,13 @@ xfs_ag_resv_critical( switch (type) { case XFS_AG_RESV_METADATA: - avail = pag->pagf_freeblks - pag->pag_rmapbt_resv.ar_reserved; + avail = pag->pagf_freeblks - pag->pag_agfl_resv.ar_reserved; orig = pag->pag_meta_resv.ar_asked; break; case XFS_AG_RESV_RMAPBT: avail = pag->pagf_freeblks + pag->pagf_flcount - pag->pag_meta_resv.ar_reserved; - orig = pag->pag_rmapbt_resv.ar_asked; + orig = pag->pag_agfl_resv.ar_asked; break; default: ASSERT(0); @@ -107,10 +107,11 @@ xfs_ag_resv_needed( { xfs_extlen_t len; - len = pag->pag_meta_resv.ar_reserved + pag->pag_rmapbt_resv.ar_reserved; + len = pag->pag_meta_resv.ar_reserved + pag->pag_agfl_resv.ar_reserved; switch (type) { case XFS_AG_RESV_METADATA: case XFS_AG_RESV_RMAPBT: + case XFS_AG_RESV_AGFL: len -= xfs_perag_resv(pag, type)->ar_reserved; break; case XFS_AG_RESV_NONE: @@ -145,7 +146,7 @@ __xfs_ag_resv_free( * considered "free", so whatever was reserved at mount time must be * given back at umount. */ - if (type == XFS_AG_RESV_RMAPBT) + if (type == XFS_AG_RESV_RMAPBT || type == XFS_AG_RESV_AGFL) oldresv = resv->ar_orig_reserved; else oldresv = resv->ar_reserved; @@ -168,7 +169,7 @@ xfs_ag_resv_free( int error; int err2; - error = __xfs_ag_resv_free(pag, XFS_AG_RESV_RMAPBT); + error = __xfs_ag_resv_free(pag, XFS_AG_RESV_AGFL); err2 = __xfs_ag_resv_free(pag, XFS_AG_RESV_METADATA); if (err2 && !error) error = err2; @@ -191,7 +192,7 @@ __xfs_ag_resv_init( ask = used; switch (type) { - case XFS_AG_RESV_RMAPBT: + case XFS_AG_RESV_AGFL: /* * Space taken by the rmapbt is not subtracted from fdblocks * because the rmapbt lives in the free space. Here we must @@ -244,6 +245,27 @@ __xfs_ag_resv_init( return 0; } +int +xfs_agfl_calc_reserves( + struct xfs_mount *mp, + struct xfs_trans *tp, + struct xfs_perag *pag, + xfs_extlen_t *ask, + xfs_extlen_t *used) +{ + xfs_extlen_t len, allocbtres; + + ASSERT(mp->m_alloc_maxlevels); + + allocbtres = mp->m_alloc_maxlevels; + len = 2 * allocbtres * + max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]); + len = max(len, allocbtres * M_IGEO(mp)->inobt_maxlevels); + + *ask += len; + return 0; +} + /* Create a per-AG block reservation. */ int xfs_ag_resv_init( @@ -296,15 +318,19 @@ xfs_ag_resv_init( has_resv = true; } - /* Create the RMAPBT metadata reservation */ - if (pag->pag_rmapbt_resv.ar_asked == 0) { + /* Create the RMAPBT metadata and AGFL refilling reservation */ + if (pag->pag_agfl_resv.ar_asked == 0) { ask = used = 0; error = xfs_rmapbt_calc_reserves(mp, tp, pag, &ask, &used); if (error) goto out; - error = __xfs_ag_resv_init(pag, XFS_AG_RESV_RMAPBT, ask, used); + error = xfs_agfl_calc_reserves(mp, tp, pag, &ask, &used); + if (error) + goto out; + + error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used); if (error) goto out; if (ask) @@ -336,7 +362,7 @@ xfs_ag_resv_init( */ if (!error && xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + - xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved > + xfs_perag_resv(pag, XFS_AG_RESV_AGFL)->ar_reserved > pag->pagf_freeblks + pag->pagf_flcount) error = -ENOSPC; } @@ -359,7 +385,6 @@ xfs_ag_resv_alloc_extent( switch (type) { case XFS_AG_RESV_AGFL: - return; case XFS_AG_RESV_METADATA: case XFS_AG_RESV_RMAPBT: resv = xfs_perag_resv(pag, type); @@ -376,7 +401,7 @@ xfs_ag_resv_alloc_extent( len = min_t(xfs_extlen_t, args->len, resv->ar_reserved); resv->ar_reserved -= len; - if (type == XFS_AG_RESV_RMAPBT) + if (type == XFS_AG_RESV_RMAPBT || type == XFS_AG_RESV_AGFL) return; /* Allocations of reserved blocks only need on-disk sb updates... */ xfs_trans_mod_sb(args->tp, XFS_TRANS_SB_RES_FDBLOCKS, -(int64_t)len); @@ -401,7 +426,6 @@ xfs_ag_resv_free_extent( switch (type) { case XFS_AG_RESV_AGFL: - return; case XFS_AG_RESV_METADATA: case XFS_AG_RESV_RMAPBT: resv = xfs_perag_resv(pag, type); @@ -416,7 +440,7 @@ xfs_ag_resv_free_extent( leftover = min_t(xfs_extlen_t, len, resv->ar_asked - resv->ar_reserved); resv->ar_reserved += leftover; - if (type == XFS_AG_RESV_RMAPBT) + if (type == XFS_AG_RESV_RMAPBT || type == XFS_AG_RESV_AGFL) return; /* Freeing into the reserved pool only requires on-disk update... */ xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, len); diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h index b74b210008ea..e9e963c8aebb 100644 --- a/fs/xfs/libxfs/xfs_ag_resv.h +++ b/fs/xfs/libxfs/xfs_ag_resv.h @@ -27,7 +27,8 @@ xfs_perag_resv( case XFS_AG_RESV_METADATA: return &pag->pag_meta_resv; case XFS_AG_RESV_RMAPBT: - return &pag->pag_rmapbt_resv; + case XFS_AG_RESV_AGFL: + return &pag->pag_agfl_resv; default: return NULL; } -- 2.24.4 >From 88e218d652f62d07a48510f9069a6747fed9ec0a Mon Sep 17 00:00:00 2001 From: Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> Date: Thu, 3 Nov 2022 21:10:25 +0800 Subject: [PATCH 2/2] xfs: extend the freelist before available space check Reported-by: Zirong Lang <zlang@xxxxxxxxxx> Signed-off-by: Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> Signed-off-by: Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> --- fs/xfs/libxfs/xfs_alloc.c | 189 ++++++++++++++++++++++++-------------- 1 file changed, 121 insertions(+), 68 deletions(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 989cf341779b..b72106bc9a94 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2362,7 +2362,6 @@ xfs_free_agfl_block( if (error) return error; xfs_trans_binval(tp, bp); - return 0; } @@ -2583,6 +2582,86 @@ xfs_exact_minlen_extent_available( } #endif +/* + * The freelist has to be in a good state before the available space check + * since multiple allocations could be performed from a single AG and + * transaction under certain conditions. For example, A bmbt allocation + * request made for inode extent to bmap format conversion after an extent + * allocation is expected to be satisfied by the same AG. Such bmap conversion + * allocation can fail due to the available space check if allocbt required an + * extra btree block from the freelist in the previous allocation but without + * making the freelist longer. + */ +int +xfs_fill_agfl( + struct xfs_alloc_arg *args, + int flags, + xfs_extlen_t need, + struct xfs_buf *agbp) +{ + struct xfs_trans *tp = args->tp; + struct xfs_perag *pag = agbp->b_pag; + struct xfs_alloc_arg targs = { + .tp = tp, + .mp = tp->t_mountp, + .agbp = agbp, + .agno = args->agno, + .alignment = 1, + .minlen = 1, + .prod = 1, + .type = XFS_ALLOCTYPE_THIS_AG, + .pag = pag, + }; + struct xfs_buf *agflbp = NULL; + xfs_agblock_t bno; + int error; + + if (flags & XFS_ALLOC_FLAG_NORMAP) + targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE; + else + targs.oinfo = XFS_RMAP_OINFO_AG; + + error = xfs_alloc_read_agfl(pag, tp, &agflbp); + if (error) + return error; + + /* Make the freelist longer if it's too short. */ + while (pag->pagf_flcount < need) { + targs.agbno = 0; + targs.maxlen = need - pag->pagf_flcount; + targs.resv = XFS_AG_RESV_AGFL; + + /* Allocate as many blocks as possible at once. */ + error = xfs_alloc_ag_vextent(&targs); + if (error) + goto out_agflbp_relse; + + /* + * Stop if we run out. Won't happen if callers are obeying + * the restrictions correctly. Can happen for free calls + * on a completely full ag. + */ + if (targs.agbno == NULLAGBLOCK) { + if (flags & XFS_ALLOC_FLAG_FREEING) + break; + goto out_agflbp_relse; + } + + /* + * Put each allocated block on the list. + */ + for (bno = targs.agbno; bno < targs.agbno + targs.len; bno++) { + error = xfs_alloc_put_freelist(pag, tp, agbp, + agflbp, bno, 0); + if (error) + goto out_agflbp_relse; + } + } +out_agflbp_relse: + xfs_trans_brelse(tp, agflbp); + return error; +} + /* * Decide whether to use this allocation group for this allocation. * If so, fix up the btree freelist's size. @@ -2596,8 +2675,7 @@ xfs_alloc_fix_freelist( struct xfs_perag *pag = args->pag; struct xfs_trans *tp = args->tp; struct xfs_buf *agbp = NULL; - struct xfs_buf *agflbp = NULL; - struct xfs_alloc_arg targs; /* local allocation arguments */ + struct xfs_owner_info oinfo; xfs_agblock_t bno; /* freelist block */ xfs_extlen_t need; /* total blocks needed in freelist */ int error = 0; @@ -2626,22 +2704,45 @@ xfs_alloc_fix_freelist( goto out_agbp_relse; } - need = xfs_alloc_min_freelist(mp, pag); - if (!xfs_alloc_space_available(args, need, flags | - XFS_ALLOC_FLAG_CHECK)) - goto out_agbp_relse; - /* - * Get the a.g. freespace buffer. - * Can fail if we're not blocking on locks, and it's held. + * See the comment above xfs_fill_agfl() for the reason why we need to + * make freelist longer here. Assumed that such case is quite rare, so + * in order to simplify the code, let's take agbp unconditionally. */ - if (!agbp) { - error = xfs_alloc_read_agf(pag, tp, flags, &agbp); - if (error) { - /* Couldn't lock the AGF so skip this AG. */ - if (error == -EAGAIN) - error = 0; - goto out_no_agbp; + need = xfs_alloc_min_freelist(mp, pag); + if (pag->pagf_flcount < need) { + /* + * Get the a.g. freespace buffer. + * Can fail if we're not blocking on locks, and it's held. + */ + if (!agbp) { + error = xfs_alloc_read_agf(pag, tp, flags, &agbp); + if (error) { + /* Couldn't lock the AGF so skip this AG. */ + if (error == -EAGAIN) + error = 0; + goto out_no_agbp; + } + } + + need = xfs_alloc_min_freelist(mp, pag); + error = xfs_fill_agfl(args, flags, need, agbp); + if (error) + goto out_agbp_relse; + } else { + if (!xfs_alloc_space_available(args, need, flags | + XFS_ALLOC_FLAG_CHECK)) + goto out_agbp_relse; + + /* Get the a.g. freespace buffer. */ + if (!agbp) { + error = xfs_alloc_read_agf(pag, tp, flags, &agbp); + if (error) { + /* Couldn't lock the AGF so skip this AG. */ + if (error == -EAGAIN) + error = 0; + goto out_no_agbp; + } } } @@ -2687,69 +2788,21 @@ xfs_alloc_fix_freelist( * regenerated AGFL, bnobt, and cntbt. See repair/phase5.c and * repair/rmap.c in xfsprogs for details. */ - memset(&targs, 0, sizeof(targs)); - /* struct copy below */ if (flags & XFS_ALLOC_FLAG_NORMAP) - targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE; + oinfo = XFS_RMAP_OINFO_SKIP_UPDATE; else - targs.oinfo = XFS_RMAP_OINFO_AG; + oinfo = XFS_RMAP_OINFO_AG; while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) { error = xfs_alloc_get_freelist(pag, tp, agbp, &bno, 0); if (error) goto out_agbp_relse; /* defer agfl frees */ - xfs_defer_agfl_block(tp, args->agno, bno, &targs.oinfo); - } - - targs.tp = tp; - targs.mp = mp; - targs.agbp = agbp; - targs.agno = args->agno; - targs.alignment = targs.minlen = targs.prod = 1; - targs.type = XFS_ALLOCTYPE_THIS_AG; - targs.pag = pag; - error = xfs_alloc_read_agfl(pag, tp, &agflbp); - if (error) - goto out_agbp_relse; - - /* Make the freelist longer if it's too short. */ - while (pag->pagf_flcount < need) { - targs.agbno = 0; - targs.maxlen = need - pag->pagf_flcount; - targs.resv = XFS_AG_RESV_AGFL; - - /* Allocate as many blocks as possible at once. */ - error = xfs_alloc_ag_vextent(&targs); - if (error) - goto out_agflbp_relse; - - /* - * Stop if we run out. Won't happen if callers are obeying - * the restrictions correctly. Can happen for free calls - * on a completely full ag. - */ - if (targs.agbno == NULLAGBLOCK) { - if (flags & XFS_ALLOC_FLAG_FREEING) - break; - goto out_agflbp_relse; - } - /* - * Put each allocated block on the list. - */ - for (bno = targs.agbno; bno < targs.agbno + targs.len; bno++) { - error = xfs_alloc_put_freelist(pag, tp, agbp, - agflbp, bno, 0); - if (error) - goto out_agflbp_relse; - } + xfs_defer_agfl_block(tp, args->agno, bno, &oinfo); } - xfs_trans_brelse(tp, agflbp); args->agbp = agbp; return 0; -out_agflbp_relse: - xfs_trans_brelse(tp, agflbp); out_agbp_relse: if (agbp) xfs_trans_brelse(tp, agbp); -- 2.24.4