[PATCH] xfs: fix btree splitting failure when AG space about to be exhausted

Recently I noticed a peculiar problem on our products: btree splitting
fails even though the disk has sufficient free space. After inspecting
the disk, I found that one specific AG is almost out of space. Worse,
once an AG is in this state, the btree split failure is triggered every
time and the XFS filesystem becomes unavailable.

After analyzing the disk image and the AG, the situation looks the same
as the one Gao Xiang ran into before [1]. The slight difference is that
my problem is triggered by creating a new inode. Having read through the
discussion on the mailing list [1], I think it is probably the same root
cause.

As Dave pointed out, args->minleft has an *exact* meaning: both inode fork
allocations and inode chunk extent allocations pre-calculate args->minleft
to ensure that inobt record insertion succeeds in all circumstances. But
this guarantee doesn't seem to be reliable, especially when the allocation
happens to coincide with cnt&bno btree splitting. Gao Xiang proposed a
solution that adds postalloc to make the current allocation reserve more
space for inobt splitting. I think that is fine for solving that
particular problem, but it may not solve the problem completely, since
inode chunk extent allocation may also fail during inobt splitting.

Meanwhile, Gao Xiang also noticed that the stripe alignment requirement
may increase the probability of the problem, which is totally true. I
think the reason is that an alignment requirement may split one free
extent into two. E.g. we need an extent of length 4 with alignment 4 and
find a suitable free extent [3,10] ([start,length]); after this
allocation, the leftover extents are [3,1] and [8,5]. Therefore, aligned
allocation is more likely to increase the number of free extents, which
may lead to cnt&bno btree splitting and so increases the likelihood of
the problem.
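
The alignment example can be checked with a small userspace sketch. This
is not kernel code: struct toy_extent and toy_alloc_aligned() are
hypothetical names used only for illustration, and the sketch assumes the
allocator simply rounds the start of the free extent up to the alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical free-extent descriptor, [start, length] in blocks. */
struct toy_extent {
	uint32_t start;
	uint32_t len;
};

/*
 * Carve an aligned allocation of 'len' blocks out of free extent 'fext'.
 * Stores the allocated extent in 'got' and any leftover free extents in
 * 'left'. Returns the number of leftover extents (0, 1 or 2), or -1 if
 * the aligned allocation does not fit.
 */
static int
toy_alloc_aligned(struct toy_extent fext, uint32_t len, uint32_t align,
		  struct toy_extent *got, struct toy_extent left[2])
{
	uint32_t end = fext.start + fext.len;
	uint32_t astart;
	int nleft = 0;

	assert(align > 0 && len > 0);

	/* Round the start up to the next multiple of 'align'. */
	astart = (fext.start + align - 1) / align * align;
	if (astart + len > end)
		return -1;

	got->start = astart;
	got->len = len;

	/* Blocks skipped in front of the aligned start stay free. */
	if (astart > fext.start) {
		left[nleft].start = fext.start;
		left[nleft].len = astart - fext.start;
		nleft++;
	}
	/* Blocks behind the allocation stay free as well. */
	if (astart + len < end) {
		left[nleft].start = astart + len;
		left[nleft].len = end - (astart + len);
		nleft++;
	}
	return nleft;
}
```

With free extent [3,10], length 4 and alignment 4, the sketch allocates
[4,4] and leaves two free extents, [3,1] and [8,5]. With alignment 1 the
same request leaves only one, [7,6], so the free extent count does not
grow.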

In my opinion, XFS has a variety of btrees, and in order to ensure that
these btrees can grow, XFS uses args->minleft, the agfl and perag
reservations, which correspond as follows:

perag reservation: for reverse map & free inode & refcount btree
args->minleft    : for inode btree & inode/attr fork btree
agfl             : for block btree (bnobt) & count btree (cntbt)
(rmapbt is an exception: it has a reservation but gets free blocks from
the agfl. Since agfl blocks are considered free when calculating the
available space, and rmapbt allocates blocks from its reservation,
*rmapbt growth* doesn't affect the available space calculation, so we
don't need to care about it here.)

Before each allocation, these reservations need to be calculated or
prepared. More precisely, `xfs_alloc_space_available` is called to
determine whether there is enough space to complete the current
allocation, including any btree growth it involves. If
xfs_alloc_space_available returns true, btree growth is supposed to be
guaranteed to succeed.

I think the root cause of the current problem is that when AG space is
about to be exhausted and the allocation happens to hit cnt&bno btree
splitting, `xfs_alloc_space_available` doesn't work well.

That is because, once btree splitting is taken into account, each "space
allocation" involves several block allocations:
1st. allocation of the space requested in the first place, i.e. extent A1.
2nd. free extents then need to be *inserted* or *updated* in cntbt & bnobt,
     which *may* lead to btree splitting and need allocation (as explained
     above)
3rd. extent A1 needs to be inserted into the inode/attr fork btree or inobt
     etc., which *may* also lead to splitting and allocation

So, during these block allocations, xfs_alloc_space_available is called
at least 2 times (the 2nd doesn't call it, because the bno&cnt btrees get
their blocks from the agfl). The 1st space-available check is meant to
guarantee that there is enough space to complete the 2nd and 3rd
allocations, *BUT* after the 2nd allocation, if the height of the bno&cnt
btrees has increased, the min_freelist of the agfl increases as well; more
accurately, the value returned by xfs_alloc_min_freelist increases. This
may cause the 3rd allocation to fail, and a 3rd allocation failure makes
the XFS filesystem unavailable.
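
The effect can be illustrated with a toy userspace model. It is a sketch
under loud assumptions, not the kernel implementation: struct toy_ag,
toy_min_freelist() and toy_space_available() are hypothetical names, the
min freelist is modeled as (height + 1) per free space btree with rmapbt
and the max-level clamp ignored, and the available-space formula keeps
only the free block, agfl, min freelist and minleft terms.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified per-AG state. */
struct toy_ag {
	unsigned int freeblks;	/* free blocks tracked by bnobt/cntbt */
	unsigned int flcount;	/* blocks currently on the agfl */
	unsigned int bno_level;	/* bnobt height */
	unsigned int cnt_level;	/* cntbt height */
};

/* Assumed model: each free space btree needs (height + 1) agfl blocks. */
static unsigned int
toy_min_freelist(const struct toy_ag *ag)
{
	assert(ag->bno_level >= 1 && ag->cnt_level >= 1);
	return (ag->bno_level + 1) + (ag->cnt_level + 1);
}

/*
 * Assumed model of the space check: free blocks plus usable agfl blocks,
 * minus the min freelist and minleft, must cover 'want' blocks.
 */
static bool
toy_space_available(const struct toy_ag *ag, unsigned int minleft,
		    unsigned int want)
{
	unsigned int min_free = toy_min_freelist(ag);
	unsigned int agfl = ag->flcount < min_free ? ag->flcount : min_free;
	int available = (int)ag->freeblks + (int)agfl -
			(int)min_free - (int)minleft;

	return available >= (int)want;
}
```

Take an AG with freeblks = 5, flcount = 4 and both btrees at height 1, so
the modeled min freelist is 4. The 1st check with minleft = 1 and
want = 4 passes (5 + 4 - 4 - 1 = 4). After the allocation freeblks is 1,
and a one-block btree allocation with minleft = 0 would still pass
(1 + 4 - 4 = 1). If the cntbt then grows to height 2 and consumes two
agfl blocks, the modeled min freelist becomes 5 and the same one-block
check fails (1 + 2 - 5 = -2), although nothing was allocated beyond what
the 1st check had accounted for.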

According to the above description, for every space allocation we have
already guaranteed that the agfl min free list is enough for bno&cnt btree
growth, by calling `xfs_alloc_fix_freelist` to reserve enough agfl blocks
before we do the 1st allocation, so the 2nd allocation will always
succeed. args->minleft guarantees that the 3rd allocation will make it, so
there is no need to re-check space availability in the 3rd allocation;
xfs_alloc_space_available should always return true there.

In summary, since btree alloc_block doesn't need any minleft, and both the
2nd and 3rd allocations are allocations for btrees, just treat these
allocations the same as freeing extents (callers with flag
XFS_ALLOC_FLAG_FREEING set).

[1] https://lore.kernel.org/linux-xfs/20221109034802.40322-1-hsiangkao@xxxxxxxxxxxxxxxxx/

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Guo Xuenan <guoxuenan@xxxxxxxxxx>
---
 fs/xfs/libxfs/xfs_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 989cf341779b..6d9ada93aec3 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2305,7 +2305,7 @@ xfs_alloc_space_available(
 	int			available;
 	xfs_extlen_t		agflcount;
 
-	if (flags & XFS_ALLOC_FLAG_FREEING)
+	if (flags & XFS_ALLOC_FLAG_FREEING || args->minleft == 0)
 		return true;
 
 	reservation = xfs_ag_resv_needed(pag, args->resv);
-- 
2.31.1



