Re: [PATCH RFC 1/2] xfs: fix potential create file failed

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 22 Apr 2024 09:24:52 +1000

On Fri, Apr 19, 2024 at 02:18:47PM +0800, Ye Bin wrote:
> In the file system expansion test and concurrent file creation and
> writing scenarios, file creation fails occasionally.
> The detailed test scheme is as follows:
> 1. If the remaining space is less than 128 MB, expand the space by 1 GB;
>    --xfs_growfs /$DEV -D $bc -m 100
> 2. 32 processes create a file every 0.5s and write 4 KB to 4 MB data randomly.
>    --filesize=$((RANDOM % 1024 + 1))
>    --dd if=/dev/zero oflag=direct of=$filename bs=4K count=$filesize

So this is racing growfs against a heap of concurrent file creates
and 4kB direct IO writes which are doing individual 4kb allocations?

And because of the way growfs works, after a while all the free
space ends up in the last AG? i.e. there is effectively only one AG
with free space that can be allocated out of?

> There is a possibility that the file fails to be created after the preceding
> steps are performed. However, when file creation fails, there are still
> hundreds of megabytes of free space.

Ok, but the problem with having transient ENOSPC conditions reported
to the application is ....? I mean, this is pretty much expected
behaviour when you are racing a highly concurrent create/write
workload against ENOSPC and growfs....

> The process of failing to create a file
> is as follows:
>       Direct write                            create file
> xfs_direct_write_iomap_begin
>  xfs_iomap_write_direct
>    ...
>   xfs_alloc_ag_vextent_near
>    xfs_alloc_cur_finish
>     xfs_alloc_fixup_trees
>      xfs_btree_delete
>       xfs_btree_delrec
>        xfs_allocbt_update_lastrec
>         case LASTREC_DELREC:
> 	 numrecs = xfs_btree_get_numrecs(block);
> 	 if (numrecs == 0)
> 	  len = 0;
> 	 pag->pagf_longest = be32_to_cpu(len);

At this point, the freespace tree is -empty-.

i.e. We just removed the last free space from the tree, and the
freespace code has no idea what the caller is going to do with that
free space it just removed from the tree. It may well be using all
of it, and so the AG must be marked as empty at this point as there
is no free space remaining in it.

Sure, xfs_alloc_fixup_trees() may well be about to insert the unused
part of that back into the tree (i.e. the size has changed), but the
btree tracking code has no way of knowing this, nor does
xfs_alloc_fixup_trees() have any way of knowing that this is
actually the last freespace extent in the btree that it is
manipulating....

So, yeah, the lastrec update code has to do this, even for transient
updates.

> 	                                xfs_generic_create
> 					 xfs_create
>                                           xfs_dialloc
> 					   for_each_perag_wrap_at
> 					    xfs_dialloc_good_ag

Did you check what this code does? It samples pag->pagf_longest
to determine if the AG has enough space for the inode chunk
allocation. So this check here ran before the last free space record
was removed from the AG and said "there's enough space here!".

This is how we scan AGs to find an allocation target - we do
unlocked peeks at the freespace state and determine the best AG to
allocate from. If this doesn't find an AG with enough space, then
we'll declare ENOSPC here without even getting to the allocation
code that takes AGF locks. We most certainly don't want this code
taking locks here to avoid transient values....

> 					     xfs_dialloc_try_ag  ->The last AG to alloc inode
> 					      xfs_ialloc_ag_alloc
> 					       ...
> 					       xfs_alloc_vextent_prepare_ag
> 					        xfs_alloc_fix_freelist
> 						 xfs_alloc_space_available
> 						  longest = xfs_alloc_longest_free_extent()
> 	                                           ->As pag->pagf_longest is equal to zero
> 						     longest is equal 1
> 	                                          if (longest < alloc_len)
> 	 						return false;
> 						  -> will return false, no space to
> 						     allocate for inode

But at this point where we sample the AG freespace we find that it
is at ENOSPC, so the allocation is failed and returned back to the
for_each_perag_wrap_at() loop for it to try a different AG. If no
AGs can be allocated from, then ENOSPC is returned to the user.

Essentially, the test case is resulting in a single freespace extent
remaining in the very last AG with space available.  This is a large
extent because of the fact grwofs is adding contiguous 1GB freespace
extents one at a time, and this ends up being the only free space
available for allocation from. This is purely caused by the repeated
growfs component of the workload, and hence when any allocation
from the last AG fails, there is no fallback and userspace gets
ENOSPC.

> As there isn't hold AGF buffer's lock when call xfs_alloc_space_available()
> first time in xfs_alloc_fix_freelist(). If remove the last right leaf record
> of CNT btree will update 'pag->pagf_longest' with zero. This process is hold
> AGF buffer's lock.
>
> Above test case constructs repeatedly allocate space within
> the same AG, increasing the concurrency between the two processes.
> To solve above issue, there's need to hold AGF buffer's lock before call
> xfs_alloc_space_available() to judge space is available for request.

Please think a little more carefully about what the code you are
changing actually does. xfs_alloc_space_available() uses a
"check-lock-check again" algorithm, and so moving the locking so it
is a "lock-check-check again" makes the second set of checks
redundant.....

----

Regardless, as I implied above about the AG selection scanning, I
don't think this is the right way to fix this issue.  We don't want
AGs at ENOSPC to require scanners to take the AGF lock to determine
if they really are at ENOSPC or not.  This adds significant overhead
and latency to the allocation algorithm, especially as AG count
climbs into the thousands and allocations may have to scan hundreds
to thousands of AGs as we near ENOSPC to find the best candidate to
allocate from.

What we actually want is for pag->pagf_longest not to change
transiently to zero in xfs_alloc_fixup_trees(). If the delrec that
zeroes the pagf_longest field is going to follow it up with an
insrec that will set it back to a valid value, we really should not
be doing the zeroing in the first place.

Further, the only btree that tracks the right edge of the btree is
the by-count allocbt. This isn't "generic" functionality, even
though it is implemented through the generic btree code. If we lift
->update_lastrec from the generic code and do it directly in
xfs_alloc.c whenever we are finished with a by-count tree update
and the cursor points to a record in the right-most leaf of the
tree, then we run the lastrec update directly at that point. 
By decoupling the lastrec updates from the individual record
manipulations, we can make the transients disappear completely.

This means all the unlocked AG scanning and the
the check-lock-check again algorithm in xfs_alloc_space_available()
would never see transients and so work reliably in these rare
post-growfs-from-total-ENOSPC situations...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx