On Fri, Apr 19, 2024 at 02:18:47PM +0800, Ye Bin wrote: > In the file system expansion test and concurrent file creation and > writing scenarios, file creation fails occasionally. > The detailed test scheme is as follows: > 1. If the remaining space is less than 128 MB, expand the space by 1 GB; > --xfs_growfs /$DEV -D $bc -m 100 > 2. 32 processes create a file every 0.5s and write 4 KB to 4 MB data randomly. > --filesize=$((RANDOM % 1024 + 1)) > --dd if=/dev/zero oflag=direct of=$filename bs=4K count=$filesize So this is racing growfs against a heap of concurrent file creates and 4kB direct IO writes which are doing individual 4kb allocations? And because of the way growfs works, after a while all the free space ends up in the last AG? i.e. there is effectively only one AG with free space that can be allocated out of? > There is a possibility that the file fails to be created after the preceding > steps are performed. However, when file creation fails, there are still > hundreds of megabytes of free space. Ok, but the problem with having transient ENOSPC conditions reported to the application is ....? I mean, this is pretty much expected behaviour when you are racing a highly concurrent create/write workload against ENOSPC and growfs.... > The process of failing to create a file > is as follows: > Direct write create file > xfs_direct_write_iomap_begin > xfs_iomap_write_direct > ... > xfs_alloc_ag_vextent_near > xfs_alloc_cur_finish > xfs_alloc_fixup_trees > xfs_btree_delete > xfs_btree_delrec > xfs_allocbt_update_lastrec > case LASTREC_DELREC: > numrecs = xfs_btree_get_numrecs(block); > if (numrecs == 0) > len = 0; > pag->pagf_longest = be32_to_cpu(len); At this point, the freespace tree is -empty-. i.e. We just removed the last free space from the tree, and the freespace code has no idea what the caller is going to do with that free space it just removed from the tree. It may well be using all of it, and so the AG must be marked as empty at this point as there is no free space remaining in it. Sure, xfs_alloc_fixup_trees() may well be about to insert the unused part of that back into the tree (i.e. the size has changed), but the btree tracking code has no way of knowing this, nor does xfs_alloc_fixup_trees() have any way of knowing that this is actually the last freespace extent in the btree that it is manipulating.... So, yeah, the lastrec update code has to do this, even for transient updates. > xfs_generic_create > xfs_create > xfs_dialloc > for_each_perag_wrap_at > xfs_dialloc_good_ag Did you check what this code does? It samples pag->pagf_longest to determine if the AG has enough space for the inode chunk allocation. So this check here ran before the last free space record was removed from the AG and said "there's enough space here!". This is how we scan AGs to find an allocation target - we do unlocked peeks at the freespace state and determine the best AG to allocate from. If this doesn't find an AG with enough space, then we'll declare ENOSPC here without even getting to the allocation code that takes AGF locks. We most certainly don't want this code taking locks here to avoid transient values.... > xfs_dialloc_try_ag ->The last AG to alloc inode > xfs_ialloc_ag_alloc > ... > xfs_alloc_vextent_prepare_ag > xfs_alloc_fix_freelist > xfs_alloc_space_available > longest = xfs_alloc_longest_free_extent() > ->As pag->pagf_longest is equal to zero > longest is equal 1 > if (longest < alloc_len) > return false; > -> will return false, no space to > allocate for inode But at this point where we sample the AG freespace we find that it is at ENOSPC, so the allocation is failed and returned back to the for_each_perag_wrap_at() loop for it to try a different AG. If no AGs can be allocated from, then ENOSPC is returned to the user. Essentially, the test case is resulting in a single freespace extent remaining in the very last AG with space available. This is a large extent because of the fact grwofs is adding contiguous 1GB freespace extents one at a time, and this ends up being the only free space available for allocation from. This is purely caused by the repeated growfs component of the workload, and hence when any allocation from the last AG fails, there is no fallback and userspace gets ENOSPC. > As there isn't hold AGF buffer's lock when call xfs_alloc_space_available() > first time in xfs_alloc_fix_freelist(). If remove the last right leaf record > of CNT btree will update 'pag->pagf_longest' with zero. This process is hold > AGF buffer's lock. > > Above test case constructs repeatedly allocate space within > the same AG, increasing the concurrency between the two processes. > To solve above issue, there's need to hold AGF buffer's lock before call > xfs_alloc_space_available() to judge space is available for request. Please think a little more carefully about what the code you are changing actually does. xfs_alloc_space_available() uses a "check-lock-check again" algorithm, and so moving the locking so it is a "lock-check-check again" makes the second set of checks redundant..... ---- Regardless, as I implied above about the AG selection scanning, I don't think this is the right way to fix this issue. We don't want AGs at ENOSPC to require scanners to take the AGF lock to determine if they really are at ENOSPC or not. This adds significant overhead and latency to the allocation algorithm, especially as AG count climbs into the thousands and allocations may have to scan hundreds to thousands of AGs as we near ENOSPC to find the best candidate to allocate from. What we actually want is for pag->pagf_longest not to change transiently to zero in xfs_alloc_fixup_trees(). If the delrec that zeroes the pagf_longest field is going to follow it up with an insrec that will set it back to a valid value, we really should not be doing the zeroing in the first place. Further, the only btree that tracks the right edge of the btree is the by-count allocbt. This isn't "generic" functionality, even though it is implemented through the generic btree code. If we lift ->update_lastrec from the generic code and do it directly in xfs_alloc.c whenever we are finished with a by-count tree update and the cursor points to a record in the right-most leaf of the tree, then we run the lastrec update directly at that point. By decoupling the lastrec updates from the individual record manipulations, we can make the transients disappear completely. This means all the unlocked AG scanning and the the check-lock-check again algorithm in xfs_alloc_space_available() would never see transients and so work reliably in these rare post-growfs-from-total-ENOSPC situations... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx