On Tue, Jun 25, 2024 at 09:46:51AM +0800, Zizhi Wo wrote: > A concurrent file creation and little writing could unexpectedly return > -ENOSPC error since there is a race window that the allocator could get > the wrong agf->agf_longest. > > Write file process steps: > 1) Find the entry that best meets the conditions, then calculate the start > address and length of the remaining part of the entry after allocation. > 2) Delete this entry and update the -current- agf->agf_longest. > 3) Insert the remaining unused parts of this entry based on the > calculations in 1), and update the agf->agf_longest again if necessary. > > Create file process steps: > 1) Check whether there are free inodes in the inode chunk. > 2) If there is no free inode, check whether there has space for creating > inode chunks, perform the no-lock judgment first. > 3) If the judgment succeeds, the judgment is performed again with agf lock > held. Otherwire, an error is returned directly. > > If the write process is in step 2) but not go to 3) yet, the create file > process goes to 2) at this time, it may be mistaken for no space, > resulting in the file system still has space but the file creation fails. > > We have sent two different commits to the community in order to fix this > problem[1][2]. Unfortunately, both solutions have flaws. In [2], I > discussed with Dave and Darrick, realized that a better solution to this > problem requires the "last cnt record tracking" to be ripped out of the > generic btree code. And surprisingly, Dave directly provided his fix code. > This patch includes appropriate modifications based on his tmp-code to > address this issue. > > The entire fix can be roughly divided into two parts: > 1) Delete the code related to lastrec-update in the generic btree code. > 2) Place the process of updating longest freespace with cntbt separately > to the end of the cntbt modifications. Move the cursor to the rightmost > firstly, and update the longest free extent based on the record. > > Note that we can not update the longest with xfs_alloc_get_rec() after > find the longest record, as xfs_verify_agbno() may not pass because > pag->block_count is updated on the outside. Therefore, use > xfs_btree_get_rec() as a replacement. > > [1] https://lore.kernel.org/all/20240419061848.1032366-2-yebin10@xxxxxxxxxx > [2] https://lore.kernel.org/all/20240604071121.3981686-1-wozizhi@xxxxxxxxxx > > Reported by: Ye Bin <yebin10@xxxxxxxxxx> > Signed-off-by: Zizhi Wo <wozizhi@xxxxxxxxxx> > --- > fs/xfs/libxfs/xfs_alloc.c | 115 ++++++++++++++++++++++++++++++++ > fs/xfs/libxfs/xfs_alloc_btree.c | 64 ------------------ > fs/xfs/libxfs/xfs_btree.c | 51 -------------- > fs/xfs/libxfs/xfs_btree.h | 16 +---- > 4 files changed, 116 insertions(+), 130 deletions(-) Mostly looks good. One small thing to fix, though. > +/* > + * Find the rightmost record of the cntbt, and return the longest free space > + * recorded in it. Simply set both the block number and the length to their > + * maximum values before searching. > + */ > +static int > +xfs_cntbt_longest( > + struct xfs_btree_cur *cnt_cur, > + xfs_extlen_t *longest) > +{ > + struct xfs_alloc_rec_incore irec; > + union xfs_btree_rec *rec; > + int stat = 0; > + int error; > + > + memset(&cnt_cur->bc_rec, 0xFF, sizeof(cnt_cur->bc_rec)); > + error = xfs_btree_lookup(cnt_cur, XFS_LOOKUP_LE, &stat); > + if (error) > + return error; > + if (!stat) { > + /* totally empty tree */ > + *longest = 0; > + return 0; > + } > + > + error = xfs_btree_get_rec(cnt_cur, &rec, &stat); > + if (error) > + return error; > + if (!stat) { > + ASSERT(0); > + *longest = 0; > + return 0; If we don't find a record, some kind of btree corruption has been encountered. Rather than "ASSERT(0)" here, this should fail in production systems in a way that admins and online repair will notice: if (XFS_IS_CORRUPT(mp, stat != 0)) { xfs_btree_mark_sick(cnt_cur); return -EFSCORRUPTED; } -Dave. -- Dave Chinner david@xxxxxxxxxxxxx