On 01 Jun 2021 at 04:58, Dave Chinner wrote: > On Mon, May 31, 2021 at 06:35:40PM +0530, Chandan Babu R wrote: >> On 31 May 2021 at 18:25, Chandan Babu R wrote: >> > On 27 May 2021 at 10:21, Dave Chinner wrote: >> >> From: Dave Chinner <dchinner@xxxxxxxxxx> >> >> >> >> Ever since we moved to freeing of extents by deferred operations, >> >> we've already freed extents via individual transactions. Hence the >> >> only limitation of how many extents we can mark for freeing in a >> >> single xfs_bunmapi() call bound only by how many deferrals we want >> >> to queue. >> >> >> >> That is xfs_bunmapi() doesn't actually do any AG based extent >> >> freeing, so there's no actually transaction reservation used up by >> >> calling bunmapi with a large count of extents to be freed. RT >> >> extents have always been freed directly by bunmapi, but that doesn't >> >> require modification of large number of blocks as there are no >> >> btrees to split. >> >> >> >> Some callers of xfs_bunmapi assume that the extent count being freed >> >> is bound by geometry (e.g. directories) and these can ask bunmapi to >> >> free up to 64 extents in a single call. These functions just work as >> >> tehy stand, so there's no reason for truncate to have a limit of >> >> just two extents per bunmapi call anymore. >> >> >> >> Increase XFS_ITRUNC_MAX_EXTENTS to 64 to match the number of extents >> >> that can be deferred in a single loop to match what the directory >> >> code already uses. >> >> >> >> For realtime inodes, where xfs_bunmapi() directly frees extents, >> >> leave the limit at 2 extents per loop as this is all the space that >> >> the transaction reservation will cover. >> >> >> >> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> >> >> --- >> >> fs/xfs/xfs_inode.c | 15 ++++++++++++--- >> >> 1 file changed, 12 insertions(+), 3 deletions(-) >> >> >> >> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c >> >> index 0369eb22c1bb..db220eaa34b8 100644 >> >> --- a/fs/xfs/xfs_inode.c >> >> +++ b/fs/xfs/xfs_inode.c >> >> @@ -40,9 +40,18 @@ kmem_zone_t *xfs_inode_zone; >> >> >> >> /* >> >> * Used in xfs_itruncate_extents(). This is the maximum number of extents >> >> - * freed from a file in a single transaction. >> >> + * we will unmap and defer for freeing in a single call to xfs_bunmapi(). >> >> + * Realtime inodes directly free extents in xfs_bunmapi(), so are bound >> >> + * by transaction reservation size to 2 extents. >> >> */ >> >> -#define XFS_ITRUNC_MAX_EXTENTS 2 >> >> +static inline int >> >> +xfs_itrunc_max_extents( >> >> + struct xfs_inode *ip) >> >> +{ >> >> + if (XFS_IS_REALTIME_INODE(ip)) >> >> + return 2; >> >> + return 64; >> >> +} >> >> >> >> STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *); >> >> STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *); >> >> @@ -1402,7 +1411,7 @@ xfs_itruncate_extents_flags( >> >> while (unmap_len > 0) { >> >> ASSERT(tp->t_firstblock == NULLFSBLOCK); >> >> error = __xfs_bunmapi(tp, ip, first_unmap_block, &unmap_len, >> >> - flags, XFS_ITRUNC_MAX_EXTENTS); >> >> + flags, xfs_itrunc_max_extents(ip)); >> >> if (error) >> >> goto out; >> > >> > The list of free extent items at xfs_defer_pending->dfp_work could >> > now contain XFS_EFI_MAX_FAST_EXTENTS (i.e. 16) entries in the worst case. > > Yes, but we do exactly this when freeing a large fragmented directly > block. That is, we ask xfs_bunmapi to unmap a 64kB range regardless > of how many extents map that range. > > IOWs, the limitation in extent count placed in > xfs_itruncate_extents() doesn't actually address the underlying > problem - it's just a band-aid that has been placed over the easy to > trigger transaction overrun symptom that has always been present in > the underlying extent freeing code. > >> > For a single transaction, xfs_calc_itruncate_reservation() reserves space for >> > logging only 4 extents (i.e. 4 exts * 2 trees * (2 * max depth - 1) * block >> > size). >> >> ... Sorry, I meant to say "xfs_calc_itruncate_reservation() reserves log space >> required for freeing 4 extents ..." > > My point exactly - the code has always had a mismatch between > reservations and what we can stuff into an EFI. EFIs are unbound in > size, so having a fixed "4 extents per EFI" reservation limit has > never made any real sense given that unmaping a 64kB directory block > on a 1kb block size filesystem has a worst case of freeing 64 > extents in a single transaction. As I said above, the limitations > placed on xfs_itruncate_extents is largely a hack because it doesn't > address other avenues to the same overruns... > Ok. Thanks for the explanation. > This "4 extents per transaction" reservation makes even less sense > now that extents are freed by defer ops, not by the transaction that > unmaps them. We've completely decoupled extent freeing from the > higher level code that unmaps them, and so now the freeing > transactions are independent of the high level code that runs the > freeing operations. > > IOWs, having a reservation big enough to free a single extent is all > we should need now as we can break the extent freeing up into > individual transactions and still have them complete atomically even > after a crash. That's where I'm trying to get to here, but it's > clear that I need to refine the code a bit further such that we only > allow individual EFIs to queue up and free multiple extents within > an AG to be freed in the same transaction.... Ok. That should limit the number of CNTBT/BNOBT blocks being logged by a transaction which is freeing extents. > > FWIW, I suspect that the right thing to do here is make use of the > xfs_defer_finish_one() mechanism for relogging the remaining intents > in work list while it is processing that list. All we need is for > xfs_extent_free_finish_item() to return -EAGAIN instead of running > the free transaction, and we will log the completed EFDs and the > remaining EFIs to be run and roll the transaction. > > Hence we can control exactly when we roll the extent freeing > transaction (e.g. when the next extent to free is in a different AG > to the one we've just been freeing extents in) and as a result we > can define a fixed transaction reservation for extent freeing that > works in every situation without having to care about arbitrary > "how many extents can we unmap without overrun" concerns. > That sounds perfect. > That's the goal here - fixed, small reservation for extent freeing > that is decoupled from and independent of the number of extents that > need to be freed atomically by the high level operation... > > Hence I think that a minor amount of rework will allow the EFI code > to log large numbers of extents to be freed whilst still processing > them within single AG free space tree modification reservation > bounds... -- chandan