On Sat, Aug 26, 2023 at 01:37:24PM +1000, Dave Chinner wrote: > On Thu, Aug 24, 2023 at 03:01:54PM -0700, Darrick J. Wong wrote: > > On Thu, Aug 24, 2023 at 05:28:23PM +1000, Dave Chinner wrote: > > > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: > > > > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > > > > > > > > Since xfs_efi_item_recover is only performing one step of what could be > > > > a chain of deferred updates, it never rolls the transaction that it > > > > creates. It therefore only requires the amount of grant space that > > > > you'd get with tr_logcount == 1. It is therefore a bit silly that we > > > > ask for more than that, and in bad cases like this, hang log recovery > > > > needlessly. > > > > > > But this doesn't fix the whatever problem lead to the recovery not > > > having the same full tr_itruncate reservation available as was held > > > by the transaction that logged the EFI and was running the extent > > > free at the time the system crashed. There should -always- be enough > > > transaction reservation space in the journal to reserve space for an > > > intent replay if the intent recovery reservation uses the same > > > reservation type as runtime. > > > > > > Hence I think this is still just a band-aid over whatever went wrong > > > at runtime that lead to the log not having enough space for a > > > reservation that was clearly held at runtime and hadn't yet used. > > > > Maybe I'm not remembering accurately how permanent log reservations > > work. Let's continue picking on tr_itruncate from Wengang's example. > > IIRC, he said that tr_itruncate on the running system was defined > > roughly like so: > > > > tr_itruncate = { > > .tr_logres = 180K > > .tr_logcount = 2, > > .tr_logflags = XFS_TRANS_PERM_LOG_RES, > > } > > > > At runtime, when we want to start a truncation update, we do this: > > > > xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, ...); > > > > Call sequence: xfs_trans_alloc -> xfs_trans_reserve -> xfs_log_reserve > > -> xlog_ticket_alloc. Ticket allocation assigns tic->t_unit_res = > > (tr_logres + overhead) and tic->t_cnt = tr_logcount. > > > > (Let's pretend for the sake of argument that the overhead is 5K.) > > > > Now xfs_log_reserve calls xlog_grant_head_check. That does: > > > > *need_bytes = xlog_ticket_reservation(log, head, tic); > > > > For the reserve head, the ticket reservation computation is > > (tic->t_unit_res * tic->t_cnt), which in this case is (185K * 2) == > > 370K, right? So we make sure there's at least 370K free in the reserve > > head, then add that to the reserve and write heads. > > > > Now that we've allocated the transaction, delete the bmap mapping, > > log an EFI to free the space, and roll the transaction as part of > > finishing the deferops chain. Rolling creates a new xfs_trans which > > shares its ticket with the old transaction. Next, xfs_trans_roll calls > > __xfs_trans_commit with regrant == true, which calls xlog_cil_commit > > with the same regrant parameter. > > > > xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and > > subtracts t_curr_res from the reservation and write heads. > > > > If the filesystem is fresh and the first transaction only used (say) > > 20K, then t_curr_res will be 165K, and we give that much reservation > > back to the reservation head. Or if the file is really fragmented and > > the first transaction actually uses 170K, then t_curr_res will be 15K, > > and that's what we give back to the reservation. > > > > Having done that, we're now headed into the second transaction with an > > EFI and 185K of reservation, correct? > > Ah, right, I overlooked that the long running truncates only regrant a > single reservation unit at a time when they roll, and the runtime > instances I seen of this are with long running truncate operations > (i.e. inactivation after unlink for multi-thousand extent inodes). > > So, yes, all types of intent recovery must only use a single unit > reservation, not just EFIs, because there is no guarantee that there > is a full unit * count reservation available in the journal whenever > that intent was first logged. Indeed, at best it will be 'count - 1' > that is avaialble, becuase the transaction that logged the intent > will have used a full unit of the original reservation.... <nod> I think something like this will fix the problem, right? diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h index 2420865f3007..a5100a11faf9 100644 --- a/fs/xfs/libxfs/xfs_log_recover.h +++ b/fs/xfs/libxfs/xfs_log_recover.h @@ -131,4 +131,26 @@ void xlog_check_buf_cancel_table(struct xlog *log); #define xlog_check_buf_cancel_table(log) do { } while (0) #endif +/* + * Transform a regular reservation into one suitable for recovery of a log + * intent item. + * + * Intent recovery only runs a single step of the transaction chain and defers + * the rest to a separate transaction. Therefore, we reduce logcount to 1 here + * to avoid livelocks if the log grant space is nearly exhausted due to the + * recovered intent pinning the tail. Keep the same logflags to avoid tripping + * asserts elsewhere. Struct copies abound below. + */ +static inline struct xfs_trans_res +xlog_recover_resv(const struct xfs_trans_res *r) +{ + struct xfs_trans_res ret = { + .tr_logres = r->tr_logres, + .tr_logcount = 1, + .tr_logflags = r->tr_logflags, + }; + + return ret; +} + #endif /* __XFS_LOG_RECOVER_H__ */ diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c index 2788a6f2edcd..36fe2abb16e6 100644 --- a/fs/xfs/xfs_attr_item.c +++ b/fs/xfs/xfs_attr_item.c @@ -547,7 +547,7 @@ xfs_attri_item_recover( struct xfs_inode *ip; struct xfs_da_args *args; struct xfs_trans *tp; - struct xfs_trans_res tres; + struct xfs_trans_res resv; struct xfs_attri_log_format *attrp; struct xfs_attri_log_nameval *nv = attrip->attri_nameval; int error; @@ -618,8 +618,9 @@ xfs_attri_item_recover( goto out; } - xfs_init_attr_trans(args, &tres, &total); - error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp); + xfs_init_attr_trans(args, &resv, &total); + resv = xlog_recover_resv(&resv); + error = xfs_trans_alloc(mp, &resv, total, 0, XFS_TRANS_RESERVE, &tp); if (error) goto out; diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 7551c3ec4ea5..e736a0844c89 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -490,6 +490,7 @@ xfs_bui_item_recover( struct list_head *capture_list) { struct xfs_bmap_intent fake = { }; + struct xfs_trans_res resv; struct xfs_bui_log_item *buip = BUI_ITEM(lip); struct xfs_trans *tp; struct xfs_inode *ip = NULL; @@ -515,7 +516,8 @@ xfs_bui_item_recover( return error; /* Allocate transaction and do the work. */ - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, + resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate); + error = xfs_trans_alloc(mp, &resv, XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp); if (error) goto err_rele; diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index f1a5ecf099aa..3fa8789820ad 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -660,6 +660,7 @@ xfs_efi_item_recover( struct xfs_log_item *lip, struct list_head *capture_list) { + struct xfs_trans_res resv; struct xfs_efi_log_item *efip = EFI_ITEM(lip); struct xfs_mount *mp = lip->li_log->l_mp; struct xfs_efd_log_item *efdp; @@ -683,7 +684,8 @@ xfs_efi_item_recover( } } - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); + resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate); + error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); if (error) return error; efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c index edd8587658d5..2d4444d61e98 100644 --- a/fs/xfs/xfs_refcount_item.c +++ b/fs/xfs/xfs_refcount_item.c @@ -477,6 +477,7 @@ xfs_cui_item_recover( struct xfs_log_item *lip, struct list_head *capture_list) { + struct xfs_trans_res resv; struct xfs_cui_log_item *cuip = CUI_ITEM(lip); struct xfs_cud_log_item *cudp; struct xfs_trans *tp; @@ -514,8 +515,9 @@ xfs_cui_item_recover( * doesn't fit. We need to reserve enough blocks to handle a * full btree split on either end of the refcount range. */ - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, - mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp); + resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate); + error = xfs_trans_alloc(mp, &resv, mp->m_refc_maxlevels * 2, 0, + XFS_TRANS_RESERVE, &tp); if (error) return error; diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c index 520c7ebdfed8..0e0e747028da 100644 --- a/fs/xfs/xfs_rmap_item.c +++ b/fs/xfs/xfs_rmap_item.c @@ -507,6 +507,7 @@ xfs_rui_item_recover( struct xfs_log_item *lip, struct list_head *capture_list) { + struct xfs_trans_res resv; struct xfs_rui_log_item *ruip = RUI_ITEM(lip); struct xfs_rud_log_item *rudp; struct xfs_trans *tp; @@ -530,8 +531,9 @@ xfs_rui_item_recover( } } - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, - mp->m_rmap_maxlevels, 0, XFS_TRANS_RESERVE, &tp); + resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate); + error = xfs_trans_alloc(mp, &resv, mp->m_rmap_maxlevels, 0, + XFS_TRANS_RESERVE, &tp); if (error) return error; rudp = xfs_trans_get_rud(tp, ruip); > > IOWs, I'm operating on an assumption that we have two problems to solve: > > the runtime acconting bug that you've been chasing, and Wengang's where > > log recovery asks for more log space than what had been in the log > > ticket at the time of the crash. > > OK, yes, it does seem there are two problems here... > > > > I suspect that the cause of this recovery issue is the we require an > > > overcommit of the log space accounting before we throttle incoming > > > transaction reservations (i.e. the new reservation has to overrun > > > before we throttle). I think that the reservation accounting overrun > > > detection can race to the first item being placed on the wait list, > > > > Yeah, I was wondering that myself when I was looking at the logic > > between list_empty_careful and the second xlog_grant_head_wait and > > wondering if that "careful" construction actually worked. > > Well, it's not the careful check that is the problem; the race is > much simpler than that. We just have to have two concurrent > reservations that will both fit in the remaining log space > individually but not together. i.e. this is the overcommit race > window: > > P1 P2 > --------------------------- -------------------------- > xfs_log_reserve(10000 bytes) > xlog_grant_head_check() > xlog_space_left() > <sees 12000 bytes> > OK, doesn't wait > xfs_log_reserve(10000 bytes) > xlog_grant_head_check() > xlog_space_left() > <sees 12000 bytes> > OK, doesn't wait > xlog_grant_add_space(10k) > xlog_grant_add_space(10k) > > And now we've overcommitted the log by 8000 bytes.... > > The current byte based grant head patch set does not address this; > the attempt I made to fix it didn't work properly, so I split the > rework of the throttling out of the byte based accounting patchset > (which seems to work correctly without reworking > xlog_grant_head_check()) and now I'm trying to address this race > condition as a separate patchset... Oh. Heh. Yeah. --D > -Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx