Re: Question: reserve log space at IO time for recover

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Sun, 27 Aug 2023 09:04:30 -0700

On Sat, Aug 26, 2023 at 01:37:24PM +1000, Dave Chinner wrote:
> On Thu, Aug 24, 2023 at 03:01:54PM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 24, 2023 at 05:28:23PM +1000, Dave Chinner wrote:
> > > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote:
> > > > 
> > > > Since xfs_efi_item_recover is only performing one step of what could be
> > > > a chain of deferred updates, it never rolls the transaction that it
> > > > creates.  It therefore only requires the amount of grant space that
> > > > you'd get with tr_logcount == 1.  It is therefore a bit silly that we
> > > > ask for more than that, and in bad cases like this, hang log recovery
> > > > needlessly.
> > > 
> > > But this doesn't fix the whatever problem lead to the recovery not
> > > having the same full tr_itruncate reservation available as was held
> > > by the transaction that logged the EFI and was running the extent
> > > free at the time the system crashed. There should -always- be enough
> > > transaction reservation space in the journal to reserve space for an
> > > intent replay if the intent recovery reservation uses the same
> > > reservation type as runtime.
> > > 
> > > Hence I think this is still just a band-aid over whatever went wrong
> > > at runtime that lead to the log not having enough space for a
> > > reservation that was clearly held at runtime and hadn't yet used.
> > 
> > Maybe I'm not remembering accurately how permanent log reservations
> > work.  Let's continue picking on tr_itruncate from Wengang's example.
> > IIRC, he said that tr_itruncate on the running system was defined
> > roughly like so:
> > 
> > tr_itruncate = {
> > 	.tr_logres	= 180K
> > 	.tr_logcount	= 2,
> > 	.tr_logflags	= XFS_TRANS_PERM_LOG_RES,
> > }
> > 
> > At runtime, when we want to start a truncation update, we do this:
> > 
> > 	xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, ...);
> > 
> > Call sequence: xfs_trans_alloc -> xfs_trans_reserve -> xfs_log_reserve
> > -> xlog_ticket_alloc.  Ticket allocation assigns tic->t_unit_res =
> > (tr_logres + overhead) and tic->t_cnt = tr_logcount.
> > 
> > (Let's pretend for the sake of argument that the overhead is 5K.)
> > 
> > Now xfs_log_reserve calls xlog_grant_head_check.  That does:
> > 
> > 	*need_bytes = xlog_ticket_reservation(log, head, tic);
> > 
> > For the reserve head, the ticket reservation computation is
> > (tic->t_unit_res * tic->t_cnt), which in this case is (185K * 2) ==
> > 370K, right?  So we make sure there's at least 370K free in the reserve
> > head, then add that to the reserve and write heads.
> > 
> > Now that we've allocated the transaction, delete the bmap mapping,
> > log an EFI to free the space, and roll the transaction as part of
> > finishing the deferops chain.  Rolling creates a new xfs_trans which
> > shares its ticket with the old transaction.  Next, xfs_trans_roll calls
> > __xfs_trans_commit with regrant == true, which calls xlog_cil_commit
> > with the same regrant parameter.
> > 
> > xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and
> > subtracts t_curr_res from the reservation and write heads.
> > 
> > If the filesystem is fresh and the first transaction only used (say)
> > 20K, then t_curr_res will be 165K, and we give that much reservation
> > back to the reservation head.  Or if the file is really fragmented and
> > the first transaction actually uses 170K, then t_curr_res will be 15K,
> > and that's what we give back to the reservation.
> > 
> > Having done that, we're now headed into the second transaction with an
> > EFI and 185K of reservation, correct?
> 
> Ah, right, I overlooked that the long running truncates only regrant a
> single reservation unit at a time when they roll, and the runtime
> instances I seen of this are with long running truncate operations
> (i.e. inactivation after unlink for multi-thousand extent inodes).
> 
> So, yes, all types of intent recovery must only use a single unit
> reservation, not just EFIs, because there is no guarantee that there
> is a full unit * count reservation available in the journal whenever
> that intent was first logged. Indeed, at best it will be 'count - 1'
> that is avaialble, becuase the transaction that logged the intent
> will have used a full unit of the original reservation....

<nod> I think something like this will fix the problem, right?

diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 2420865f3007..a5100a11faf9 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -131,4 +131,26 @@ void xlog_check_buf_cancel_table(struct xlog *log);
 #define xlog_check_buf_cancel_table(log) do { } while (0)
 #endif
 
+/*
+ * Transform a regular reservation into one suitable for recovery of a log
+ * intent item.
+ *
+ * Intent recovery only runs a single step of the transaction chain and defers
+ * the rest to a separate transaction.  Therefore, we reduce logcount to 1 here
+ * to avoid livelocks if the log grant space is nearly exhausted due to the
+ * recovered intent pinning the tail.  Keep the same logflags to avoid tripping
+ * asserts elsewhere.  Struct copies abound below.
+ */
+static inline struct xfs_trans_res
+xlog_recover_resv(const struct xfs_trans_res *r)
+{
+	struct xfs_trans_res ret = {
+		.tr_logres	= r->tr_logres,
+		.tr_logcount	= 1,
+		.tr_logflags	= r->tr_logflags,
+	};
+
+	return ret;
+}
+
 #endif	/* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index 2788a6f2edcd..36fe2abb16e6 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -547,7 +547,7 @@ xfs_attri_item_recover(
 	struct xfs_inode		*ip;
 	struct xfs_da_args		*args;
 	struct xfs_trans		*tp;
-	struct xfs_trans_res		tres;
+	struct xfs_trans_res		resv;
 	struct xfs_attri_log_format	*attrp;
 	struct xfs_attri_log_nameval	*nv = attrip->attri_nameval;
 	int				error;
@@ -618,8 +618,9 @@ xfs_attri_item_recover(
 		goto out;
 	}
 
-	xfs_init_attr_trans(args, &tres, &total);
-	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp);
+	xfs_init_attr_trans(args, &resv, &total);
+	resv = xlog_recover_resv(&resv);
+	error = xfs_trans_alloc(mp, &resv, total, 0, XFS_TRANS_RESERVE, &tp);
 	if (error)
 		goto out;
 
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 7551c3ec4ea5..e736a0844c89 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -490,6 +490,7 @@ xfs_bui_item_recover(
 	struct list_head		*capture_list)
 {
 	struct xfs_bmap_intent		fake = { };
+	struct xfs_trans_res		resv;
 	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
@@ -515,7 +516,8 @@ xfs_bui_item_recover(
 		return error;
 
 	/* Allocate transaction and do the work. */
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv,
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
 	if (error)
 		goto err_rele;
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f1a5ecf099aa..3fa8789820ad 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -660,6 +660,7 @@ xfs_efi_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
+	struct xfs_trans_res		resv;
 	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
 	struct xfs_mount		*mp = lip->li_log->l_mp;
 	struct xfs_efd_log_item		*efdp;
@@ -683,7 +684,8 @@ xfs_efi_item_recover(
 		}
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp);
 	if (error)
 		return error;
 	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index edd8587658d5..2d4444d61e98 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -477,6 +477,7 @@ xfs_cui_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
+	struct xfs_trans_res		resv;
 	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
 	struct xfs_cud_log_item		*cudp;
 	struct xfs_trans		*tp;
@@ -514,8 +515,9 @@ xfs_cui_item_recover(
 	 * doesn't fit.  We need to reserve enough blocks to handle a
 	 * full btree split on either end of the refcount range.
 	 */
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
-			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv, mp->m_refc_maxlevels * 2, 0,
+			XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 520c7ebdfed8..0e0e747028da 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -507,6 +507,7 @@ xfs_rui_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
+	struct xfs_trans_res		resv;
 	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
 	struct xfs_rud_log_item		*rudp;
 	struct xfs_trans		*tp;
@@ -530,8 +531,9 @@ xfs_rui_item_recover(
 		}
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
-			mp->m_rmap_maxlevels, 0, XFS_TRANS_RESERVE, &tp);
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv, mp->m_rmap_maxlevels, 0,
+			XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
 	rudp = xfs_trans_get_rud(tp, ruip);

> > IOWs, I'm operating on an assumption that we have two problems to solve:
> > the runtime acconting bug that you've been chasing, and Wengang's where
> > log recovery asks for more log space than what had been in the log
> > ticket at the time of the crash.
> 
> OK, yes, it does seem there are two problems here...
> 
> > > I suspect that the cause of this recovery issue is the we require an
> > > overcommit of the log space accounting before we throttle incoming
> > > transaction reservations (i.e. the new reservation has to overrun
> > > before we throttle). I think that the reservation accounting overrun
> > > detection can race to the first item being placed on the wait list,
> > 
> > Yeah, I was wondering that myself when I was looking at the logic
> > between list_empty_careful and the second xlog_grant_head_wait and
> > wondering if that "careful" construction actually worked.
> 
> Well, it's not the careful check that is the problem; the race is
> much simpler than that. We just have to have two concurrent
> reservations that will both fit in the remaining log space
> individually but not together. i.e. this is the overcommit race
> window:
> 
> P1				P2
> ---------------------------	--------------------------
> xfs_log_reserve(10000 bytes)
>   xlog_grant_head_check()
>     xlog_space_left()
>       <sees 12000 bytes>
>     OK, doesn't wait
>   				xfs_log_reserve(10000 bytes)
> 				  xlog_grant_head_check()
> 				    xlog_space_left()
> 				      <sees 12000 bytes>
> 				    OK, doesn't wait
>     xlog_grant_add_space(10k)
>     				  xlog_grant_add_space(10k)
> 
> And now we've overcommitted the log by 8000 bytes....
> 
> The current byte based grant head patch set does not address this;
> the attempt I made to fix it didn't work properly, so I split the
> rework of the throttling out of the byte based accounting patchset
> (which seems to work correctly without reworking
> xlog_grant_head_check()) and now I'm trying to address this race
> condition as a separate patchset...

Oh.  Heh.  Yeah.

--D

> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx