Re: Question: reserve log space at IO time for recover

Dave Chinner <david@xxxxxxxxxxxxx> · Sat, 26 Aug 2023 13:37:24 +1000

On Thu, Aug 24, 2023 at 03:01:54PM -0700, Darrick J. Wong wrote:
> On Thu, Aug 24, 2023 at 05:28:23PM +1000, Dave Chinner wrote:
> > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote:
> > > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote:
> > > 
> > > Since xfs_efi_item_recover is only performing one step of what could be
> > > a chain of deferred updates, it never rolls the transaction that it
> > > creates.  It therefore only requires the amount of grant space that
> > > you'd get with tr_logcount == 1.  It is therefore a bit silly that we
> > > ask for more than that, and in bad cases like this, hang log recovery
> > > needlessly.
> > 
> > But this doesn't fix the whatever problem lead to the recovery not
> > having the same full tr_itruncate reservation available as was held
> > by the transaction that logged the EFI and was running the extent
> > free at the time the system crashed. There should -always- be enough
> > transaction reservation space in the journal to reserve space for an
> > intent replay if the intent recovery reservation uses the same
> > reservation type as runtime.
> > 
> > Hence I think this is still just a band-aid over whatever went wrong
> > at runtime that lead to the log not having enough space for a
> > reservation that was clearly held at runtime and hadn't yet used.
> 
> Maybe I'm not remembering accurately how permanent log reservations
> work.  Let's continue picking on tr_itruncate from Wengang's example.
> IIRC, he said that tr_itruncate on the running system was defined
> roughly like so:
> 
> tr_itruncate = {
> 	.tr_logres	= 180K
> 	.tr_logcount	= 2,
> 	.tr_logflags	= XFS_TRANS_PERM_LOG_RES,
> }
> 
> At runtime, when we want to start a truncation update, we do this:
> 
> 	xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, ...);
> 
> Call sequence: xfs_trans_alloc -> xfs_trans_reserve -> xfs_log_reserve
> -> xlog_ticket_alloc.  Ticket allocation assigns tic->t_unit_res =
> (tr_logres + overhead) and tic->t_cnt = tr_logcount.
> 
> (Let's pretend for the sake of argument that the overhead is 5K.)
> 
> Now xfs_log_reserve calls xlog_grant_head_check.  That does:
> 
> 	*need_bytes = xlog_ticket_reservation(log, head, tic);
> 
> For the reserve head, the ticket reservation computation is
> (tic->t_unit_res * tic->t_cnt), which in this case is (185K * 2) ==
> 370K, right?  So we make sure there's at least 370K free in the reserve
> head, then add that to the reserve and write heads.
> 
> Now that we've allocated the transaction, delete the bmap mapping,
> log an EFI to free the space, and roll the transaction as part of
> finishing the deferops chain.  Rolling creates a new xfs_trans which
> shares its ticket with the old transaction.  Next, xfs_trans_roll calls
> __xfs_trans_commit with regrant == true, which calls xlog_cil_commit
> with the same regrant parameter.
> 
> xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and
> subtracts t_curr_res from the reservation and write heads.
> 
> If the filesystem is fresh and the first transaction only used (say)
> 20K, then t_curr_res will be 165K, and we give that much reservation
> back to the reservation head.  Or if the file is really fragmented and
> the first transaction actually uses 170K, then t_curr_res will be 15K,
> and that's what we give back to the reservation.
> 
> Having done that, we're now headed into the second transaction with an
> EFI and 185K of reservation, correct?

Ah, right, I overlooked that the long running truncates only regrant a
single reservation unit at a time when they roll, and the runtime
instances I seen of this are with long running truncate operations
(i.e. inactivation after unlink for multi-thousand extent inodes).

So, yes, all types of intent recovery must only use a single unit
reservation, not just EFIs, because there is no guarantee that there
is a full unit * count reservation available in the journal whenever
that intent was first logged. Indeed, at best it will be 'count - 1'
that is avaialble, becuase the transaction that logged the intent
will have used a full unit of the original reservation....

> IOWs, I'm operating on an assumption that we have two problems to solve:
> the runtime acconting bug that you've been chasing, and Wengang's where
> log recovery asks for more log space than what had been in the log
> ticket at the time of the crash.

OK, yes, it does seem there are two problems here...

> > I suspect that the cause of this recovery issue is the we require an
> > overcommit of the log space accounting before we throttle incoming
> > transaction reservations (i.e. the new reservation has to overrun
> > before we throttle). I think that the reservation accounting overrun
> > detection can race to the first item being placed on the wait list,
> 
> Yeah, I was wondering that myself when I was looking at the logic
> between list_empty_careful and the second xlog_grant_head_wait and
> wondering if that "careful" construction actually worked.

Well, it's not the careful check that is the problem; the race is
much simpler than that. We just have to have two concurrent
reservations that will both fit in the remaining log space
individually but not together. i.e. this is the overcommit race
window:

P1				P2
---------------------------	--------------------------
xfs_log_reserve(10000 bytes)
  xlog_grant_head_check()
    xlog_space_left()
      <sees 12000 bytes>
    OK, doesn't wait
  				xfs_log_reserve(10000 bytes)
				  xlog_grant_head_check()
				    xlog_space_left()
				      <sees 12000 bytes>
				    OK, doesn't wait
    xlog_grant_add_space(10k)
    				  xlog_grant_add_space(10k)

And now we've overcommitted the log by 8000 bytes....

The current byte based grant head patch set does not address this;
the attempt I made to fix it didn't work properly, so I split the
rework of the throttling out of the byte based accounting patchset
(which seems to work correctly without reworking
xlog_grant_head_check()) and now I'm trying to address this race
condition as a separate patchset...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx