Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Mon, 16 Sep 2019 21:31:56 -0700

On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote:
> On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > > This is an instance of xfsaild going idle between the time this
> > > > > new AIL push sets the target based on the iclog about to be
> > > > > committed and AIL insertion of the associated log items,
> > > > > reproduced via a bit of timing instrumentation.  Don't be
> > > > > distracted by the timestamps or the fact that the LSNs do not
> > > > > match because the log items in the AIL end up indexed by the start
> > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > > commit record). The point is simply that xfsaild has completed a
> > > > > push of a target that hasn't been inserted yet.
> > > > 
> > > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > > point it violates a fundamental assumption about commit sizes, which
> > > > is why I largely think it's irrelevant.
> > > > 
> > > 
> > > The CIL checkpoint size is an unrelated side effect of the test I
> > > happened to use, not a fundamental cause of the problem it demonstrates.
> > > Fixing CIL checkpoint size issues won't change anything. Here's a
> > > different variant of the same problem with a small enough number of log
> > > items such that background CIL pushing is not a factor:
> > > 
> > >        <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > kworker/0:1H-220   [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > 	...
> > >        <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > >        <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > kworker/0:1H-220   [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > kworker/0:1H-220   [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > 	....
> > >
> > > This sequence starts with one log item in the AIL and some number of
> > > items in the CIL such that a checkpoint executes from the background log
> > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > outstanding transaction reservation and small AIL size. This push races
> > > with completion of a previous push that empties the AIL and iclog
> > > callbacks insert log items for the current checkpoint at the LSN target
> > > xfsaild just idled at.
> > 
> > I'm just not seeing what the problem here is. The behaviour you are
> > describing has been around since day zero and doesn't require the
> > addition of an ail push from iclog completion to trigger.  Prior to
> > this series, it would be:
> > 
> 
> A few days ago you said that if we're inserting log items before the
> push target, "something is very wrong." Since this was what I was
> concerned about, I attempted to manufacture the issue to demonstrate.
> You suggested the first reproducer I came up with was a separate problem
> (related to CIL size issues), so I came up with the one above to avoid
> that distraction. Now you're telling me this has always happened and is
> fine..
> 
> While I don't think this is quite accurate (more below), I do find this
> reasoning somewhat amusing in that it essentially implies that this
> patch itself is dubious. If this new AIL push is required to fix a real
> issue, and this race is essentially manifest as implied, then this patch
> can't possibly reliably fix the original problem. Anyways, that is
> neither here nor there..
> 
> All of the details of this particular issue aside, I do think there's a
> development process problem here. It shouldn't require an extended game
> of whack-a-mole with this kind of inconsistent reasoning just to request
> a trivial change to a patch (you also implied in a previous response it
> was me wasting your time on this topic) that closes an obvious race and
> otherwise has no negative effect. Someone is being unreasonable here and
> I don't think it's me. More importantly, discussion of open issues
> shouldn't be a race against the associated patch being merged. :/
> 
> > process 1	reservation	log completion	xfsaild
> > <completes metadata IO>
> >   xfs_ail_delete()
> >     mlip_changed
> >     xlog_assign_tail_lsn_locked()
> >       ail empty, sets l_last_sync = 0x1000032e2
> >     xfs_log_space_wake()
> > 				xlog_state_do_callback
> > 				  sets CALLBACK
> > 				  sets last_sync_lsn to iclog head
> > 				    -> 0x1000032e4
> > 				  <drops icloglock, gets preempted>
> > 		<wakes>
> > 		xlog_grant_head_wait
> > 		  free_bytes < need_bytes
> > 		    xlog_grant_push_ail()
> > 		      xlog_push_ail()
> > 		        ->ail_target 0x1000032e4
> > 		<sleeps>
> > 						<wakes>
> > 						sets prev target to 0x1000032e4
> > 						sees empty AIL
> > 						<sleeps>
> > 				    <runs again>
> > 				    runs callbacks
> > 				      xfs_ail_insert(lsn = 0x1000032e4)
> > 
> > and now we have the AIL push thread asleep with items in it at the
> > push threshold.  IOWs, what you describe has always been possible,
> > and before the CIL was introduced this sort of thing happened quite
> > a bit because iclog completions freed up much less space in the log
> > than a CIL commit completion.
> > 
> 
> I was suspicious that this could occur prior to this change but I hadn't
> confirmed. The scenario documented above cannot occur because a push on
> an empty AIL has no effect. The target doesn't move and the task isn't
> woken. That said, I still suspect the race can occur with the current
> code via between a grant head waiter, AIL emptying and iclog completion.
> 
> This just speaks to the frequency of the problem, though. I'm not
> convinced it's something that happens "quite a bit" given the nature of
> the 3-way race. I also don't agree that existence of a historical
> problem somehow excuses introduction a new variant of the same problem.
> Instead, if this patch exposes a historical problem that simply had no
> noticeable impact to this point, we should probably look into whether it
> needs fixing too.
> 
> > It's not a problem, however, because if we are out of transaction
> > reservation space we must have transactions in progress, and as long
> > as they make progress then the commit of each transaction will end
> > up calling xlog_ungrant_log_space() to return the unused portion of
> > the transaction reservation. That calls xfs_log_space_wake() to
> > allow reservation waiters to try to make progress.
> > 
> 
> Yes, this is why I don't see immediate side effects in the tests I've
> run so far. The assumptions you're basing this off are not always true,
> however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> easy to produce conditions where the entire reservation space is
> consumed by open transactions that don't ultimately commit anything to
> the log subsystem and thus generate no forward progress.
> 
> > If there's still not enough space reservation after the transaction
> > in progress has released it's reservation, then it goes back to
> > sleep. As long as we have active transactions in progress while
> > there are transaction reservations waiting on reservation space,
> > there will be a wakeup vector for the reservation independent of
> > the CIL, iclogs and AIL behaviour.
> > 
> 
> We do have clean transaction cancel and error scenarios, existing log
> deadlock vectors, increasing reliance on long running transactions via
> deferred ops, scrub, etc. Also consider the fact that open transactions
> consume considerably more reservation than committed transactions on
> average.
> 
> I'm not saying it's likely for a real world workload to consume the
> entirety of log reservation space via open transactions and then release
> it without filesystem modification (and then race with log I/O and AIL
> emptying), but from the perspective of proving the existence of a bug
> it's really not that difficult to produce. I've not seen a real world
> workload that reproduces the problems fixed by any of these patches
> either, but we still fix them.
> 
> > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > not issue a wakeup because of not enough space being availble and
> > the push target was limited by the old log head location. i.e.
> > nothing ever updated the push target to reflect the new log head and
> > so the tail might never get moved now. That particular bug was fixed
> > by a an earlier patch in the series, so we can ignore it here. ]
> > 
> > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > the log space, and we have transactions waiting on log reservation
> > space, then we must have enough transactions in progress to cover at
> > least 75% of the log space. Completion of those transactions will
> > wake waiters and, if necessary, push the AIL again to keep the log
> > tail moving appropriately. This handles the AIL empty and "insert
> > before target" situations you are concerned about just fine, as long
> > as we have a guarantee of forwards progress. Bounding the CIL size
> > provides that forwards progress guarantee for the CIL...
> > 
> 
> I think you have some tunnel vision or something going on here with
> regard to the higher level architectural view of how things are supposed
> to operate in a normal running/steady state vs simply what can and
> cannot happen in the code. I can't really tell why/how, but the only
> suggestion I can make is to perhaps separate from this high level view
> of things and take a closer look at the code. This is a simple code bug,
> not some grand architectural flaw. The context here is way out of whack.
> The repeated unrelated and overblown architectural assertions come off
> as indication of lack of any real argument to allow this race to live.
> There is simply no such guarantee of forward progress in all scenarios
> that produce the conditions that can cause this race.
> 
> Yet another example:
> 
>            <...>-369   [002] ...2   220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
>            <...>-27    [003] ...1   224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
>            <...>-404   [003] ...1   224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
>      kworker/3:1-39    [003] ...2   224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
>     xfsaild/dm-4-1034  [000] ....   224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
>     kworker/3:1H-404   [003] ...2   225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
>      kworker/3:1-39    [003] ...1   254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> 	...
>      kworker/3:2-1920  [003] ...1  3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> 
> 
> # cat /sys/fs/xfs/dm-4/log/log_*lsn
> 1:252
> 1:250
> 
> This instance of the race uses the same serialization instrumentation to
> control execution timing and whatnot as before (i.e. no functional
> changes). First, an item is inserted into the AIL. Immediately after AIL
> insertion, another transaction commits to the CIL (not shown in the
> trace). The background log worker comes around a few seconds later and
> forces the log/CIL. The checkpoint for this log force races with an AIL
> delete and idle (same as before). AIL insertion occurs at the push
> target xfsaild just idled at, but this time reservation pressure
> relieves and the filesystem goes idle.
> 
> At this point, nothing occurs on the fs except for continuous background
> log worker jobs. Note the timestamp difference between the first
> post-race log force and the last in the trace. The log worker runs at
> the default 30s interval and has run repeatedly for almost an hour while
> failing to push the AIL and subsequently cover the log. To confirm the
> AIL is populated, see the log head/tail LSNs reported via sysfs. This
> state persists indefinitely so long as the fs is idle. This is a bug.

/me stumbles back in after ~2wks, and has a few questions:

1) Are these concerns a reason to hold up this series, or are they a
separate bug lurking in the code being touched by the series?  AFAICT I
think it's the second, but <shrug> my brain is still mush.

2) Er... how do you get the log stuck like this?  I see things earlier
in the thread like "open transactions that don't ultimately commit
anything to the log subsystem" and think "OH, you mean xfs_scrub!"

--D

> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@xxxxxxxxxxxxx