Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 23 Aug 2019 10:06:36 +1000

On Thu, Aug 22, 2019 at 12:34:46PM -0400, Brian Foster wrote:
> On Thu, Aug 22, 2019 at 08:18:34AM +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > > The following call trace is seen when executing generic/530 on a ppc64le
> > > machine,
> > > 
> > > INFO: task mount:7722 blocked for more than 122 seconds.
> > >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> > 
> > Can you reproduce this on 5.3-rc5? There were bugs in log recovery
> > IO in -rc1 that could result in things going wrong...
> > 
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > mount           D 8448  7722   7490 0x00040008
> > > Call Trace:
> > > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > > 
> > > i.e. the mount task gets hung indefinitely due to the following sequence
> > > of events,
> > > 
> > > 1. Test creates lots of unlinked temp files and then shuts down the
> > >    filesystem.
> > > 2. During mount, a transaction started in the context of processing
> > >    unlinked inode list causes several iclogs to be filled up. All but
> > >    the last one are submitted for I/O.
> > > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> > >    18532 bytes of free space in the last iclog of the transaction which is
> > >    greater than 2*sizeof(xlog_op_header_t). Hence
> > >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > > 4. Meanwhile, the endio code processing iclogs of the transaction does not
> > >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> > >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> > >    log cannot be freed yet.
> > 
> > So all those items are still pinned in memory.
> > 
> > > 5. A new request for log space (via xfs_log_reserve()) will now wait
> > >    indefinitely for on-disk log space to be freed.
> > 
> > Because nothing has issued a xfs_log_force() to write the iclog to
> > disk, unpin the objects that it pins in memory, and allow the tail
> > to be moved forwards.
> > 
> > The xfsaild normally takes care of this - it gets pushed by the
> > log reserve when there's not enough space in the log for the
> > transaction before transaction reserve goes to sleep in
> > xlog_grant_head_wait(). The AIL pushing code is then responsible for
> > making sure log space is eventually freed. It will issue log forces
> > if it isn't making progress and so this problem shouldn't occur.
> > 
> > So, why has it occurred?
> > 
> > The xfsaild kthread should be running at this point, so if it was
> > pushed it should be trying to empty the journal to move the tail
> > forward. Why hasn't it issued a log force?
> > 
> > 
> > > To fix this issue, before waiting for log space to be freed, this commit
> > > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > > free up the on-disk log space.
> > 
> > hmmm.
> > 
> > > Signed-off-by: Chandan Rajendra <chandanrlinux@xxxxxxxxx>
> > > ---
> > >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> > >  1 file changed, 21 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index 00e9f5c388d3..dc785a6b9f47 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> > >  	int			need_bytes) __releases(&head->lock)
> > >  					    __acquires(&head->lock)
> > >  {
> > > +	struct xlog_in_core	*iclog;
> > > +
> > >  	list_add_tail(&tic->t_queue, &head->waiters);
> > >  
> > >  	do {
> > >  		if (XLOG_FORCED_SHUTDOWN(log))
> > >  			goto shutdown;
> > > +
> > > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> > 
> > This is indicative of the situation. If the AIL is empty, and the
> > log does not have room for an entire transaction reservation, then
> > we need to be issuing synchronous transactions in recovery until
> > such time the AIL pushing can actually function correctly to
> > guarantee forwards progress for async transaction processing.
> > 
> 
> Hmm, I don't think that addresses the fundamental problem. This
> phenomenon doesn't require log recovery. The same scenario can present
> itself after a clean mount or from an idle fs. I think the scenario that
> plays out here, at a high level, is as follows:
> 

  - mount
  - [log recovery]
  - xfs_log_mount_finish
    - calls xfs_log_work_queue()

> - Heavy transaction workload commences. This continuously acquires log
>   reservation and transfers it to the CIL as transactions commit.
> - The CIL context grows until we cross the background threshold, at
>   which point we schedule a background push.
> - Background CIL push cycles the current context into the log via the
>   iclog buffers. The commit record stays around in-core because the last
>   iclog used for the CIL checkpoint isn't full. Hence, none of the
>   associated log items make it into the AIL and the background CIL push
>   had no effect with respect to freeing log reservation.
> - The same transaction workload is still running and filling up the next
>   CIL context. If we run out of log reservation before a second
>   background CIL push comes along, we're basically stuck waiting on
>   somebody to force the log.

- every 30s xfs_log_worker() runs, sees the log dirty, triggers
  a log force. pending commit is flushed to log, dirty objects get
  moved to AIL, then xfs_log_worker() pushes on the AIL to do
  periodic background metadata writeback.
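
To make that sequence concrete, here is a toy userspace model of it.
This is only a sketch and not kernel code: the struct, the toy_*()
helpers and the byte counts are all made up for illustration, and it
assumes the whole checkpoint stays pinned until the commit record's
iclog is written.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not an XFS symbol. */
struct toy_log {
	int  free_space;     /* log space not yet reserved */
	int  cil_bytes;      /* reservation currently held by the CIL */
	int  iclog_bytes;    /* checkpoint whose commit record is still in-core */
	int  ail_bytes;      /* space pinned by items sitting in the AIL */
	bool commit_in_core; /* commit record written, iclog not submitted */
};

/* Transaction reserve + commit: space moves from the log to the CIL. */
bool toy_reserve(struct toy_log *log, int bytes)
{
	assert(bytes > 0);
	if (log->free_space < bytes)
		return false;	/* caller would sleep waiting for log space */
	log->free_space -= bytes;
	log->cil_bytes += bytes;
	return true;
}

/* Background CIL push: formats the checkpoint but frees no log space. */
void toy_cil_push(struct toy_log *log)
{
	log->iclog_bytes += log->cil_bytes;
	log->cil_bytes = 0;
	log->commit_in_core = true;
}

/* Log force: the commit record reaches disk, items move into the AIL. */
void toy_log_force(struct toy_log *log)
{
	if (!log->commit_in_core)
		return;
	log->ail_bytes += log->iclog_bytes;
	log->iclog_bytes = 0;
	log->commit_in_core = false;
}

/* AIL push: writes back AIL items and returns the space it freed. */
int toy_ail_push(struct toy_log *log)
{
	int freed = log->ail_bytes;

	log->free_space += freed;
	log->ail_bytes = 0;
	return freed;
}
```

In this model, pushing the AIL before the log force frees nothing, so
a blocked reservation stays blocked until something calls
toy_log_force(), which is the role the 30s log worker plays above.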

> The things that prevent this at normal runtime are timely execution of
> background CIL pushes and the background log worker. If for some reason
> the background CIL push is not timely enough that we consume all log
> reservation before two background CIL pushes occur from the time the
> racing workload begins (i.e. starting from an idle system such that the
> AIL is empty), then we're stuck waiting on the background log worker to
> force the log from the first background CIL push, populate the AIL and
> get things moving again.

Right, this does not deadlock - it might pause for a short while
while waiting for the log worker to run and issue a log force. I
have never actually seen it happen in all my years of "mkfs; mount;
fsmark" testing that places a /massive/ metadata modification
workload on a pristine, newly mounted filesystem....

As it is, we've always used the log worker as a watchdog in this
way. The fact is that we have very few situations left in the code
where it needs to act as a watchdog - delayed logging actually
negated the vast majority of problems that required the periodic log
force to get out of trouble because individual transactions no
longer needed to wait on iclog space to make progress...

> IOW, the same essential problem is reproducible outside of log
> recovery in the form of stalls as opposed to deadlocks via an
> artificial background CIL push delay (i.e., think workqueue or
> xc_cil_ctx lock starvation) and an elevated xfssyncd_centisecs.
> We aren't stuck forever because the background log worker will run
> eventually, but it could certainly be a dead stall of minutes
> before that occurs.

I don't think addressing it in xlog_grant_head_wait() fixes the
problem fully, either.  If no other transaction comes in, then the
ones already blocked (because the AIL was not empty when they tried
to reserve space) will end up still blocked because nothing has
kicked the transaction reservation code again. So putting the
log force into the grant head wait code is not sufficient by itself.

> I think this
> could still be addressed at transaction commit or reservation time, but
> I think the logic needs to be more generic and based on log reservation
> pressure rather than the context from which this particular test happens
> to reproduce.

Log reservation pressure is what xlog_grant_push_ail() detects and
that pressure is transferred to the AIL pushing code to clean dirty
log items and move the tail of the log forward. It's right there
where Chandan added the log force. :)

IOWs, xlog_grant_push_ail() tells the xfsaild how much log space to
make available. If the log is full, then xlog_grant_push_ail() will
already be telling the AIL to push and will be waking it up.
However, the aild will see the AIL empty and go right back to sleep.
That's likely the runtime problem here - the mechanism that pushes
the log tail forwards is not realising that the log needs pushing
via a log force.

IOWs, I suspect the xfsaild is the right place to take the action,
because AIL pushing is triggered by much more than just log
reservations. It gets kicked by memory reclaim, the log worker, new
transactions, etc and so if a transaction doesn't kick it when it
gets stuck like this, something else will.
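
Roughly the decision I have in mind, as a toy userspace sketch. This
is not the real xfsaild code: the enum, the struct and both functions
are hypothetical, and the conditions are simplified to three flags.
It only contrasts "AIL empty, go back to sleep" with "AIL empty but
someone wants the tail moved, so force the log".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; this only models the decision. */
enum toy_aild_action {
	TOY_AILD_SLEEP,
	TOY_AILD_PUSH_ITEMS,
	TOY_AILD_FORCE_LOG,
};

struct toy_aild_state {
	bool ail_empty;           /* no items in the AIL */
	bool push_target_pending; /* someone asked for the tail to move */
	bool commit_in_core;      /* commit record sits in an unwritten iclog */
};

/* Behaviour described above: an empty AIL means straight back to sleep. */
enum toy_aild_action toy_aild_current(const struct toy_aild_state *s)
{
	assert(s);
	return s->ail_empty ? TOY_AILD_SLEEP : TOY_AILD_PUSH_ITEMS;
}

/* Suggested direction: an empty AIL plus a pending push forces the log. */
enum toy_aild_action toy_aild_with_force(const struct toy_aild_state *s)
{
	assert(s);
	if (!s->ail_empty)
		return TOY_AILD_PUSH_ITEMS;
	if (s->push_target_pending && s->commit_in_core)
		return TOY_AILD_FORCE_LOG;
	return TOY_AILD_SLEEP;
}
```

Because the check lives in the aild model and not in the reservation
path, anything that wakes the aild gets the same result, whether or
not a new transaction comes along.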

> If this is all on the right track, I'm still curious if/how you're
> getting into a situation where all log reservation is held up in the CIL
> before a couple background pushes occur.

I'd guess that it could be reproduced via a single CPU machine and
non-preempt kernel. We've already replayed all the unlink
transactions, so the buffer/inode caches are fully primed. If all
unlink removal transactions hit the buffer/inode cache, then it won't
block anywhere and will never yield the CPU. Hence the CIL push
kworker thread doesn't get to run before the unlinks run the log out
of space.
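
A toy model of that guess, purely as a sketch: one CPU, a task that
never blocks, and a push worker that can only run if the task is
preempted. The function, its parameters and the numbers are made up,
and a push here just resets the CIL counter without freeing any log
space.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of one CPU running the unlink loop. */
struct toy_result {
	int  txns_committed; /* unlink transactions that got log space */
	int  pushes_run;     /* times the CIL push worker got the CPU */
	bool push_pending;   /* worker still queued when the log ran out */
};

struct toy_result toy_run_unlinks(int log_space, int per_txn,
				  int push_threshold, bool can_preempt)
{
	struct toy_result r = { 0, 0, false };
	int cil = 0;

	assert(per_txn > 0);	/* otherwise the loop never terminates */

	/* Everything is cached, so the loop only stops when space runs out. */
	while (log_space >= per_txn) {
		log_space -= per_txn;
		cil += per_txn;
		r.txns_committed++;

		if (cil >= push_threshold)
			r.push_pending = true;	/* background push queued */

		/* The worker only runs if the unlink task can be preempted. */
		if (r.push_pending && can_preempt) {
			cil = 0;
			r.push_pending = false;
			r.pushes_run++;
		}
	}
	return r;
}
```

With 100 units of log space, 10 per transaction and a push threshold
of 30, the preemptible run gets three pushes in before the log is
full, while the non-preemptible run gets none and ends with the push
still queued.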

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx