On Wednesday, August 21, 2019 10:12 PM Brian Foster wrote: > On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote: > > The following call trace is seen when executing generic/530 on a ppc64le > > machine, > > > > Could you provide xfs_info of the scratch device that reproduces this > problem? I'm curious because I don't recall seeing this, but I'm also > not sure I've run those iunlink tests on ppc64... # xfs_info /dev/loop1 meta-data=/dev/loop1 isize=512 agcount=4, agsize=159648 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 data = bsize=4096 blocks=638592, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=3693, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 > > > INFO: task mount:7722 blocked for more than 122 seconds. > > Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6 > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > mount D 8448 7722 7490 0x00040008 > > Call Trace: > > [c000000629343210] [0000000000000001] 0x1 (unreliable) > > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490 > > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50 > > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0 > > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420 > > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0 > > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310 > > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0 > > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0 > > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0 > > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300 > > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0 > > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0 > > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0 > > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0 > > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130 > > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130 > > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0 > > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0 > > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0 > > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0 > > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40 > > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0 > > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160 > > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20 > > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180 > > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30 > > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70 > > > > i.e. the mount task gets hung indefinitely due to the following sequence > > of events, > > > > 1. Test creates lots of unlinked temp files and then shutsdown the > > filesystem. > > 2. During mount, a transaction started in the context of processing > > unlinked inode list causes several iclogs to be filled up. All but > > the last one is submitted for I/O. > > So the inactivation transaction commits should basically start to > populate the CIL. At some point we cross a threshold and a commit > triggers a background CIL push (see xlog_cil_push_background()). I'm > assuming that occurs somewhere in here because otherwise we wouldn't > have started filling iclogs.. > > > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have > > 18532 bytes of free space in the last iclog of the transaction which is > > greater than 2*sizeof(xlog_op_header_t). Hence > > xlog_state_get_iclog_space() does not switch over to using a newer iclog. > > Ok, so we've sent a bunch of iclogs to disk and the commit record for > the checkpoint ends up in the current iclog log, which remains active on > return from xlog_commit_record() -> xlog_write() etc. > > > 4. Meanwhile, the endio code processing iclogs of the transaction do not > > insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS > > hasn't been submitted for I/O yet. Hence a major part of the on-disk > > log cannot be freed yet. > > Ok, so that final (still active) iclog holding the commit record is also > holding the ctx. > > > 5. A new request for log space (via xfs_log_reserve()) will now wait > > indefinitely for on-disk log space to be freed. > > > > To fix this issue, before waiting for log space to be freed, this commit > > now submits xlog->l_iclog for write I/O if iclog->ic_state is > > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes > > AIL list to be populated and a later call to xlog_grant_push_ail() will > > free up the on-disk log space. > > > > Interesting, so I think I follow the issue at least and what the code is > trying to do to fix it. I'm not totally clear on what the ideal approach > is as this code is complex and I'd need to think about it some more. My > first thought was perhaps whether we needed a checkpoint size limit or > to change the background push threshold or some such, but the more I > think about that the less I think that fixes anything. > > A follow up thought is that it seems somewhat unfortunate that we can > build up and hold so much reservation on essentially a single commit > record and then just let it sit around in-core waiting for somebody else > to come along and ask for reservation for something unrelated. I think a > background log force should eventually come around and push that last > record out, so we're not in that state indefinitely, but IIUC that's a > lot of log consumption essentially unaccounted for until the associated > items land in the AIL. > > I'm wondering if we should have some measure (if we don't already) of > how much reservation is tied into a single iclog and use that as an > additional metric to determine when to switch iclogs a bit more > aggressively (as opposed to only based on how much space is physically > left in the iclog buffer). For example, if we end a huge checkpoint with > a commit record in an active iclog that has a cil context attached > associated with a hundreds of MBs (handwaving) of log data, perhaps we > should just switch out of that iclog before the CIL push returns so that > consumption can be immediately reflected by the AIL. > > Anyways, perhaps there are other thoughts/ideas on this.. > > Brian > > > Signed-off-by: Chandan Rajendra <chandanrlinux@xxxxxxxxx> > > --- > > fs/xfs/xfs_log.c | 21 +++++++++++++++++++++ > > 1 file changed, 21 insertions(+) > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > > index 00e9f5c388d3..dc785a6b9f47 100644 > > --- a/fs/xfs/xfs_log.c > > +++ b/fs/xfs/xfs_log.c > > @@ -236,11 +236,32 @@ xlog_grant_head_wait( > > int need_bytes) __releases(&head->lock) > > __acquires(&head->lock) > > { > > + struct xlog_in_core *iclog; > > + > > list_add_tail(&tic->t_queue, &head->waiters); > > > > do { > > if (XLOG_FORCED_SHUTDOWN(log)) > > goto shutdown; > > + > > + if (xfs_ail_min(log->l_ailp) == NULL) { > > + spin_lock(&log->l_icloglock); > > + iclog = log->l_iclog; > > + > > + if (iclog->ic_state == XLOG_STATE_ACTIVE > > + && iclog->ic_offset) { > > + atomic_inc(&iclog->ic_refcnt); > > + xlog_state_want_sync(log, iclog); > > + spin_unlock(&log->l_icloglock); > > + xlog_state_release_iclog(log, iclog); > > + > > + spin_lock(&log->l_icloglock); > > + xlog_wait(&iclog->ic_force_wait, &log->l_icloglock); > > + } else { > > + spin_unlock(&log->l_icloglock); > > + } > > + } > > + > > xlog_grant_push_ail(log, need_bytes); > > > > __set_current_state(TASK_UNINTERRUPTIBLE); > -- chandan