On 2011.06.22 at 10:04 +1000, Dave Chinner wrote:
> On Tue, Jun 21, 2011 at 08:57:01PM +0200, Markus Trippelsdorf wrote:
> > On 2011.06.21 at 20:24 +0200, Markus Trippelsdorf wrote:
> > > On 2011.06.21 at 10:02 +0200, Markus Trippelsdorf wrote:
> > > > On 2011.06.21 at 14:25 +1000, Dave Chinner wrote:
> > > > > That is, you really need to get a profile of the rm -rf process - or
> > > > > whatever is consuming the CPU time - (e.g. via perf or ftrace)
> > > > > across the hang so we can narrow down the potential cause of the
> > > > > latency. Speaking of which, latencytop might be helpful in
> > > > > identifying where input is getting held up....
> > > >
> > > > I've recorded a profile with "perf record -g /home/markus/rm_sync"
> > > >
> > > > ~ % cat rm_sync
> > > > rm -fr /mnt/tmp/tmp/linux && sync
> > >
> > > FWIW here are two links to svg time-charts produced by:
> > >
> > > perf timechart record /home/markus/rm_sync
> > >
> > > http://trippelsdorf.de/timechart1.svg
> > > http://trippelsdorf.de/timechart2.svg
> >
> > And this is what the mysterious kworker is doing during the sync.
> > It's the one consuming most of the CPU time.
> >
> >  39.96%  kworker/3:0  [kernel.kallsyms]  0xffffffff811da9da  [k] xfs_trans_ail_update_bulk
> >          |
> >          --- xfs_trans_ail_update_bulk
> >              xfs_trans_committed_bulk
> >              xlog_cil_committed
> >              xlog_state_do_callback
> >              xlog_state_done_syncing
> >              xlog_iodone
> >              xfs_buf_iodone_work
> >              process_one_work
> >              worker_thread
> >              kthread
> >              kernel_thread_helper
>
> So that is inserting items into the AIL at transaction completion.
> That can consume lots of CPU time if the CIL checkpoints are being
> flushed quickly enough. Given you are doing a rm -rf at this point
> in time, I'd expect to see this trace present in the profile, though
> maybe not to that extent.
>
> I have seen this before but have never been able to reproduce it
> reliably. It can happen if checkpoints are completed "out of order" due
> to the way the commit records are written into the iclogs. This can
> cause extra CPU time because the AIL insertion then has to skip over all
> the items that were inserted out of order before splicing each group of
> items into the AIL. I only rarely see this (maybe once every couple
> of weeks of performance testing), so I'm not sure it's the problem
> you are seeing.
>
> Adding this debug to the xfs_ail_splice() list walk will tell us if this
> is happening and how many items it had to walk when you see a hang:
>
> 	xfs_lsn_t	lsn)
> {
> 	xfs_log_item_t	*next_lip;
> +	int		walked = 0;
>
> 	/* If the list is empty, just insert the item. */
> 	if (list_empty(&ailp->xa_ail)) {
> 		list_splice(list, &ailp->xa_ail);
> 		return;
> 	}
>
> 	list_for_each_entry_reverse(next_lip, &ailp->xa_ail, li_ail) {
> 		if (XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0)
> 			break;
> +		if (!walked++) {
> +			xfs_warn(ailp->xa_mount,
> +				"ail: ooo splice, tail 0x%llx, item 0x%llx\n",
> +				next_lip->li_lsn, lsn);
> +		}
> 	}
> +	if (walked > 10) {
> +		xfs_warn(ailp->xa_mount,
> +			"ail: ooo splice, walked %d items\n", walked);
> +	}
>
> 	ASSERT(&next_lip->li_ail == &ailp->xa_ail ||
> 	       XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0);
>
> That will at least tell us if this is the cause of your problem. If
> it is, I think I know how to avoid most of the list walk overhead
> fairly easily and that should avoid the need to change workqueue
> configurations at all.

The kernel log is attached.
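To make the list walk Dave describes above a bit more concrete, here is a small
stand-alone sketch - plain userspace C, not XFS code, with made-up LSN values -
of the same insertion pattern: items are spliced into the AIL in LSN order by
walking backwards from the highest LSN, so a checkpoint that completes out of
order has to be walked past every higher-LSN item already in the list.

#include <stdio.h>
#include <stdlib.h>

/* One AIL entry: just an LSN and its list linkage. */
struct item {
	unsigned long long	lsn;
	struct item		*prev, *next;
};

/* Circular doubly linked list with a sentinel; items kept in ascending LSN order. */
static struct item ail = { 0, &ail, &ail };

/*
 * Insert a new item in LSN order by walking backwards from the highest
 * LSN, the same shape as the list_for_each_entry_reverse() loop in the
 * quoted xfs_ail_splice() hunk.  Returns how many items had to be
 * skipped before the insertion point was found.
 */
static int ail_insert(unsigned long long lsn)
{
	struct item *pos, *new = malloc(sizeof(*new));
	int walked = 0;

	new->lsn = lsn;
	for (pos = ail.prev; pos != &ail; pos = pos->prev) {
		if (pos->lsn <= lsn)	/* item at or below us: insert after it */
			break;
		walked++;
	}
	new->prev = pos;
	new->next = pos->next;
	pos->next->prev = new;
	pos->next = new;
	return walked;
}

int main(void)
{
	/* Checkpoints completing in order: every insert lands right where the
	 * reverse walk starts, so nothing has to be skipped. */
	printf("lsn 100: walked %d\n", ail_insert(100));
	printf("lsn 200: walked %d\n", ail_insert(200));
	printf("lsn 300: walked %d\n", ail_insert(300));

	/* A checkpoint completing out of order: its lower LSN means the walk
	 * has to skip every higher-LSN item before it can be spliced in,
	 * which is where the extra CPU time in the profile comes from. */
	printf("lsn 150: walked %d\n", ail_insert(150));
	return 0;
}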
> > The following patch fixes the problem for me.
> >
> > diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> > index 5e68099..2f34efd 100644
> > --- a/fs/xfs/linux-2.6/xfs_buf.c
> > +++ b/fs/xfs/linux-2.6/xfs_buf.c
> > @@ -1856,7 +1856,7 @@ xfs_buf_init(void)
> >  		goto out;
> >
> >  	xfslogd_workqueue = alloc_workqueue("xfslogd",
> > -					WQ_MEM_RECLAIM | WQ_HIGHPRI, 1);
> > +					WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> >  	if (!xfslogd_workqueue)
> >  		goto out_free_buf_zone;
>
> That implies the source of your X/ssh hangs is a workqueue priority
> inversion.
>
> We need the xfslogd to run with WQ_HIGHPRI because we can deadlock
> in IO completion if it is blocked by xfsdatad/xfsconvertd wq
> operations. Hence we need to process the logd workqueue ahead of
> data IO completions, and we do that by making log IO completion
> high priority.
>
> The thing is, the CMWQ infrastructure only has one level of "high
> priority" - we can't define any other sort of dependency between
> various queues - and that queue is simply a FIFO. IOWs, CMWQ will
> process anything that is high priority ahead of -all- other normal
> priority work items. That means that if you get a high priority
> workqueue with lots to do and consuming CPU (as we are seeing here),
> it will starve all other work items that are queued on that CPU.
>
> So what you end up seeing is interactive-process-related work being
> queued on CPU 3, which is then starved of processing by XFS IO
> completion work on the same CPU. By setting the xfslogd workqueue to
> unbound, it no longer takes precedence over any of the other work
> and hence does not starve the queued interactive-process-related work.
> That's why the patch "fixes" your 1-2s hangs while log IO is in
> progress, but it will eventually deadlock and hang the filesystem
> permanently.
>
> There are two different things you can try with the wq
> initialisation that might help prevent the problem. Firstly, try
> this:
>
> -				WQ_MEM_RECLAIM | WQ_HIGHPRI, 1);
> +				WQ_MEM_RECLAIM | WQ_HIGHPRI, 8);
>
> to change the number of concurrent per-cpu work items that can be
> processed on the work queue. I don't think that will fix the
> inversion, but it may allow more concurrency, which hides the
> inversion.
>
> The other thing you can try is:
>
> -				WQ_MEM_RECLAIM | WQ_HIGHPRI, 1);
> +				WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_CPU_INTENSIVE, 1);
>
> which marks the workqueue as one that will consume a lot of CPU. It
> is then scheduled differently and so should keep other pending work
> from being starved. We use this WQ_CPU_INTENSIVE flag on other XFS
> workqueues that are known to consume lots of CPU, so I suspect this
> is the right thing to do here.

Yes, that was the next combination I tested after WQ_UNBOUND, and it
solves the issue, too.

-- 
Markus
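For reference, this is what the combination Markus ended up on would look like
in xfs_buf_init(). It is only the quoted hunk written out in full, with comments
paraphrasing Dave's explanation of what each flag buys; a sketch for
illustration, not a patch taken from the tree.

	xfslogd_workqueue = alloc_workqueue("xfslogd",
			WQ_MEM_RECLAIM |	/* guarantee a rescuer thread so log IO
						 * completion can make progress during
						 * memory reclaim */
			WQ_HIGHPRI |		/* run ahead of xfsdatad/xfsconvertd so
						 * log IO completion cannot deadlock
						 * behind data IO completion */
			WQ_CPU_INTENSIVE,	/* AIL insertion can burn a lot of CPU,
						 * so schedule it such that it does not
						 * starve other per-cpu work items */
			1);			/* max_active: at most one work item
						 * in flight per CPU */
	if (!xfslogd_workqueue)
		goto out_free_buf_zone;

Whether max_active also wants bumping (the first thing Dave suggested) is a
separate question; per Markus, adding WQ_CPU_INTENSIVE alone was enough to cure
the hangs here.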
Attachment: kern.log.bz2 (BZip2 compressed data)