On Tue, Jun 21, 2011 at 08:57:01PM +0200, Markus Trippelsdorf wrote:
> On 2011.06.21 at 20:24 +0200, Markus Trippelsdorf wrote:
> > On 2011.06.21 at 10:02 +0200, Markus Trippelsdorf wrote:
> > > On 2011.06.21 at 14:25 +1000, Dave Chinner wrote:
> > > > That is, you really need to get a profile of the rm -rf process - or
> > > > whatever is consuming the CPU time - (e.g. via perf or ftrace)
> > > > across the hang so we can narrow down the potential cause of the
> > > > latency. Speaking of which, latencytop might be helpful in
> > > > identifying where input is getting held up....
> > >
> > > I've recorded a profile with "perf record -g /home/markus/rm_sync"
> > > ~ % cat rm_sync
> > > rm -fr /mnt/tmp/tmp/linux && sync
> >
> > FWIW here are two links to svg time-charts produced by:
> >
> >   perf timechart record /home/markus/rm_sync
> >
> > http://trippelsdorf.de/timechart1.svg
> > http://trippelsdorf.de/timechart2.svg
>
> And this is what the mysterious kworker is doing during the sync.
> It's the one consuming most of the CPU time.
>
>   39.96%  kworker/3:0  [kernel.kallsyms]  0xffffffff811da9da k [k] xfs_trans_ail_update_bulk
>           |
>           --- xfs_trans_ail_update_bulk
>               xfs_trans_committed_bulk
>               xlog_cil_committed
>               xlog_state_do_callback
>               xlog_state_done_syncing
>               xlog_iodone
>               xfs_buf_iodone_work
>               process_one_work
>               worker_thread
>               kthread
>               kernel_thread_helper

So that is inserting items into the AIL at transaction completion. That can
consume lots of CPU time if the CIL checkpoints are being flushed quickly
enough. Given you are doing a rm -rf at this point in time, I'd expect to
see this trace present in the profile, though maybe not to that extent.

I have seen this before but have never been able to reproduce it reliably.
Checkpoints can be completed "out of order" due to the way the commit
records are written into the iclogs. When that happens, the AIL insertion
burns extra CPU because it has to skip over all the items that were inserted
out of order before splicing each group of items into the AIL. I only
rarely see this (maybe once every couple of weeks of performance testing),
so I'm not sure it's the problem you are seeing.

Adding this debug to the xfs_ail_splice() list walk will tell us if this is
happening and how many items it had to walk when you see a hang:

 	xfs_lsn_t	lsn)
 {
 	xfs_log_item_t	*next_lip;
+	int		walked = 0;

 	/* If the list is empty, just insert the item. */
 	if (list_empty(&ailp->xa_ail)) {
 		list_splice(list, &ailp->xa_ail);
 		return;
 	}

 	list_for_each_entry_reverse(next_lip, &ailp->xa_ail, li_ail) {
 		if (XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0)
 			break;
+		if (!walked++) {
+			xfs_warn(ailp->xa_mount,
+				"ail: ooo splice, tail 0x%llx, item 0x%llx\n",
+				next_lip->li_lsn, lsn);
+		}
 	}
+	if (walked > 10) {
+		xfs_warn(ailp->xa_mount,
+			"ail: ooo splice, walked %d items\n", walked);
+	}

 	ASSERT(&next_lip->li_ail == &ailp->xa_ail ||
 	       XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0);

That will at least tell us if this is the cause of your problem. If it is, I
think I know how to avoid most of the list walk overhead fairly easily, and
that should avoid the need to change workqueue configurations at all.
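Just to illustrate what I mean by avoiding the walk - here's a rough
userspace sketch, not a patch, and all the names in it (ail, ail_item,
ail_insert) are invented for the demo. The idea is simply that if the splice
remembers where the last item went, a run of in-order insertions only has to
step past the out-of-order region once rather than once per item:

/*
 * ail_cursor_sketch.c - userspace toy, not kernel code.  Only the idea of
 * an LSN-sorted list with out-of-order arrivals is taken from the AIL.
 */
#include <stdio.h>
#include <stdlib.h>

struct ail_item {
	long long		lsn;		/* stand-in for li_lsn */
	struct ail_item		*prev, *next;
};

struct ail {
	struct ail_item		head;		/* circular list sentinel */
};

static void ail_init(struct ail *ail)
{
	ail->head.prev = ail->head.next = &ail->head;
}

/*
 * Insert @lip in LSN order.  @cursor remembers where the previous insert
 * landed, so a batch of ascending LSNs does not re-walk the whole
 * out-of-order tail region for every item.
 */
static void ail_insert(struct ail *ail, struct ail_item *lip,
		       struct ail_item **cursor)
{
	struct ail_item *pos = *cursor ? *cursor : ail->head.prev;

	/* back up while we are past the insertion point ... */
	while (pos != &ail->head && pos->lsn > lip->lsn)
		pos = pos->prev;
	/* ... and move forward while the next item still sorts before @lip */
	while (pos->next != &ail->head && pos->next->lsn <= lip->lsn)
		pos = pos->next;

	/* splice @lip in after @pos */
	lip->prev = pos;
	lip->next = pos->next;
	pos->next->prev = lip;
	pos->next = lip;
	*cursor = lip;
}

int main(void)
{
	/* LSNs 100-102 complete first; the 50-53 checkpoint completes late */
	long long lsns[] = { 100, 101, 102, 50, 51, 52, 53 };
	struct ail ail;
	struct ail_item *cursor = NULL;
	size_t i;

	ail_init(&ail);
	for (i = 0; i < sizeof(lsns) / sizeof(lsns[0]); i++) {
		struct ail_item *item = calloc(1, sizeof(*item));
		item->lsn = lsns[i];
		ail_insert(&ail, item, &cursor);
	}

	/* prints: 50 51 52 53 100 101 102 */
	for (struct ail_item *p = ail.head.next; p != &ail.head; p = p->next)
		printf("%lld ", p->lsn);
	printf("\n");
	return 0;
}

That's only the shape of the idea - real AIL code would also have to deal
with cursor invalidation on item removal and the locking around it.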
> The following patch fixes the problem for me.
>
> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index 5e68099..2f34efd 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c
> @@ -1856,7 +1856,7 @@ xfs_buf_init(void)
>  		goto out;
>
>  	xfslogd_workqueue = alloc_workqueue("xfslogd",
> -					WQ_MEM_RECLAIM | WQ_HIGHPRI, 1);
> +					WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
>  	if (!xfslogd_workqueue)
>  		goto out_free_buf_zone;

That implies the source of your X/ssh hangs is a workqueue priority
inversion. We need the xfslogd to run with WQ_HIGHPRI because we can
deadlock in IO completion if it is blocked by xfsdatad/xfsconvertd wq
operations. Hence we need to process the logd workqueue ahead of data IO
completions, and we do that by making the log IO completion high priority.

The thing is, the CMWQ infrastructure only has one level of "high priority"
- we can't define any other sort of dependency between various queues - and
that queue is simply a FIFO. IOWs, CMWQ will process anything that is high
priority ahead of -all- other normal priority work items. That means that
if you get a high priority workqueue with lots to do and consuming CPU (as
we are seeing here), it will starve all other work items that are queued on
that CPU.

So what you end up seeing is interactive process related work being queued
on CPU 3 which is then being starved of processing by XFS IO completion
work on the same CPU. By setting the xfslogd workqueue to unbound, it no
longer takes precedence over any of the other work and hence does not
starve the queued interactive process related work. That's why the patch
"fixes" your 1-2s hangs while log IO is in progress, but it will eventually
deadlock and hang the filesystem permanently.

There are two different things you can try with the wq initialisation that
might help prevent the problem. Firstly, try this:

-					WQ_MEM_RECLAIM | WQ_HIGHPRI, 1);
+					WQ_MEM_RECLAIM | WQ_HIGHPRI, 8);

to change the number of concurrent per-cpu work items that can be processed
on the workqueue. I don't think that will fix the inversion, but it may
allow more concurrency which hides the inversion.

The other thing you can try is:

-					WQ_MEM_RECLAIM | WQ_HIGHPRI, 1);
+					WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_CPU_INTENSIVE, 1);

which marks the workqueue as one that will consume a lot of CPU, so it is
scheduled differently and should prevent other pending work from being
starved. We use the WQ_CPU_INTENSIVE flag on other XFS workqueues that are
known to consume lots of CPU, so I suspect this is the right thing to do
here.

Hopefully one of these things will point to a potential solution.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs