Re: long hangs when deleting large directories (3.0-rc3)

Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx> · Wed, 22 Jun 2011 09:40:43 +0200

On 2011.06.22 at 17:30 +1000, Dave Chinner wrote:
> On Wed, Jun 22, 2011 at 09:06:47AM +0200, Markus Trippelsdorf wrote:
> > On 2011.06.22 at 10:04 +1000, Dave Chinner wrote:
> > > On Tue, Jun 21, 2011 at 08:57:01PM +0200, Markus Trippelsdorf wrote:
> > > > On 2011.06.21 at 20:24 +0200, Markus Trippelsdorf wrote:
> > > > > On 2011.06.21 at 10:02 +0200, Markus Trippelsdorf wrote:
> > > > > > On 2011.06.21 at 14:25 +1000, Dave Chinner wrote:
> > > > > > > That is, you really need to get a profile of the rm -rf process - or
> > > > > > > whatever is consuming the CPU time - (e.g. via perf or ftrace)
> > > > > > > across the hang to so we can narrow down the potential cause of the
> > > > > > > latency. Speaking of which, latencytop might be helpful in
> > > > > > > identifying where input is getting held up....
> > > > > > 
> > > > > > I've recorded a profile with "perf record -g /home/markus/rm_sync"
> > > > > > ~ % cat rm_sync
> > > > > > rm -fr /mnt/tmp/tmp/linux && sync
> > > > > 
> > > > > FWIW here are two links to svg time-charts produced by:
> > > > > 
> > > > > perf timechart record /home/markus/rm_sync
> > > > > 
> > > > > http://trippelsdorf.de/timechart1.svg
> > > > > http://trippelsdorf.de/timechart2.svg
> > > > > 
> > > > 
> > > > And this is what the mysterious kworker is doing during the sync.
> > > > It's the one consuming most of the CPU time.
> > > > 
> > > >     39.96%      kworker/3:0  [kernel.kallsyms]                            0xffffffff811da9da k [k] xfs_trans_ail_update_bulk
> > > >                 |
> > > >                 --- xfs_trans_ail_update_bulk
> > > >                     xfs_trans_committed_bulk
> > > >                     xlog_cil_committed
> > > >                     xlog_state_do_callback
> > > >                     xlog_state_done_syncing
> > > >                     xlog_iodone
> > > >                     xfs_buf_iodone_work
> > > >                     process_one_work
> > > >                     worker_thread
> > > >                     kthread
> > > >                     kernel_thread_helper
> > > 
> > > So that is inserting items into the AIL at transaction completion.
> > > That can consume lots of CPU time if the CIL checkpoints are being
> > > flushed quickly enough. Given you are doing a rm -rf at this point
> > > in time, I'd expect to see this trace present in the profile, though
> > > maybe not at that extent.
> > > 
> > > I have seen this before but have never been able to it reproduce
> > > reliably.  If checkpoints are completed "out of order" due to the
> > > way the commit records are written into the iclogs. This can cause
> > > extra CPU because the AIL insertion then has to skip over all the
> > > items that were inserted out of order before splicing each group of
> > > items into the AIL. I only rarely see this (maybe once every couple
> > > of weeks of performance testing), so I'm not sure it's the problem
> > > you are seeing.
> > > 
> > > Adding this debug to xfs_ail_splice() list walk will tell us if this is
> > > happening and how many items it had to walk when you see a hang:
> > > 
> > >  	xfs_lsn_t       lsn)
> > >  {
> > >  	xfs_log_item_t  *next_lip;
> > > +	int walked = 0;
> > >  
> > >  	/* If the list is empty, just insert the item.  */
> > >  	if (list_empty(&ailp->xa_ail)) {
> > >  		list_splice(list, &ailp->xa_ail);
> > >  		return;
> > >  	}
> > >  
> > >  	list_for_each_entry_reverse(next_lip, &ailp->xa_ail, li_ail) {
> > >  		if (XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0)
> > >  			break;
> > > +		if (!walked++) {
> > > +			xfs_warn(ailp->xa_mount,
> > > +			 "ail: ooo splice, tail 0x%llx, item 0x%llx\n",
> > > +				next_lip->li_lsn, lsn);
> > > +		}
> > >  	}
> > > +	if (walked > 10) {
> > > +		xfs_warn(ailp->xa_mount,
> > > +			 "ail: ooo splice, walked %d items\n", walked);
> > > +	}
> > >  
> > >  	ASSERT(&next_lip->li_ail == &ailp->xa_ail ||
> > >  	       XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0);
> > > 
> > > That will at least tell us if this is the cause of your problem. If
> > > it is, I think I know how to avoid most of the list walk overhead
> > > fairly easily and that should avoid the need to change workqueue
> > > configurations at all.
> > 
> > The kernel log is attached.
> 
> Ok, so that is the cause of the problem∵ THe 3 seconds of output
> where it is nothing but:
> 
> Jun 22 08:53:09 x4 kernel: XFS (sdb1): ail: ooo splice, tail 0x12000156e7, item 0x12000156e6
> Jun 22 08:53:09 x4 kernel: XFS (sdb1): ail: ooo splice, walked 15503 items      
> .....
> Jun 22 08:53:12 x4 kernel: XFS (sdb1): ail: ooo splice, tail 0x12000156e7, item 0x12000156e6
> Jun 22 08:53:12 x4 kernel: XFS (sdb1): ail: ooo splice, walked 16945 items
> 
> Interesting is the LSN of the tail - it's only one sector further on
> than the items being inserted. That's what I'd expect from a commit
> record write race between two checkpoints. I'll have a deeper look
> into whether this can be avoided later tonight and also whether I
> can easily implement a "last insert cursor" easily so subsequent
> inserts at the same LSN avoid the walk....
> 
> > > > The following patch fixes the problem for me.
> .....
> > > There are two different things you can try with the wq
> > > initialisation that might help prevent the problem. Firstly, try
> > > this:
> > > 
> > > -					WQ_MEM_RECLAIM | WQ_HIGHPRI, 1);
> > > +					WQ_MEM_RECLAIM | WQ_HIGHPRI, 8);
> > > 
> > > To change the number of concurrent per-cpu work items that can be
> > > processed on the work queue. I don't think that will fix the
> > > inversion, but it may allow more concurrency which hides the
> > > inversion.
> > > 
> > > The other thing you can try is:
> > > 
> > > -					WQ_MEM_RECLAIM | WQ_HIGHPRI, 1);
> > > +					WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_CPU_INTENSIVE, 1);
> > > 
> > > Which marks the workqueue as one that will consume a lot of CPU and
> > > hence it is scheduled differently and hence should avoid other
> > > pending work from being starved. We use this WQ_CPU_INTENSIVE flag
> > > on other XFS workqueues that are known to consume lots of CPU, so I
> > > suspect this is the right thing to do here.
> > 
> > Yes, that was the next combination I've tested after WQ_UNBOUND and it
> > solves the issue, too.
> 
> Just to be clear, which combination did you test? The
> WQ_CPU_INTENSIVE one?

Sorry, the "WQ_HIGHPRI | WQ_CPU_INTENSIVE" one.

-- 
Markus

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs