Re: Debug of xfstest 234 hang

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 11 Nov 2011 09:19:37 +1100

On Thu, Nov 10, 2011 at 03:16:33PM -0600, Chandra Seetharaman wrote:
> Hi Dave, Alex,
> 
> Debugging using trace, crash and systemtap, I found that the hang
> happens when xfs_sync_worker() (thru kworker) gets stuck in xlog_wait()
> while reserving a transaction log buffer for the dummy log.

Yes, that much I already knew from stack traces posted.

> I also found that even though xfsaild_push() keeps getting invoked, it
> doesn't do anything to push the log to the disk, since the
> ailp->xa_target has not been changed since it has been called from the
> process stack a while back.

Which is normal behaviour - if the log head/tail have't moved, then the
calculated push target (25 % of log space free) will not change
during the transaction reservation.

> So, I thought, resetting the target to the max value would help nudge
> the flow of ail to the disk. So, I added the following code. 
> ------------------
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index ed9252b..f59fd9f 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -534,6 +534,10 @@ out_done:
>  		ailp->xa_last_pushed_lsn = 0;
>  	}
>  
> +	lsn = xfs_ail_max_lsn(ailp);
> +	smp_wmb();
> +	xfs_trans_ail_copy_lsn(ailp, &ailp->xa_target, &lsn);
> +	smp_wmb();
>  	return tout;
>  }
> --------------------
> 
> and it seem to do the magic.

which means anyone who pushes on the AIL *always empties it*. That's
extremely bad from a performance persepctive because it means we no
longer have any concept of delaying metadata writeback until it is
really necessary.

Basically all this does is prevent the log space usage from hanging
around the threshold point that causes the hang. i.e. instead of
keeping the log roughly 25% free, this change means we get to 75%
full then try to empty it completely. The hang is occurring due to
some as yet unknown flaw in the threshold logic, so all your change
does is make it harder to hit because the filesysetm spends less
time at the threshold.

A better way to acheive this woulxp dbe to push the AIL in
xfs_sync_worker() before trying to cover the log. i.e. change the
code to do this:

 	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
+		/* start pushing all the metadata that is currently dirty */
+		xfs_ail_push_all(mp->m_ail);
+
 		/* dgc: errors ignored here */
 		if (mp->m_super->s_frozen == SB_UNFROZEN &&
 		    xfs_log_need_covered(mp))
 			error = xfs_fs_log_dummy(mp);
 		else
 			xfs_log_force(mp, 0);
 		error = xfs_qm_sync(mp, SYNC_TRYLOCK);
-
-		/* start pushing all the metadata that is currently dirty */
-		xfs_ail_push_all(mp->m_ail);
 	}

 	/* queue us up again */

This will guarantee the push target is moved before the sync worker
tries to do a transaction reservation and potentially hangs waiting
for space.

However, this is still just a workaround for whatever threshold
value problem we have that is the root cause of the problem. Hence
what I'd really like to know is where the grant head/log tail values
are at this point when the hang occurs (the log event tracing has
all the necessary values) so we can get an idea of the condition
that we are stuck in. Being able to view the progression of log
reservations and releases leading up to the hang (once again from
the event trace) might point out exactly what condition we are
getting stuck in, and that's what we nee dto understand first before
trying to fix the problem.

The log reserve tracepoints give output like this:

           <...>-8903  [002] 241733.544531: xfs_log_reserve:      dev 253:16 type QM_SETQLIM t_ocnt 1 t_cnt 1 t_curr_res 2332 t_unit_res 2332 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 105116 grant_write_cycle 1 grant_write_bytes 105116 curr_cycle 1 curr_block 23 tail_cycle 1 tail_block 2
           <...>-8903  [002] 241733.544531: xfs_log_grant_enter:  dev 253:16 type QM_SETQLIM t_ocnt 1 t_cnt 1 t_curr_res 2332 t_unit_res 2332 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 105116 grant_write_cycle 1 grant_write_bytes 105116 curr_cycle 1 curr_block 23 tail_cycle 1 tail_block 2
           <...>-8903  [002] 241733.544531: xfs_log_grant_exit:   dev 253:16 type QM_SETQLIM t_ocnt 1 t_cnt 1 t_curr_res 2332 t_unit_res 2332 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 107448 grant_write_cycle 1 grant_write_bytes 107448 curr_cycle 1 curr_block 23 tail_cycle 1 tail_block 2

which show all the information about the grant/write head locations,
the current tail location, the transaction type and reservation,
etc. When the transaction is committed we then see:

           <...>-8894  [002] 241733.544536: xfs_log_done_nonperm: dev 253:16 type QM_SETQLIM t_ocnt 1 t_cnt 1 t_curr_res 2180 t_unit_res 2332 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 107448 grant_write_cycle 1 grant_write_bytes 107448 curr_cycle 1 curr_block 23 tail_cycle 1 tail_block 2
           <...>-8894  [002] 241733.544536: xfs_log_ungrant_enter: dev 253:16 type QM_SETQLIM t_ocnt 1 t_cnt 0 t_curr_res 2180 t_unit_res 2332 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 107448 grant_write_cycle 1 grant_write_bytes 107448 curr_cycle 1 curr_block 23 tail_cycle 1 tail_block 2
           <...>-8894  [002] 241733.544536: xfs_log_ungrant_sub:  dev 253:16 type QM_SETQLIM t_ocnt 1 t_cnt 0 t_curr_res 2180 t_unit_res 2332 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 107448 grant_write_cycle 1 grant_write_bytes 107448 curr_cycle 1 curr_block 23 tail_cycle 1 tail_block 2
           <...>-8894  [002] 241733.544537: xfs_log_ungrant_exit: dev 253:16 type QM_SETQLIM t_ocnt 1 t_cnt 0 t_curr_res 2180 t_unit_res 2332 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 105268 grant_write_cycle 1 grant_write_bytes 105268 curr_cycle 1 curr_block 23 tail_cycle 1 tail_block 2

Where the remaining reservation is returned to the log.

>From these traces, we can see how much space is being used by each
transaction (t_unit_res - t_curr_res), whether the grant head/write
head change apprpropriately, whether there are queued waiters, where
the tail is, where the last iclog was written to, etc. By looking at
the various values, the threshold problem might become obvious.  I
tend to write quick throw-away awk scripts to extract the necessary
fields into a plottable formats and run those through something like
gnuplot or kst to see if there is any visible artifacts that might
be the cause of the problem...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs