On Fri, May 18, 2012 at 09:42:37AM -0500, Mark Tinguely wrote:
> On 05/18/12 05:10, Dave Chinner wrote:
> >Still, this doesn't explain the hang at all - the CIL forms a new
> >list every time a checkpoint occurs, and this corruption would cause
> >a crash trying to walk the li_lv list when pushed. So it comes back
> >to why hasn't the CIL been pushed? what does the CIL context
> >structure look like?
>
> The CIL context on the machine that was running 3+ days before hanging.
>
> struct xfs_cil_ctx {
>   cil = 0xffff88034a8c5240,
>   sequence = 1241833,
>   start_lsn = 0,
>   commit_lsn = 0,
>   ticket = 0xffff88034e0ebc08,
>   nvecs = 237,
>   space_used = 39964,
>   busy_extents = {
>     next = 0xffff88034b287958,
>     prev = 0xffff88034d10c698
>   },
>   lv_chain = 0x0,
>   log_cb = {
>     cb_next = 0x0,
>     cb_func = 0,
>     cb_arg = 0x0
>   },
>   committing = {
>     next = 0xffff88034c84d120,
>     prev = 0xffff88034c84d120
>   }
> }

And the struct xfs_cil itself?

> Start the cleaning of the log when still full after last clean.
> ---
>  fs/xfs/xfs_log.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: b/fs/xfs/xfs_log.c
> ===================================================================
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -191,8 +191,10 @@ xlog_grant_head_wake(
>
>  	list_for_each_entry(tic, &head->waiters, t_queue) {
>  		need_bytes = xlog_ticket_reservation(log, head, tic);
> -		if (*free_bytes < need_bytes)
> +		if (*free_bytes < need_bytes) {
> +			xlog_grant_push_ail(log, need_bytes);

Ok, so that means every time the log tail is moved or a transaction
completes and returns unused space to the grant head, it pushes the
AIL target along. But if we are hanging with an empty AIL, this is
not actually doing anything of note, just changing timing to make
whatever problem we have less common. I'd remove this patch to make
reproducing the problem easier....

We've almost certainly got a CIL hang, and it looks like it is being
caused by an accounting leak. i.e. if the CIL hasn't reached its
push threshold (12.5% of the log space), but the AIL is empty and we
have the grant heads indicating that there is less than 25% of the
log space free, we are slowly leaking log space somewhere in the CIL
commit or checkpoint path. Given that we've done 1.24 million
checkpoints in the above example, it's not a common thing. Given the
size of the log, it may be related to log wrap commits, and it is
also worth noting that if this is an accounting leak, it will
eventually result in a hard hang.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
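
For anyone wanting to check a dump against the leak condition described
above, here is a minimal sketch in plain C. It is not kernel code: the
struct and function names are made up for illustration, and the only
inputs taken from this thread are the 12.5% CIL push threshold and the
25% free-space figure. The log size has to come from the filesystem in
question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the state under discussion; NOT a kernel struct. */
struct log_snapshot {
	uint64_t log_size;       /* total log size in bytes */
	uint64_t cil_space_used; /* space_used from the CIL context dump */
	uint64_t grant_free;     /* free bytes according to the grant heads */
	bool     ail_empty;      /* true if the AIL holds no items */
};

/* CIL push threshold as stated in the mail: 12.5% of the log space. */
static uint64_t cil_push_threshold(uint64_t log_size)
{
	return log_size / 8;
}

/*
 * Returns true when the snapshot matches the suspected leak signature:
 * the CIL is below its push threshold, the AIL is empty, and the grant
 * heads still report less than 25% of the log as free.
 */
static bool looks_like_accounting_leak(const struct log_snapshot *s)
{
	bool cil_below_threshold =
		s->cil_space_used < cil_push_threshold(s->log_size);
	bool log_nearly_full = s->grant_free < s->log_size / 4;

	return s->ail_empty && cil_below_threshold && log_nearly_full;
}
```

Fill in a struct log_snapshot from the crash dump and call
looks_like_accounting_leak() on it. A true result only says the numbers
fit the pattern; it does not locate the leak in the commit or checkpoint
path.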