Re: xfs umount hang in xfs_ail_push_all_sync on i/o error

On Fri, May 2, 2014 at 6:05 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
On Fri, May 02, 2014 at 11:32:19AM -0600, Bob Mastors wrote:
> On Tue, Apr 29, 2014 at 9:22 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > Well, I guess I should have been more specific. ftrace is pretty much
> > useless here - we need the information that is recorded by the
> > events, not the timing of the events.
> >
> Heh, I have been trying to bluff my way through this.
> I have hacked on file systems but am new to linux and xfs.

I wouldn't have guessed that. Most of the time the response is
"trace events? what are they?". :)

> Output file attached but it all looks like this:
>
>      xfsaild/sdb-3613 [001] 445.757616: xfs_ail_flushing: dev 8:16 lip 0x0xffff88001d3017b8 lsn 32/3573 type XFS_LI_INODE flags IN_AIL
>      xfsaild/sdb-3613 [001] 445.757617: xfs_ail_flushing: dev 8:16 lip 0x0xffff88001d301850 lsn 32/3573 type XFS_LI_INODE flags IN_AIL
>      xfsaild/sdb-3613 [001] 445.757618: xfs_ail_flushing: dev 8:16 lip 0x0xffff88001de63ed8 lsn 32/3573 type XFS_LI_INODE flags IN_AIL
>      xfsaild/sdb-3613 [001] 445.757618: xfs_ail_flushing: dev 8:16 lip 0x0xffff88001cd24720 lsn 32/3573 type XFS_LI_INODE flags IN_AIL
>      xfsaild/sdb-3613 [001] 445.757618: xfs_ail_flushing: dev 8:16 lip 0x0xffff88001cd248e8 lsn 32/3573 type XFS_LI_INODE flags IN_AIL
>      xfsaild/sdb-3613 [001] 445.757619: xfs_ail_flushing: dev 8:16 lip 0x0xffff88002c435ab0 lsn 32/3573 type XFS_LI_INODE flags IN_AIL

That's very enlightening - a bunch of inodes returning
XFS_ITEM_FLUSHING from attempts to push them.

> The addresses repeat.
> I started the record as soon as I noticed the umount hanging, probably
> within 20 seconds or so.
> Also attached is the console output.
>
> This was done with your patch applied.
>
> Seems like xfsaild_push is missing a check that the file system is
> XFS_FORCED_SHUTDOWN.

It might look that way, but what is supposed to happen is that
everything else is supposed to handle shutdown situations and empty
the AIL. i.e. the act of issuing IO by the AIL is supposed to
trigger shutdown processing on the objects, and so having the AIL
shut down itself won't help.

So, what's the problem? Well, the inodes have been flushed,
triggering the inode buffer to be written. However, the inode
buffer has hit an IO error (i.e. the async write error), been
retried, and then we stop retrying the IO. The idea is that the
AIL will push it again soon and the IO will be retried.

What is happening is that the inodes in the buffer are already
marked as "being flushed", so the AIL does not push the inode
buffer again, and hence the IO completion callbacks on the buffer
that would remove the inodes from the AIL are never called. Not
sure how to fix this one yet...
Thanks for the explanation.
It clears up some questions I had.
 

> Let me know if there is any other info I can capture that would be
> interesting.

I know what the problem is now, so testing a fix will be the next
step. But I need to think about how to fix this, and that can wait
till Monday. ;)
Just an FYI: next week is busy for me with some other stuff, so I
won't be able to try out a fix until the following week.
 

Thanks for following up with the trace information, Bob!
Not a problem.

Thanks,
Bob
 

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
