Hi Shyam, do you mind sharing your systemtap script with us? I'd like to take a look at it, and Brian will probably be interested too. On Fri, Apr 08, 2016 at 07:05:24PM +0530, Shyam Kaushik wrote: > Hi Brian, > > Yes, the below report is a bit different from the original report. I wanted > to confirm whether the new patches from Dave still hit these odd > crashes/hung tasks. > > I applied Dave's patch & managed to recreate the original issue. With > systemtap I do see that the FS has been marked for shutdown (xfs_aildpush() - > mp:0xffff880062d09800 m_flags:4325656), but the infinite retries still keep > happening with xfsaild_push() seeing XFS_IFLOCK on the inode. Pls note > that this issue happens only when I see > > XFS (dm-10): Detected failing async write on buffer block 0x368d76a0. > Retrying async write. > > in the logs. In all other cases, this problem doesn't happen. So there is > something about these async-write retries that leaves the inode in the > IFLOCK state, which the rest of the aildpush path is not able to handle. > > --Shyam > > -----Original Message----- > From: Brian Foster [mailto:bfoster@xxxxxxxxxx] > Sent: 08 April 2016 18:47 > To: Shyam Kaushik > Cc: david@xxxxxxxxxxxxx; xfs@xxxxxxxxxxx; Alex Lyakas > Subject: Re: XFS hung task in xfs_ail_push_all_sync() when unmounting FS > after disk failure/recovery > > On Fri, Apr 08, 2016 at 04:21:02PM +0530, Shyam Kaushik wrote: > > Hi Dave, Brian, Carlos, > > > > While trying to reproduce this issue I have been running into different > > issues that are similar. The underlying issue remains the same: when the backend > to > > XFS fails & we unmount XFS, we run into the hung-task timeout (180 secs) > > with a stack like > > > > kernel: [14952.671131] [<ffffffffc06a5f59>] > > xfs_ail_push_all_sync+0xa9/0xe0 [xfs] > > kernel: [14952.671139] [<ffffffff810b26b0>] ? > > prepare_to_wait_event+0x110/0x110 > > kernel: [14952.671181] [<ffffffffc0690111>] xfs_unmountfs+0x61/0x1a0 > > [xfs] > > > > while running trace-events, XFS ail push keeps looping around > > > > xfsaild/dm-10-21143 [001] ...2 17878.555133: xfs_ilock_nowait: dev > > 253:10 ino 0x0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-21143 [001] ...2 17878.555138: xfs_iunlock: dev 253:10 > > ino 0x0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-21143 [001] ...2 17878.555138: xfs_ail_push: dev 253:10 > > lip 0xffff880031f6ae40 lsn 1/73448 type XFS_LI_INODE flags IN_AIL > > xfsaild/dm-10-21143 [001] ...2 17878.605136: xfs_ilock_nowait: dev > > 253:10 ino 0x0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-21143 [001] ...2 17878.605141: xfs_iunlock: dev 253:10 > > ino 0x0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-21143 [001] ...2 17878.605142: xfs_ail_push: dev 253:10 > > lip 0xffff880031f6ae40 lsn 1/73448 type XFS_LI_INODE flags IN_AIL > > > > observe ino==0x0 (it's a freed inode) but the corresponding lip stays > with > > xfsaild push. I had a systemtap script running when this problem happened & the > > following keeps repeating. > > > > That's interesting, and looks different from the original report in > terms of the inode number being cleared. The original report looks like > it has a valid inode and there's some problematic sequence where it's > not being removed from the AIL in the event of errors. > > I think at this point we know that XFS attempts to retry these I/Os > indefinitely. Dave has outstanding patches to deal with this issue. The > question Dave raised above was whether the filesystem shut down, and if > not, why (as a failed log I/O should cause a shutdown)? Carlos was > looking into this, it appears... > > > xfs_inode_item_push() - inode:0xffff88003d8ca000 m_flags:0x420118 > > flags:0x1 > > xfs_iflush() - inode:ffff88003d8ca000 aborting for forced shutdown > > xfs_iflush_abort() - iip:0x0 > > > > xfs_inode_item_push() is doing xfs_iflush() but it sees the FS_SHUTDOWN flag > > marked & does an xfs_iflush_abort(). xfs_iflush_abort() sees i_itemp==NULL > > on the lip & doesn't do anything. So we have this stale lip that is no > > longer associated with the inode, which keeps xfs_ail_push_all_sync() busy.
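For reference, a minimal sketch of the kind of SystemTap instrumentation that would produce output along those lines (this is not Shyam's actual script; the probe points and struct fields li_flags, i_mount->m_flags and i_itemp are assumptions taken from the 3.18 xfs sources, and resolving them needs xfs debuginfo installed):

cat > xfsaild-debug.stp <<'EOF'
# Log every AIL push attempt for an inode log item.
probe module("xfs").function("xfs_inode_item_push")
{
    printf("xfs_inode_item_push() - lip:%p li_flags:0x%x\n",
           $lip, $lip->li_flags)
}

# Show the inode and the mount flags whenever a flush is attempted.
probe module("xfs").function("xfs_iflush")
{
    printf("xfs_iflush() - inode:%p ino:%d m_flags:0x%x\n",
           $ip, $ip->i_ino, $ip->i_mount->m_flags)
}

# Report whether the inode still has a log item attached when the flush is aborted.
probe module("xfs").function("xfs_iflush_abort")
{
    printf("xfs_iflush_abort() - iip:%p\n", $ip->i_itemp)
}
EOF

stap -v xfsaild-debug.stp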
> > On a few other occurrences, I have had NULL-pointer dereferences or other > > sorts of crashes at this line > > > > xfsaild_push() > > lock_result = lip->li_ops->iop_push(lip, > > &ailp->xa_buf_list); > > > > With debug prints, in one of the occurrences lip->li_ops was NULL & in > > another lip->li_ops was pointing to a bad pointer whose subsequent > > dereference crashed it. This also indicates that a bad/freed lip was > > inserted that xfsaild_push() is working on. > > > > I hit upon Dave's patch "xfs: xfs_inode_free() isn't RCU safe" & realized > > that this could explain the above issue, where a lip that has been freed is > > wrongly referenced & later we could even have the inode disconnected. What > > do you think? > > > > In any case, I uploaded the full xfs trace events from before this issue started > > till the end. It is at > > https://www.dropbox.com/s/21qstz4ld1gn8yi/xfs_trace_pipe.gz?dl=0 > > > > It seems like it could be related. I didn't catch anything obvious from > the trace, but there's a lot of data there. The RCU-unsafe issue was > difficult to track down without instrumentation... I'm not sure that will > be evident from the trace. The best thing to do wrt that might be to > try with Dave's patches, as that so far appears to address the problem. > (In fact, it might be worth it to try Dave's shutdown-on-umount patch > he referred to up-thread to address the original problem as well). > > Brian > > > Pls let me know. Thanks. > > > > --Shyam > > > > -----Original Message----- > > From: Shyam Kaushik [mailto:shyam@xxxxxxxxxxxxxxxxx] > > Sent: 23 March 2016 15:23 > > To: 'Brian Foster' > > Cc: 'xfs@xxxxxxxxxxx'; Alex Lyakas > > Subject: RE: XFS hung task in xfs_ail_push_all_sync() when unmounting FS > > after disk failure/recovery > > > > Hi Carlos, > > > > w.r.t. your question below > > > > >>> Shyam, after you reconnected the disks, do the messages about failed > > async metadata > > >>> writes stop being logged? > > > > After I reconnect the disks, messages about failed async metadata writes > > stop being logged. But the key thing is that messages like > > > > XFS (dm-29): xfs_log_force: error -5 returned > > > > keep repeating every 30 secs, which indicates that there is some buf/io > > error status that is being carried forward. > > > > >>> Any chance you can reliably reproduce it? > > > > Yes I have a way to reproduce it, but it's not reliable. What I do is > set up > > a dm-linear over a disk. Create XFS, mount it & trigger a few copy threads to > > copy varying-sized files into the FS. At this point I pull out the drive > > (with scsi-remove-single-device in /proc/scsi/scsi) & in a short while > > convert the dm-linear to dm-ioerror. Then I bring back the underlying > > drive, convert dm-ioerror back to dm-linear & try to unmount XFS.
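As an aside, the dm-linear to dm-error switch in that sequence can also be scripted directly with dmsetup, without the manual scsi-remove-single-device step. A rough sketch, with placeholder device, mapping and mount-point names (/dev/sdX, xfstest, /mnt/test):

DEV=/dev/sdX                                  # backing disk (placeholder)
SECTORS=$(blockdev --getsz "$DEV")            # size in 512-byte sectors

dmsetup create xfstest --table "0 $SECTORS linear $DEV 0"
mkfs.xfs -f /dev/mapper/xfstest
mount /dev/mapper/xfstest /mnt/test

# ...start a few parallel file copies into /mnt/test, then fail the device:
dmsetup suspend --noflush xfstest
dmsetup load xfstest --table "0 $SECTORS error"
dmsetup resume xfstest

# later, "reconnect" the disk and attempt the unmount that hangs:
dmsetup suspend --noflush xfstest
dmsetup load xfstest --table "0 $SECTORS linear $DEV 0"
dmsetup resume xfstest
umount /mnt/test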
> > > > This problem somehow happens on a newly created XFS. If I copy several > > files into XFS, delete them & then copy them again, and repeat the drive > > failure/recovery experiment, it doesn't reproduce. > > > > Thanks. > > > > --Shyam > > > > Subject: Re: XFS hung task in xfs_ail_push_all_sync() when unmounting FS > > after disk failure/recovery > > From: Carlos Maiolino <cmaiolino@xxxxxxxxxx> > > Date: Tue, 22 Mar 2016 16:38:25 +0100 > > In-reply-to: <20160322140345.GA54245@bfoster.bfoster> > > Mail-followup-to: xfs@xxxxxxxxxxx > > User-agent: Mutt/1.5.24 (2015-08-30) > > > > Hi Brian, > > > > These traces, and the stack trace presented, look quite similar to the > > one we were discussing a few days ago, using a dm-thin snapshot. > > > > Looks like the same bug I've been hunting, and Shyam confirmed my > > hypothesis > > that this bug can be reproduced with a regular device. > > > > If it's the same bug, yes, I reproduced it using an upstream kernel. > > > > The difference between the two (this bug and the one I've been working on) is > > how > > xfs actually behaves when async metadata writes fail. Other than that, it > > pretty > > much looks the same. > > > > Trying to unmount the filesystem hangs in xfs_log_force(), which is basically > > the > > reason I submitted the patch to include the caller in the xfs_log_force > > trace. I'd > > like to see ftrace traces from this system with that patch if possible. > > > > I didn't have time to keep working on this for the past few days, but > > it looks like > > it's time to come back to it. > > > > Shyam, after you reconnected the disks, do the messages about failed async > > metadata > > writes stop being logged? > > > > Any chance you can reliably reproduce it? > > > > I'm not an xfs journal expert, but it looks like the writeback of items in > > the AIL > > got stuck due to the IO errors and was never completed, but I don't know > > what I > > should expect after the disk is reconnected. > > > > In my case though, with an upstream kernel, I didn't get an XFS_SHUTDOWN until > > I > > tried to unmount the filesystem, which differs from this case, where xfs > > looks > > to have shut down the filesystem after a few tries to write back the > > metadata. > > > > Anyway, I can dig more into it this week, if nobody knows what is going on > > before I do it :) > > > > -----Original Message----- > > From: Shyam Kaushik [mailto:shyam@xxxxxxxxxxxxxxxxx] > > Sent: 22 March 2016 18:32 > > To: 'Brian Foster' > > Cc: 'xfs@xxxxxxxxxxx'; Alex Lyakas > > Subject: RE: XFS hung task in xfs_ail_push_all_sync() when unmounting FS > > after disk failure/recovery > > > > Hi Brian, > > > > Thanks for your quick reply.
I repeated the test & trace-pipe is > > constantly filled with this: > > > > xfsaild/dm-10-3335 [003] ...2 24890.546491: xfs_ilock_nowait: dev > > 253:10 ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.546492: xfs_iunlock: dev 253:10 > > ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.546493: xfs_ail_flushing: dev > > 253:10 lip 0xffff8800a9f437b8 lsn 1/38624 type XFS_LI_INODE flags IN_AIL > > xfsaild/dm-10-3335 [003] ...2 24890.596491: xfs_ilock_nowait: dev > > 253:10 ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.596492: xfs_iunlock: dev 253:10 > > ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.596494: xfs_ail_flushing: dev > > 253:10 lip 0xffff8800a9f437b8 lsn 1/38624 type XFS_LI_INODE flags IN_AIL > > xfsaild/dm-10-3335 [003] ...2 24890.646497: xfs_ilock_nowait: dev > > 253:10 ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.646498: xfs_iunlock: dev 253:10 > > ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.646500: xfs_ail_flushing: dev > > 253:10 lip 0xffff8800a9f437b8 lsn 1/38624 type XFS_LI_INODE flags IN_AIL > > xfsaild/dm-10-3335 [003] ...2 24890.696467: xfs_ilock_nowait: dev > > 253:10 ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.696468: xfs_iunlock: dev 253:10 > > ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.696468: xfs_ail_flushing: dev > > 253:10 lip 0xffff8800a9f437b8 lsn 1/38624 type XFS_LI_INODE flags IN_AIL > > xfsaild/dm-10-3335 [003] ...2 24890.746548: xfs_ilock_nowait: dev > > 253:10 ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.746550: xfs_iunlock: dev 253:10 > > ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.746550: xfs_ail_flushing: dev > > 253:10 lip 0xffff8800a9f437b8 lsn 1/38624 type XFS_LI_INODE flags IN_AIL > > xfsaild/dm-10-3335 [003] ...2 24890.796479: xfs_ilock_nowait: dev > > 253:10 ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.796480: xfs_iunlock: dev 253:10 > > ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.796480: xfs_ail_flushing: dev > > 253:10 lip 0xffff8800a9f437b8 lsn 1/38624 type XFS_LI_INODE flags IN_AIL > > xfsaild/dm-10-3335 [003] ...2 24890.846467: xfs_ilock_nowait: dev > > 253:10 ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 24890.846468: xfs_iunlock: dev 253:10 > > ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > > > > > while regular activity seems to happen on other inodes/kworker threads > > > > kworker/u8:4-27691 [001] ...1 24895.811474: xfs_writepage: dev > 253:10 > > ino 0x1801061 pgoff 0x29000 size 0x1aebbc offset 0 length 0 delalloc 1 > > unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.811477: xfs_invalidatepage: dev > > 253:10 ino 0x1801061 pgoff 0x29000 size 0x1aebbc offset 0 length 1000 > > delalloc 1 unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.811478: xfs_releasepage: dev > > 253:10 ino 0x1801061 pgoff 0x29000 size 0x1aebbc offset 0 length 0 > > delalloc 0 unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.811482: 
xfs_writepage: dev > 253:10 > > ino 0x4017bdf pgoff 0x29000 size 0x1aebbc offset 0 length 0 delalloc 1 > > unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.811482: xfs_invalidatepage: dev > > 253:10 ino 0x4017bdf pgoff 0x29000 size 0x1aebbc offset 0 length 1000 > > delalloc 1 unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.811483: xfs_releasepage: dev > > 253:10 ino 0x4017bdf pgoff 0x29000 size 0x1aebbc offset 0 length 0 > > delalloc 0 unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.811485: xfs_writepage: dev > 253:10 > > ino 0x68048c3 pgoff 0x29000 size 0x1aebbc offset 0 length 0 delalloc 1 > > unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.811486: xfs_invalidatepage: dev > > 253:10 ino 0x68048c3 pgoff 0x29000 size 0x1aebbc offset 0 length 1000 > > delalloc 1 unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.811486: xfs_releasepage: dev > > 253:10 ino 0x68048c3 pgoff 0x29000 size 0x1aebbc offset 0 length 0 > > delalloc 0 unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.812381: xfs_writepage: dev > 253:10 > > ino 0x1805e37 pgoff 0x29000 size 0x68470 offset 0 length 0 delalloc 1 > > unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.812382: xfs_invalidatepage: dev > > 253:10 ino 0x1805e37 pgoff 0x29000 size 0x68470 offset 0 length 1000 > > delalloc 1 unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.812382: xfs_releasepage: dev > > 253:10 ino 0x1805e37 pgoff 0x29000 size 0x68470 offset 0 length 0 > delalloc > > 0 unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.812385: xfs_writepage: dev > 253:10 > > ino 0x4019c95 pgoff 0x29000 size 0x68470 offset 0 length 0 delalloc 1 > > unwritten 0 > > kworker/u8:4-27691 [001] ...1 24895.812385: xfs_invalidatepage: dev > > 253:10 ino 0x4019c95 pgoff 0x29000 size 0x68470 offset 0 length 1000 > > delalloc 1 unwritten 0 > > > > > > looks like xfsaild is not able to take lock until hung-task timeout > kicks > > in > > > > xfsaild/dm-10-3335 [003] ...2 25247.649468: xfs_ilock_nowait: dev > > 253:10 ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 25247.649469: xfs_iunlock: dev 253:10 > > ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 25247.649469: xfs_ail_flushing: dev > > 253:10 lip 0xffff8800a9f437b8 lsn 1/38624 type XFS_LI_INODE flags IN_AIL > > xfsaild/dm-10-3335 [003] ...2 25247.699478: xfs_ilock_nowait: dev > > 253:10 ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 25247.699516: xfs_iunlock: dev 253:10 > > ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 25247.699517: xfs_ail_flushing: dev > > 253:10 lip 0xffff8800a9f437b8 lsn 1/38624 type XFS_LI_INODE flags IN_AIL > > xfsaild/dm-10-3335 [003] ...2 25247.749471: xfs_ilock_nowait: dev > > 253:10 ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 25247.749478: xfs_iunlock: dev 253:10 > > ino 0xc0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs] > > xfsaild/dm-10-3335 [003] ...2 25247.749479: xfs_ail_flushing: dev > > 253:10 lip 0xffff8800a9f437b8 lsn 1/38624 type XFS_LI_INODE flags IN_AIL > > > > Please let me know how to debug this further. Thanks. 
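One quick way to get more data while the unmount is wedged like this, without waiting for the 180-second hung-task splat, is to dump the blocked stacks directly. A small sketch, assuming root, /proc/<pid>/stack support and sysrq enabled:

# stacks of the xfsaild kernel threads (e.g. xfsaild/dm-10 in the traces above)
for pid in $(pgrep xfsaild); do
    echo "== xfsaild pid $pid =="
    cat /proc/$pid/stack
done

# stack of the stuck umount itself
cat /proc/$(pgrep -x umount)/stack

# or dump all uninterruptible tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200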
> > > > --Shyam > > -----Original Message----- > > From: Brian Foster [mailto:bfoster@xxxxxxxxxx] > > Sent: 22 March 2016 17:49 > > To: Shyam Kaushik > > Cc: xfs@xxxxxxxxxxx; Alex Lyakas > > Subject: Re: XFS hung task in xfs_ail_push_all_sync() when unmounting FS > > after disk failure/recovery > > > > On Tue, Mar 22, 2016 at 04:51:39PM +0530, Shyam Kaushik wrote: > > > Hi XFS developers, > > > > > > We are seeing the following issue with XFS on kernel 3.18.19. > > > > > > We have XFS mounted over a raw disk. Disk was pulled out manually. > There > > > were async writes on files that were errored like this > > > > > ... > > > > > > And XFS hit metadata & Log IO errors that it decides to shutdown: > > > > > > Mar 16 16:03:22 host0 kernel: [ 4637.351841] XFS (dm-29): metadata I/O > > > error: block 0x3a27fbd0 ("xlog_iodone") error 5 numblks 64 > > > Mar 16 16:03:22 host0 kernel: [ 4637.352820] XFS(dm-29): SHUTDOWN!!! > > > old_flags=0x0 new_flags=0x2 > > > Mar 16 16:03:22 host0 kernel: [ 4637.353187] XFS (dm-29): Log I/O > Error > > > Detected. Shutting down filesystem > > ... > > > Later the drive was re-inserted back. After the drive was re-inserted, > > XFS > > > was attempted to be unmounted > > > > > > Mar 16 16:16:53 host0 controld: [2557] [ ] umount[202] > > > : umount(/sdisk/vol5b0, xfs) > > > > > > But nothing happens except for the 30-secs xfs_log_force errors that > > keeps > > > repeating > > > > > ... > > > > > > This problem doesn't happen consistently, but happens periodically > with > > a > > > drive failure/recovery followed by XFS unmount. I couldn't find this > > issue > > > fixed in later kernels. Can you please suggest how I can debug this > > issue > > > further? > > > > > > > Similar problems have been reproduced due to racy/incorrect EFI/EFD > > object tracking, which are internal data structures associated with > > freeing extents. > > > > What happens if you enable tracepoints while the fs is in this hung > > unmount state? > > > > # trace-cmd start -e "xfs:*" > > # cat /sys/kernel/debug/tracing/trace_pipe > > > > Brian > > > > > Thanks! > > > > > > --Shyam > > > > > > _______________________________________________ > > > xfs mailing list > > > xfs@xxxxxxxxxxx > > > http://oss.sgi.com/mailman/listinfo/xfs > > > > _______________________________________________ > > xfs mailing list > > xfs@xxxxxxxxxxx > > http://oss.sgi.com/mailman/listinfo/xfs > > _______________________________________________ > xfs mailing list > xfs@xxxxxxxxxxx > http://oss.sgi.com/mailman/listinfo/xfs -- Carlos _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs
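A small variation on the trace-cmd invocation above that records to a file rather than reading trace_pipe interactively, which makes it easier to post the result to the list (same "xfs:*" event glob; assumes a trace-cmd version that supports -o/-i):

# record all xfs events to a file; Ctrl-C once the unmount is wedged
trace-cmd record -e "xfs:*" -o xfs-hang.dat

# then pull out just the AIL-related events
trace-cmd report -i xfs-hang.dat | grep -E 'xfs_ail_push|xfs_ail_flushing' | tail -n 50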