On Wed, Jul 26, 2017 at 10:52:50AM +0530, Shyam Kaushik wrote:
> Hi Dave,
>
> From time to time we hit an xfs corruption where an xattr leaf is about to
> be written while it has ichdr.count == 0.
>
> We are on an old kernel, 3.18.19. We managed to capture the xfs trace
> buffer close to the time of the problem.
>
> This is the corruption report:
>
> Jul 20 05:22:27.189727 node1 kernel: [466290.765299] XFS (dm-78): Metadata corruption detected at xfs_attr3_leaf_write_verify+0xe5/0x100 [xfs], block 0x9698734d8
> Jul 20 05:22:27.189741 node1 kernel: [466290.767279] XFS (dm-78): Unmount and run xfs_repair
> Jul 20 05:22:27.189743 node1 kernel: [466290.768022] XFS (dm-78): First 64 bytes of corrupted metadata buffer:
> Jul 20 05:22:27.191635 node1 kernel: [466290.768832] ffff88002b4bf000: 00 00 00 00 00 00 00 00 fb ee 00 00 00 00 00 00  ................
> Jul 20 05:22:27.192949 node1 kernel: [466290.770163] ffff88002b4bf010: 10 00 00 00 00 20 0f e0 00 00 00 00 00 00 00 00  ..... ..........
> Jul 20 05:22:27.192953 node1 kernel: [466290.771264] ffff88002b4bf020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Jul 20 05:22:27.195921 node1 kernel: [466290.772815] ffff88002b4bf030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>
> From the xfs xattr debug trace, it appears that a file is created within a
> directory that has a default ACL set. This causes file creation to insert
> an xattr leaf, which queues an AIL entry. Immediately after this, an
> explicit xfs_setattr is done on the same file that alters the default ACL,
> which causes the ACL xattr to be removed and re-added. It looks like this
> second setattr somehow influences the first AIL entry, such that the
> buffer is seen with ichdr.count == 0 at write time.
>

I didn't notice anything on a quick pass through the code that would lead
to this situation. It looks to me like the initial attr creates the leaf
(where count is initialized to 0), but then adds the xattr in the same
transaction (which should bump ichdr.count). The next set of SGI_ACL_FILE
calls xfs_attr_leaf_add[_work]() before it rolls the transaction, which
looks like it should bump ichdr.count again before we flip the incomplete
flags and remove the old attr. The output above clearly shows
ichdr.count == 0, however, so I could easily be missing something, or
something else could be going on here.

Unless somebody else chimes in with something more concrete, could you add
a trace_printk() statement to xfs_attr3_leaf_verify() at the point where
the verifier fails and re-collect the above tracepoint data? You could
print out bp->b_bn just so we can match the log message to the tracepoint,
but otherwise I'm mostly curious to see where in the sequence of the xattr
replace the verifier happens to trigger and fail.

Note that when you enable the attr tracepoints, you may have to include
'-e print' to make sure the trace_printk() output is captured as well.
E.g., something like 'trace-cmd start -e xfs:xfs_attr_* -e print' should
work.
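To be concrete, something like the following is what I have in mind. This
is only a minimal sketch against the 3.18-era shape of
xfs_attr3_leaf_verify(); the exact surrounding code in your tree may
differ, and the printed fields are just a suggestion:

	/* fs/xfs/libxfs/xfs_attr_leaf.c, inside xfs_attr3_leaf_verify() */
	if (ichdr.count == 0) {
		/* lands in the same ftrace buffer as the xfs_attr_* events */
		trace_printk("attr leaf verify fail: bno 0x%llx count %d\n",
			     (unsigned long long)bp->b_bn, ichdr.count);
		return false;
	}

Brian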
> I can pass the full xfs debug trace if required. Here is the snippet of
> the debug trace causing the issue:
>
> <...>-18953 [004] ...1 466290.749358: xfs_attr_sf_create: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.749359: xfs_attr_sf_addname: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.749361: xfs_attr_sf_lookup: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.749362: xfs_attr_sf_to_leaf: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.749430: xfs_attr_leaf_create: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.749454: xfs_attr_leaf_addname: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.749456: xfs_attr_leaf_lookup: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.749457: xfs_attr_leaf_add: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.749458: xfs_attr_leaf_add_work: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.749649: xfs_getattr: dev 253:78 ino 0x1749833903
> <...>-18953 [004] ...1 466290.749908: xfs_attr_leaf_get: dev 253:78 ino 0x1749833903 name NTACL namelen 5 valuelen 1024 hashval 0xea9061c8 op_flags
> <...>-18953 [004] ...1 466290.749910: xfs_attr_leaf_lookup: dev 253:78 ino 0x1749833903 name NTACL namelen 5 valuelen 1024 hashval 0xea9061c8 op_flags
> <...>-18953 [004] ...1 466290.749918: xfs_getattr: dev 253:78 ino 0x1749833903
> <...>-18953 [004] ...1 466290.749931: xfs_attr_leaf_get: dev 253:78 ino 0x1749833903 name SAMBA_PAI namelen 9 valuelen 1024 hashval 0x487716dd op_flags
> <...>-18953 [004] ...1 466290.749932: xfs_attr_leaf_lookup: dev 253:78 ino 0x1749833903 name SAMBA_PAI namelen 9 valuelen 1024 hashval 0x487716dd op_flags
> <...>-18953 [004] ...1 466290.749984: xfs_getattr: dev 253:78 ino 0x1749833903
> <...>-18953 [004] ...1 466290.749990: xfs_getattr: dev 253:78 ino 0x1749833903
> <...>-18953 [004] ...1 466290.750063: xfs_setattr: dev 253:78 ino 0x1749833903
> <...>-18953 [004] ...1 466290.750257: xfs_attr_leaf_addname: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750260: xfs_attr_leaf_lookup: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750261: xfs_attr_leaf_replace: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750262: xfs_attr_leaf_add: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags RENAME|ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750262: xfs_attr_leaf_add_work: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags RENAME|ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750270: xfs_attr_leaf_flipflags: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags RENAME|ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750275: xfs_attr_leaf_remove: dev 253:78 ino 0x1749833903 name SGI_ACL_FILE namelen 12 valuelen 100 hashval 0x275b19c4 op_flags RENAME|ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750461: xfs_attr_leaf_removename: dev 253:78 ino 0x1749833903 name NTACL namelen 5 valuelen 0 hashval 0xea9061c8 op_flags OKNOENT
> <...>-18953 [004] ...1 466290.750463: xfs_attr_leaf_lookup: dev 253:78 ino 0x1749833903 name NTACL namelen 5 valuelen 0 hashval 0xea9061c8 op_flags OKNOENT
> <...>-18953 [004] ...1 466290.750474: xfs_attr_leaf_addname: dev 253:78 ino 0x1749833903 name SAMBA_PAI namelen 9 valuelen 49 hashval 0x487716dd op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750475: xfs_attr_leaf_lookup: dev 253:78 ino 0x1749833903 name SAMBA_PAI namelen 9 valuelen 49 hashval 0x487716dd op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750476: xfs_attr_leaf_add: dev 253:78 ino 0x1749833903 name SAMBA_PAI namelen 9 valuelen 49 hashval 0x487716dd op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750476: xfs_attr_leaf_add_work: dev 253:78 ino 0x1749833903 name SAMBA_PAI namelen 9 valuelen 49 hashval 0x487716dd op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750645: xfs_getattr: dev 253:78 ino 0x1749833903
> <...>-18953 [004] ...1 466290.750648: xfs_getattr: dev 253:78 ino 0x1749833903
> <...>-18953 [004] ...1 466290.750659: xfs_attr_leaf_get: dev 253:78 ino 0x1749833903 name SAMBA_PAI namelen 9 valuelen 1024 hashval 0x487716dd op_flags
> <...>-18953 [004] ...1 466290.750661: xfs_attr_leaf_lookup: dev 253:78 ino 0x1749833903 name SAMBA_PAI namelen 9 valuelen 1024 hashval 0x487716dd op_flags
> <...>-18953 [004] ...1 466290.750744: xfs_getattr: dev 253:78 ino 0x1749833903
> <...>-18953 [004] ...1 466290.750791: xfs_attr_leaf_addname: dev 253:78 ino 0x1749833903 name NTACL namelen 5 valuelen 444 hashval 0xea9061c8 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750793: xfs_attr_leaf_lookup: dev 253:78 ino 0x1749833903 name NTACL namelen 5 valuelen 444 hashval 0xea9061c8 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750794: xfs_attr_leaf_add: dev 253:78 ino 0x1749833903 name NTACL namelen 5 valuelen 444 hashval 0xea9061c8 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.750794: xfs_attr_leaf_add_work: dev 253:78 ino 0x1749833903 name NTACL namelen 5 valuelen 444 hashval 0xea9061c8 op_flags ADDNAME|OKNOENT
> <...>-18953 [004] ...1 466290.751029: xfs_attr_leaf_get: dev 253:78 ino 0x1749833903 name NTACL namelen 5 valuelen 1024 hashval 0xea9061c8 op_flags
> <...>-18953 [004] ...1 466290.751031: xfs_attr_leaf_lookup: dev 253:78 ino 0x1749833903 name NTACL namelen 5 valuelen 1024 hashval 0xea9061c8 op_flags
>
> Any pointers/input to check this further is greatly appreciated! Thanks.
>
> --Shyam
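(For what it's worth, the trace above suggests roughly the workload
pattern sketched below: a create in a directory carrying a default ACL,
followed immediately by an ACL replace on the new file. This is only an
illustration of the described sequence, not a confirmed reproducer -- the
real workload is 16 threads, and the /mnt/xfs paths are made up. It
assumes libacl, so build with -lacl; error checking is omitted for
brevity.)

	/* repro-sketch.c: single-threaded approximation of the pattern */
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <sys/acl.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		acl_t def = acl_from_text("u::rwx,g::r-x,o::r-x");
		acl_t repl = acl_from_text("u::rw-,g::r--,o::r--");
		int fd;

		/* default ACL on the dir: new files inherit SGI_ACL_FILE */
		mkdir("/mnt/xfs/dir", 0755);
		acl_set_file("/mnt/xfs/dir", ACL_TYPE_DEFAULT, def);

		/* per the trace, the create went sf_create -> sf_to_leaf */
		fd = open("/mnt/xfs/dir/file", O_CREAT | O_RDWR, 0644);

		/* immediate ACL replace, like the xfs_setattr in the trace */
		acl_set_fd(fd, repl);

		close(fd);
		acl_free(def);
		acl_free(repl);
		return 0;
	}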
>
> On Tue, Apr 26, 2016 at 4:10 PM, Shyam Kaushik <shyam@xxxxxxxxxxxxxxxxx> wrote:
> Hi Eric,
>
> Yes, this is a non-CRC FS. The workload is a custom application that does
> several (16 threads of) file creates/reads/writes and xattr updates on an
> XFS mounted over a raw disk.
>
> There are no messages prior to this, and we hit this once every few days
> (like every 4-5 days). I will try your suggestion of running with the
> xfs_attr_* tracepoints enabled all the time, so that when we hit this
> error we have better info. Please let me know if you would prefer to add
> further debug prints. Thanks.
>
> --Shyam
>
>
> Eric Sandeen sandeen at sandeen.net wrote:
> > On Mon Apr 25 16:57:11 CDT 2016
> >
> > Is this a non-crc filesystem?
> >
> > XFS_ATTR_LEAF_MAGIC is ok (if it's a non-crc filesystem).
> >
> > Looks the same as the other report, tripping on:
> >
> >         if (ichdr.count == 0)
> >                 return false;
> >
> > A reproducer would be super here. At least maybe a description of the
> > workload that hits it?
> >
> > Count is manipulated in things like attr leaf compaction...
> > Any other messages prior to this?
> >
> > How often do you hit it?
> >
> > You could also turn on xfs_attr_* tracepoints, maybe.
> >
> > -Eric
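(As an aside, the 64-byte dumps in the two reports above and below are
byte-for-byte identical. If they are overlaid correctly on the non-CRC
xfs_attr_leaf_hdr layout -- assuming a 4k filesystem block, so treat this
as an annotated reading rather than a verified decode -- the buffer looks
like a freshly initialized, empty leaf, i.e. the state leaf creation
leaves behind before any entry is added:)

	/* dumped bytes overlaid on xfs_attr_leaf_hdr (fields big-endian) */
	/* 0x00  info.forw   00 00 00 00                                  */
	/* 0x04  info.back   00 00 00 00                                  */
	/* 0x08  info.magic  fb ee        XFS_ATTR_LEAF_MAGIC, per Eric   */
	/* 0x0c  count       00 00        the check that trips            */
	/* 0x0e  usedbytes   00 00                                        */
	/* 0x10  firstused   10 00        0x1000 == block size            */
	/* 0x12  holes, pad  00 00                                        */
	/* 0x14  freemap[0]  00 20 0f e0  base 32 (hdr size), size 4064   */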
>
> -----Original Message-----
> From: Shyam Kaushik [mailto:shyam@xxxxxxxxxxxxxxxxx]
> Sent: 25 April 2016 22:22
> To: 'xfs@xxxxxxxxxxx'
> Cc: Alex Lyakas
> Subject: Metadata corruption detected at xfs_attr3_leaf_write_verify+0xe5/0x100
>
> Hi Dave et al,
>
> We are periodically hitting the below metadata corruption with XFS over a
> raw disk while running several file copies with xattr operations on kernel
> 3.18.19. Unmounting and running xfs_repair doesn't report any corruption.
> I see that this was last reported here:
> http://oss.sgi.com/archives/xfs/2015-12/msg00224.html
>
> Unfortunately we don't have a reproducer, but this issue happens
> periodically. We can add more debug prints and allow this issue to happen
> again. Can you please suggest any options to debug this further? Thanks.
>
> Apr 20 21:58:03 node1 kernel: [16736.286370] XFS (dm-26): Metadata corruption detected at xfs_attr3_leaf_write_verify+0xe5/0x100 [xfs], block 0x19c5c728
> Apr 20 21:58:03 node1 kernel: [16736.289084] XFS (dm-26): Unmount and run xfs_repair
> Apr 20 21:58:03 node1 kernel: [16736.290257] XFS (dm-26): First 64 bytes of corrupted metadata buffer:
> Apr 20 21:58:03 node1 kernel: [16736.291797] ffff880123668000: 00 00 00 00 00 00 00 00 fb ee 00 00 00 00 00 00  ................
> Apr 20 21:58:03 node1 kernel: [16736.293823] ffff880123668010: 10 00 00 00 00 20 0f e0 00 00 00 00 00 00 00 00  ..... ..........
> Apr 20 21:58:03 node1 kernel: [16736.297504] ffff880123668020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Apr 20 21:58:03 node1 kernel: [16736.299343] ffff880123668030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Apr 20 21:58:03 node1 kernel: [16736.301465] XFS (dm-26): xfs_do_force_shutdown(0x8) called from line 1244 of file fs/xfs/xfs_buf.c. Return address = 0xffffffffc095cee0
> Apr 20 21:58:03 node1 kernel: [16736.301469] ------------[ cut here ]------------
> Apr 20 21:58:03 node1 kernel: [16736.301551] XFS (dm-26): SHUTDOWN!!! old_flags=0x0 new_flags=0x8
> Apr 20 21:58:03 node1 kernel: [16736.301703] CPU: 1 PID: 7857 Comm: xfsaild/dm-26 Tainted: G OE 3.18.19 #1
> Apr 20 21:58:03 node1 kernel: [16736.301705] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> Apr 20 21:58:03 node1 kernel: [16736.301707]  0000000000000009 ffff88020c5ffb38 ffffffff81710c85 0000000000000000
> Apr 20 21:58:03 node1 kernel: [16736.301711]  ffff88020c5ffb88 ffff88020c5ffb78 ffffffff81072df1 2e2e202030302030
> Apr 20 21:58:03 node1 kernel: [16736.301715]  0000000000000000 0000000000000008 ffff88020c127000 0000000000000000
> Apr 20 21:58:03 node1 kernel: [16736.301718] Call Trace:
> Apr 20 21:58:03 node1 kernel: [16736.301769]  [<ffffffff81710c85>] dump_stack+0x4e/0x71
> Apr 20 21:58:03 node1 kernel: [16736.301780]  [<ffffffff81072df1>] warn_slowpath_common+0x81/0xa0
> Apr 20 21:58:03 node1 kernel: [16736.301784]  [<ffffffff81072e56>] warn_slowpath_fmt+0x46/0x50
> Apr 20 21:58:03 node1 kernel: [16736.301860]  [<ffffffffc09693f3>] xfs_do_force_shutdown+0x33/0x170 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.301921]  [<ffffffffc095cee0>] ? _xfs_buf_ioapply+0xa0/0x430 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.301951]  [<ffffffffc095ee4b>] ? __xfs_buf_delwri_submit+0x22b/0x290 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.302066]  [<ffffffffc095cee0>] _xfs_buf_ioapply+0xa0/0x430 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.302137]  [<ffffffff8109e260>] ? wake_up_state+0x20/0x20
> Apr 20 21:58:03 node1 kernel: [16736.302162]  [<ffffffffc095ee4b>] ? __xfs_buf_delwri_submit+0x22b/0x290 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.302183]  [<ffffffffc095ea78>] xfs_buf_submit+0x68/0x210 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.302241]  [<ffffffffc095ee4b>] __xfs_buf_delwri_submit+0x22b/0x290 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.302267]  [<ffffffffc095fc60>] ? xfs_buf_delwri_submit_nowait+0x20/0x30 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.302291]  [<ffffffffc098f440>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.302308]  [<ffffffffc095fc60>] xfs_buf_delwri_submit_nowait+0x20/0x30 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.302333]  [<ffffffffc098f66b>] xfsaild+0x22b/0x630 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.302513]  [<ffffffffc098f440>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> Apr 20 21:58:03 node1 kernel: [16736.302518]  [<ffffffff810911b9>] kthread+0xc9/0xe0
> Apr 20 21:58:03 node1 kernel: [16736.302522]  [<ffffffff810910f0>] ? kthread_create_on_node+0x180/0x180
> Apr 20 21:58:03 node1 kernel: [16736.302530]  [<ffffffff81717918>] ret_from_fork+0x58/0x90
> Apr 20 21:58:03 node1 kernel: [16736.302549]  [<ffffffff810910f0>] ? kthread_create_on_node+0x180/0x180
> Apr 20 21:58:03 node1 kernel: [16736.302551] ---[ end trace 0bb81b88fdd6a298 ]---
>
> --Shyam