We have now hit the stack trace below, or a very similar one, roughly six times in our Mesos clusters. My best guess from code analysis is that we are unable to allocate a new node in the allocation group btree free list (or something much weirder). There is plenty of RAM and "space" left on the filesystem at that point, though.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
kernel: XFS (dm-4): Internal error XFS_WANT_CORRUPTED_GOTO at line 3505 of file fs/xfs/libxfs/xfs_btree.c. Caller xfs_free_ag_extent+0x35d/0x7a0 [xfs]
kernel: CPU: 18 PID: 9896 Comm: mesos-slave Not tainted 4.10.10-1.el7.elrepo.x86_64 #1
kernel: Hardware name: Supermicro PIO-618U-TR4T+-ST031/X10DRU-i+, BIOS 2.0 12/17/2015
kernel: Call Trace:
kernel:  dump_stack+0x63/0x87
kernel:  xfs_error_report+0x3b/0x40 [xfs]
kernel:  ? xfs_free_ag_extent+0x35d/0x7a0 [xfs]
kernel:  xfs_btree_insert+0x1b0/0x1c0 [xfs]
kernel:  xfs_free_ag_extent+0x35d/0x7a0 [xfs]
kernel:  xfs_free_extent+0xbb/0x150 [xfs]
kernel:  xfs_trans_free_extent+0x4f/0x110 [xfs]
kernel:  ? xfs_trans_add_item+0x5d/0x90 [xfs]
kernel:  xfs_extent_free_finish_item+0x26/0x40 [xfs]
kernel:  xfs_defer_finish+0x149/0x410 [xfs]
kernel:  xfs_remove+0x281/0x330 [xfs]
kernel:  xfs_vn_unlink+0x55/0xa0 [xfs]
kernel:  vfs_rmdir+0xb6/0x130
kernel:  do_rmdir+0x1b3/0x1d0
kernel:  SyS_rmdir+0x16/0x20
kernel:  do_syscall_64+0x67/0x180
kernel:  entry_SYSCALL64_slow_path+0x25/0x25
kernel: RIP: 0033:0x7f85d8d92397
kernel: RSP: 002b:00007f85cef9b758 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
kernel: RAX: ffffffffffffffda RBX: 00007f858c00b4c0 RCX: 00007f85d8d92397
kernel: RDX: 00007f858c09ad70 RSI: 0000000000000000 RDI: 00007f858c09ad70
kernel: RBP: 00007f85cef9bc30 R08: 0000000000000001 R09: 0000000000000002
kernel: R10: 0000006f74656c67 R11: 0000000000000246 R12: 00007f85cef9c640
kernel: R13: 00007f85cef9bc50 R14: 00007f85cef9bcc0 R15: 00007f85cef9bc40
kernel: XFS (dm-4): xfs_do_force_shutdown(0x8) called from line 236 of file fs/xfs/libxfs/xfs_defer.c. Return address = 0xffffffffa028f087
kernel: XFS (dm-4): Corruption of in-memory data detected. Shutting down filesystem
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Attempts to unmount and repair the filesystem also fail, but that error output was accidentally lost when the machine got re-installed.

I found this thread https://www.centos.org/forums/viewtopic.php?t=15898#p75290 about someone hitting something similar. It was only similar insofar as it was an XFS_WANT_CORRUPTED_GOTO and he had a ton of allocation groups. So I checked our allocation group count and discovered it to be 1729:

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
$ xfs_info /dev/mapper/rootvg-var_lv
meta-data=/dev/mapper/rootvg-var_lv isize=512    agcount=1729, agsize=163776 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=283115520, imaxpct=25
         =                       sunit=64     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This high agcount is due to the fact that we deploy all of our nodes with a script, and then xfs_growfs the filesystem to the usable amount of space from there (like pretty much every major automated deployment). Since xfs_growfs keeps the agsize chosen at mkfs time (163776 4 KiB blocks here, roughly 640 MiB), growing to the ~1 TiB we actually use leaves us with 1729 allocation groups; see the sketch below.
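To make the pattern concrete, here is a minimal sketch of how an agcount like ours falls out. The VG/LV names, mount point, and sizes are hypothetical stand-ins, not our actual deploy script: mkfs.xfs fixes agsize based on the initial device size, and xfs_growfs then just adds more AGs of that same size.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# Hypothetical scratch LV standing in for our deploy flow.
lvcreate -L 2560M -n scratch_lv rootvg        # small initial device
mkfs.xfs /dev/rootvg/scratch_lv               # defaults typically pick a
                                              # handful of ~640 MiB AGs
mkdir -p /mnt/scratch
mount /dev/rootvg/scratch_lv /mnt/scratch

lvextend -L 1T /dev/rootvg/scratch_lv         # grow the device...
xfs_growfs /mnt/scratch                       # ...growfs keeps the old agsize
xfs_info /mnt/scratch | grep agcount          # and just adds more AGs

# Sanity-checking the real numbers from the xfs_info output above:
# expected agcount = ceil(data blocks / agsize)
echo $(( (283115520 + 163776 - 1) / 163776 ))   # prints 1729
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That arithmetic matches the agcount=1729 we see in the wild, assuming growfs really does hold agsize constant.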
So my questions are:

1. Has the above stack trace been seen before or solved? I could not find any commits to that effect.
2. Could this issue be the result of our high number of allocation groups?
3. What is the best way to deploy XFS when we know we will be immediately growing the filesystem?
4. If this is all due to the high number of allocation groups, shouldn't xfs_growfs at least warn when growing would result in a ridiculous number of allocation groups?

Thank you,
Dave Chiluk