I work on an object storage system, OpenStack Swift, which has always used xfs on its storage nodes. Our systems have seen many and varied disk failures, and occasionally apparent filesystem corruption, over the years, but lately we've been noticing something that might be "new" and I'm considering how to approach the problem. I'd like to solicit critique on my current thinking/process - particularly from xfs experts.

[root@s8k-sjc3-d01-obj-9 ~]# xfs_bmap /srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53
/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53: No data available
[root@s8k-sjc3-d01-obj-9 ~]# xfs_db /srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53
/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53: No data available
fatal error -- couldn't initialize XFS library
[root@s8k-sjc3-d01-obj-9 ~]# ls -alhF /srv/node/d21865/quarantined/objects-1/e53
ls: cannot access /srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53: No data available
total 4.0K
drwxr-xr-x  9 swift swift  318 Jun  7 00:57 ./
drwxr-xr-x 33 swift swift 4.0K Jun 23 16:10 ../
d?????????  ? ?     ?        ?            ? f0418758de4baaa402eb301c5bae3e53/
drwxr-xr-x  2 swift swift   47 May 27 00:43 f04193c31edc9593007471ee5a189e53/
drwxr-xr-x  2 swift swift   47 May 27 00:43 f0419c711a5a5d01dac6154970525e53/
drwxr-xr-x  2 swift swift   47 May 27 00:43 f041a2548b9255493d16ba21c19b6e53/
drwxr-xr-x  2 swift swift   47 Jun  7 00:57 f041aa09d40566d6915a706a22886e53/
drwxr-xr-x  2 swift swift   39 May 27 00:43 f041ac88bf13e5458a049d827e761e53/
drwxr-xr-x  2 swift swift   47 May 27 00:43 f041bfd1c234d44b591c025d459a7e53/
[root@s8k-sjc3-d01-obj-9 ~]# python
Python 2.7.5 (default, Nov 16 2020, 22:23:17)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.stat('/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 61] No data available: '/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53'
>>> os.listdir('/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 61] No data available: '/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53'
>>>
[root@s8k-sjc3-d01-obj-9 ~]# uname -a
Linux s8k-sjc3-d01-obj-9.nsv.sjc3.nvmetal.net 3.10.0-1160.62.1.el7.x86_64 #1 SMP Tue Apr 5 16:57:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
[root@s8k-sjc3-d01-obj-9 ~]# mount | grep /srv/node/d21865
/dev/sdd on /srv/node/d21865 type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,noquota)
[root@s8k-sjc3-d01-obj-9 ~]# xfs_db -r /dev/sdd
xfs_db> version
versionnum [0xbcb5+0x18a] = V5,NLINK,DIRV2,ATTR,ALIGN,LOGV2,EXTFLG,SECTOR,MOREBITS,ATTR2,LAZYSBCOUNT,PROJID32BIT,CRC,FTYPE
xfs_db>

We can't "do" anything with the directory once it starts giving us ENODATA. We don't typically like to unmount the whole filesystem (there's a LOT of *uncorrupted* data on the device), so I'm not 100% sure whether xfs_repair would fix these directories. Swift itself is a replicated/erasure-coded store - we can almost always "throw away" corrupt data on a single node and the rest of the cluster can bring the state back to full durability.
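For completeness, when we can afford to take a device offline the obvious move would be something like the below (xfs_repair wants the filesystem unmounted; a -n dry run first, then the real thing). I just haven't confirmed yet that repair actually clears these un-stat-able entries, and it's a big hammer for one bad directory on an otherwise healthy multi-TB disk:

umount /srv/node/d21865
xfs_repair -n /dev/sdd              # no-modify mode: report what repair thinks is wrong
xfs_repair /dev/sdd                 # actual repair
mount /dev/sdd /srv/node/d21865     # or just `mount /srv/node/d21865` via fstab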
This particular failure is worrisome for two reasons:

1) We can't "just" delete the affected directory - because we can't stat/move it - so we have to throw away the whole *parent* directory (a HUGE blast radius in some cases).

2) In 10 years of running Swift I've never seen exactly this, and now it seems to be happening more and more often - but we don't know whether the trigger is a new software version, a new hardware revision, or a new access pattern.

I'd also like to be able to "simulate" this kind of corruption on a healthy filesystem so we can test our "quarantine/auditor" code, which tries to move these filesystem problems out of the way of the consistency engine. Does anyone have a guess how I could MAKE an xfs filesystem produce this kind of behavior on purpose?

--
Clay Gerrard
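P.S. To give a flavor of the reproduction harness I'm imagining - this is only a sketch on a scratch loopback image, and the big unknown is which on-disk field to scribble on so that lookups fail with ENODATA specifically (rather than the usual "Structure needs cleaning"), which is really what I'm asking:

# scratch filesystem - nothing real at risk
truncate -s 1G /tmp/scratch.img
mkfs.xfs -f /tmp/scratch.img
mkdir -p /mnt/scratch
mount -o loop /tmp/scratch.img /mnt/scratch

# lay out a victim directory shaped roughly like a swift suffix dir
mkdir -p /mnt/scratch/objects-1/123/e53/f0418758de4baaa402eb301c5bae3e53
touch /mnt/scratch/objects-1/123/e53/f0418758de4baaa402eb301c5bae3e53/0000000000.00000.data
ino=$(ls -di /mnt/scratch/objects-1/123/e53/f0418758de4baaa402eb301c5bae3e53 | awk '{print $1}')

umount /mnt/scratch

# scribble on the victim inode in xfs_db expert mode; core.format=5 is just a
# guess at an invalid value (valid formats are 0-4), and on a v5/CRC filesystem
# the write command may need extra options to let invalid values through
xfs_db -x -f /tmp/scratch.img -c "inode $ino" -c "print" -c "write core.format 5"

mount -o loop /tmp/scratch.img /mnt/scratch
ls -alhF /mnt/scratch/objects-1/123/e53/    # hoping for the same d????????? entry

I know xfs_db also has a blocktrash command that might be a blunter way to get there, but I'd prefer something deterministic enough to drop into a test harness for the auditor.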