On Thu, Jun 23, 2022 at 02:52:22PM -0500, Clay Gerrard wrote:
> I work on an object storage system, OpenStack Swift, that has always
> used xfs on the storage nodes. Our system has encountered many
> various disk failures and occasionally apparent file system corruption
> over the years, but we've been noticing something lately that might be
> "new" and I'm considering how to approach the problem. I'm interested
> to solicit critique on my current thinking/process - particularly from
> xfs experts.
>
> [root@s8k-sjc3-d01-obj-9 ~]# xfs_bmap
> /srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53
> /srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53:
> No data available
> [root@s8k-sjc3-d01-obj-9 ~]# xfs_db
> /srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53
> /srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53:
> No data available

ENODATA implies that it's trying to access an xattr that doesn't
exist.

> fatal error -- couldn't initialize XFS library
> [root@s8k-sjc3-d01-obj-9 ~]# ls -alhF /srv/node/d21865/quarantined/objects-1/e53
> ls: cannot access
> /srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53:
> No data available
> total 4.0K
> drwxr-xr-x 9 swift swift 318 Jun 7 00:57 ./
> drwxr-xr-x 33 swift swift 4.0K Jun 23 16:10 ../
> d????????? ? ? ? ? ? f0418758de4baaa402eb301c5bae3e53/

That's the typical ls output when it couldn't stat() an inode.
This typically occurs when the inode has been corrupted. On XFS, at
least, this should result in a corruption warning in the kernel log.
Did you check dmesg for errors?

> drwxr-xr-x 2 swift swift 47 May 27 00:43 f04193c31edc9593007471ee5a189e53/
> drwxr-xr-x 2 swift swift 47 May 27 00:43 f0419c711a5a5d01dac6154970525e53/
> drwxr-xr-x 2 swift swift 47 May 27 00:43 f041a2548b9255493d16ba21c19b6e53/
> drwxr-xr-x 2 swift swift 47 Jun 7 00:57 f041aa09d40566d6915a706a22886e53/
> drwxr-xr-x 2 swift swift 39 May 27 00:43 f041ac88bf13e5458a049d827e761e53/
> drwxr-xr-x 2 swift swift 47 May 27 00:43 f041bfd1c234d44b591c025d459a7e53/
> [root@s8k-sjc3-d01-obj-9 ~]# python
> Python 2.7.5 (default, Nov 16 2020, 22:23:17)
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.stat('/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> OSError: [Errno 61] No data available:
> '/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53'
> >>> os.listdir('/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> OSError: [Errno 61] No data available:
> '/srv/node/d21865/quarantined/objects-1/e53/f0418758de4baaa402eb301c5bae3e53'
> >>>

Use strace, not the python debugger, to find what syscall returned
the error.

> [root@s8k-sjc3-d01-obj-9 ~]# uname -a
> Linux s8k-sjc3-d01-obj-9.nsv.sjc3.nvmetal.net
> 3.10.0-1160.62.1.el7.x86_64 #1 SMP Tue Apr 5 16:57:59 UTC 2022 x86_64
> x86_64 x86_64 GNU/Linux

That's a RHEL7 kernel. Upstream developers really can't help you
diagnose random weird problems with these kernels - they are
completely custom kernels and so only the vendor can really help you
with diagnosing the root cause of problems such as this. You should
talk to your RH support contact.
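
Since the transcript above already probes the directory from Python,
here is a minimal sketch of how an auditor might record which
operation fails and with which errno name. It complements strace
rather than replacing it: it reports the errno that the Python call
surfaced, not the underlying syscall. The name probe_path is
hypothetical and not part of Swift.

```python
import errno
import os


def probe_path(path):
    """Run os.stat() and os.listdir() on path.

    Returns a list of (operation, errno_name) tuples, where errno_name
    is None if the call succeeded.
    """
    results = []
    for name, func in (("stat", os.stat), ("listdir", os.listdir)):
        try:
            func(path)
            results.append((name, None))
        except OSError as exc:
            # Map the numeric errno (e.g. 61 on Linux) to its symbolic name.
            results.append((name, errno.errorcode.get(exc.errno, str(exc.errno))))
    return results
```

On a healthy directory both entries come back as None. Going by the
transcript (Errno 61 on Linux), the damaged directory would presumably
report ENODATA for both calls, but that is an inference from the
output shown, not something tested here.
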
> I'd also like to be able to "simulate" this kind of corruption on a
> healthy filesystem so we can test our "quarantine/auditor" code that's
> trying to move these filesystem problems out of the way for the
> consistency engine. Does anyone have any guess how I could MAKE an
> xfs filesystem produce this kind of behavior on purpose?

Use xfs_db to corrupt a directory inode, then try to read it.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx