On 10/8/18 9:03 AM, Arkadiusz Miśkiewicz wrote:
>
> Big fs, ton of small files, repair takes 36h until this happens:
>
> rebuilding directory inode 30363993060
> rebuilding directory inode 30398868604
> rebuilding directory inode 30414474627
> rebuilding directory inode 30425006954
> rebuilding directory inode 30447937553
> rebuilding directory inode 30529556616
> rebuilding directory inode 30537494728
> rebuilding directory inode 30569826838
> rebuilding directory inode 31060721895
> Metadata corruption detected at 0x41f9db, inode 0x73b5d00e7 data fork
> xfs_repair: warning - iflush_int failed (-117)
> Warning: recursive buffer locking at block 31060721776 detected
> Metadata corruption detected at 0x41f9db, inode 0x73b5d00e7 data fork
> xfs_repair: warning - iflush_int failed (-117)
> Warning: recursive buffer locking at block 31060721776 detected
> Metadata corruption detected at 0x41f980, inode 0x73b5d00e7 data fork
> xfs_repair: warning - iflush_int failed (-117)
> realloc(): invalid next size
> Aborted
>
>
> Fails somewhere in 0x41f9db <xfs_dir2_sf_verify+603>
>
> Complete log at
> https://ixion.pld-linux.org/~arekm/xfs-1/repair.txt
>
> Test was done with xfs_repair 4.17.0 and 4.18.0 with the same result.
>
> kernel 4.18.5
>
> Running under gdb now.
>
> Any ideas?

With such a big fs it's tough to share a metadump for a reproducer, I assume.

The earlier write verifiers failing for xfs_repair writes are troubling...

I'm not certain why it's rebuilding so many dir inodes; there are several cases where that happens, but unfortunately repair doesn't always say which one or why.

Anyway, you eventually get to this inode (it's the same in decimal & hex below):

rebuilding directory inode 360732305
Metadata corruption detected at 0x41f9db, inode 0x15805691 data fork
xfs_repair: warning - iflush_int failed (-117)

with lots of corruption during the writes, and this happens for a couple other inodes, until finally:

rebuilding directory inode 31060721895
Metadata corruption detected at 0x41f9db, inode 0x73b5d00e7 data fork

and this one ends up aborting in glibc's realloc():

realloc(): invalid next size

I /think/ that this indicates that memory has been corrupted during the repair run. :/

Running under valgrind would probably lead to a 72hr runtime or more :)

I wonder if it would save time in the long run to make a metadump and remove all directory trees other than this inode (360732305) from it, and see if the same failure occurs when running on the reduced fs image?

Out of curiosity, what happened to this filesystem to leave it in bad shape?

-Eric
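
P.S. Since repair prints the inode in decimal on the "rebuilding directory inode" lines and in hex on the "Metadata corruption detected" lines, matching them up by eye is tedious. Here is a minimal sketch for pairing them; the helper name `corrupt_rebuilds` is made up for illustration, the sample lines are copied from the excerpts above, and it has not been run against the full repair.txt, so it assumes the rest of the log follows the same format.

```python
import re

# Sample lines copied from the repair output quoted in this thread.
LOG = """\
rebuilding directory inode 360732305
Metadata corruption detected at 0x41f9db, inode 0x15805691 data fork
xfs_repair: warning - iflush_int failed (-117)
rebuilding directory inode 31060721895
Metadata corruption detected at 0x41f9db, inode 0x73b5d00e7 data fork
"""

REBUILD = re.compile(r"rebuilding directory inode (\d+)")
CORRUPT = re.compile(r"Metadata corruption detected at 0x[0-9a-f]+, inode (0x[0-9a-f]+)")


def corrupt_rebuilds(log):
    """Return (in order, without repeats) the decimal inode numbers whose
    rebuild line was followed by a corruption report for the same inode."""
    hits = []
    current = None  # inode from the most recent "rebuilding" line
    for line in log.splitlines():
        m = REBUILD.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = CORRUPT.search(line)
        if m and current is not None:
            # The corruption line reports the inode in hex; compare as integers.
            if int(m.group(1), 16) == current and current not in hits:
                hits.append(current)
    return hits


if __name__ == "__main__":
    for ino in corrupt_rebuilds(LOG):
        print(ino, hex(ino))
```

On the sample above this lists 360732305 and 31060721895, the two inodes discussed in this mail; a list like that from the full log might help decide which directory trees to keep in a reduced image.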