On Monday 30 of June 2014, Dave Chinner wrote:
> [Compendium reply to all 3 emails]
>
> On Sat, Jun 28, 2014 at 01:41:54AM +0200, Arkadiusz Miśkiewicz wrote:
> > Hello.
> >
> > I have a fs (metadump of it:
> > http://ixion.pld-linux.org/~arekm/p2/x1/web2-home.metadump.gz)
> > that xfs_repair 3.2.0 is unable to fix properly.
> >
> > Running xfs_repair a few times shows the same errors repeating:
> > http://ixion.pld-linux.org/~arekm/p2/x1/repair2.txt
> > http://ixion.pld-linux.org/~arekm/p2/x1/repair3.txt
> > http://ixion.pld-linux.org/~arekm/p2/x1/repair4.txt
> > http://ixion.pld-linux.org/~arekm/p2/x1/repair5.txt
> >
> > (repair1.txt also exists - it was the initial, very big/long repair)
> >
> > Note that the fs mounts fine (and was mounting fine before and after
> > repair) but xfs_repair indicates that not everything got fixed.
> >
> > Unfortunately there looks to be a problem with the metadump image.
> > xfs_repair is able to finish fixing on a restored image but is not able
> > to (see repairX.txt above) on the real device. Huh?
> >
> > Examples of problems repeating each time xfs_repair is run:
> >
> > 1)
> > reset bad sb for ag 5
> >
> > non-null group quota inode field in superblock 7
>
> OK, so this is indicative of something screwed up a long time ago.
> Firstly, the primary superblock shows:
>
>     uquotino = 4077961
>     gquotino = 0
>     qflags = 0
>
> i.e. user quota @ inode 4077961, no group quota. The secondary
> superblocks that are being warned about show:
>
>     uquotino = 0
>     gquotino = 4077962
>     qflags = 0
>
> Which is clearly wrong. They should have been overwritten during the
> growfs operation to match the primary superblock.
>
> The similarity in inode number leads me to believe at some point
> both user and group/project quotas were enabled on this filesystem,

Both user and project quotas were enabled on this fs for the last few years.

> but right now only user quotas are enabled. It's only AGs 1-15 that
> show this, so this seems to me that it is likely that this
> filesystem was originally only 16 AGs and it's been grown many times
> since?

The quotas were running fine until some repair run (i.e. mounting with quota
succeeded before and after the first repair) - some later xfs_repair run
broke this.

> Oh, this all occurred because you had a growfs operation on 3.10
> fail because of garbage in the sb of AG 16 (i.e. this from IRC:
> http://sprunge.us/UJFE)? IOWs, this commit:
>
> 9802182 xfs: verify superblocks as they are read from disk
>
> tripped up on sb 16. That means sb 16 was not modified by the
> growfs operation, and so should have the pre-growfs information in
> it:
>
>     uquotino = 4077961
>     gquotino = 4077962
>     qflags = 0x77
>
> Yeah, that's what I thought - the previous grow operation had both
> quotas enabled. OK, that explains why the growfs operation had
> issues, but it doesn't explain exactly how the quota inodes got
> screwed up like that.

The fs had working quota when it already had a 3-digit number of AGs. I
wouldn't blame the growfs failure for the quota breakage. IMO some repair
run broke this (or tried to fix it and broke it).

> Anyway, the growfs issues were solved by:
>
> 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields
>
> which landed in 3.13.

Ok.
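For the record, this is roughly how I look at those fields myself on an
image restored with xfs_mdrestore (the image path below is just a
placeholder, and the AG list is only an example):

# Read-only xfs_db session against a restored metadump image.
# "sb N" selects the superblock of AG N, "print" shows the named fields.
for ag in 0 1 5 7 15 16; do
        echo "=== AG $ag ==="
        xfs_db -r -c "sb $ag" -c "print uquotino gquotino qflags" /path/to/restored.img
done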
> > 2)
> > correcting nblocks for inode 965195858, was 19 - counted 20
> > correcting nextents for inode 965195858, was 16 - counted 17
>
> Which is preceded by:
>
> data fork in ino 965195858 claims free block 60323539
> data fork in ino 965195858 claims free block 60323532
>
> and when combined with the later:
>
> entry "dsc0945153ac18d4d4f1a-150x150.jpg" (ino 967349800) in dir 965195858
> is a duplicate name, marking entry to be junked
>
> errors from that directory, it looks like the space was freed but
> the directory btree not correctly updated. No idea what might have
> caused that, but it is a classic symptom of volatile write caches...
>
> Hmmm, and when it goes to junk them in my local testing:
>
> rebuilding directory inode 965195858
> name create failed in ino 965195858 (117), filesystem may be out of space
>
> Which is an EFSCORRUPTED error trying to rebuild that directory.
> The second error pass did not throw an error, but it did not fix
> the errors as a 3rd pass still reported this. I'll look into why.
>
> > 3) clearing some entries; moving to lost+found (the same files)
> >
> > 4)
> > Phase 7 - verify and correct link counts...
> > Invalid inode number 0xfeffffffffffffff
> > xfs_dir_ino_validate: XFS_ERROR_REPORT
> > Metadata corruption detected at block 0x11fbb698/0x1000
> > libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
> > Invalid inode number 0xfeffffffffffffff
> > xfs_dir_ino_validate: XFS_ERROR_REPORT
> > Metadata corruption detected at block 0x11fbb698/0x1000
> > libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
> > done
>
> Not sure what that is yet, but it looks like writing a directory
> block found entries with invalid inode numbers in it. i.e. it's
> telling me that there's something not been fixed up.
>
> I'm actually seeing this in phase 4:
>
> - agno = 148
> Invalid inode number 0xfeffffffffffffff
> xfs_dir_ino_validate: XFS_ERROR_REPORT
> Metadata corruption detected at block 0x11fbb698/0x1000
> libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
>
> Second time around, this does not happen, so the error has been
> corrected in a later phase of the first pass.

Here, on two runs, I got exactly the same report:

Phase 7 - verify and correct link counts...
Invalid inode number 0xfeffffffffffffff
xfs_dir_ino_validate: XFS_ERROR_REPORT
Metadata corruption detected at block 0x11fbb698/0x1000
libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
Invalid inode number 0xfeffffffffffffff
xfs_dir_ino_validate: XFS_ERROR_REPORT
Metadata corruption detected at block 0x11fbb698/0x1000
libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000

but there were more errors like this earlier, so repair fixed some and was
left with these two.
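In case it is useful, this is roughly how I poke at that directory inode
from (2) and at the block from (4) on the restored image (the image path is
a placeholder, the numbers are the ones from the repair output, and I'm
assuming 0x11fbb698 is a 512-byte daddr, as libxfs buffer numbers usually
are):

# read-only look at the directory inode that keeps getting "corrected"
xfs_db -r -c "inode 965195858" \
        -c "print core.format core.nblocks core.nextents" /path/to/restored.img

# raw hex dump of the 4k block that fails the write verifier,
# to eyeball the bogus 0xfeffffffffffffff inode numbers
dd if=/path/to/restored.img bs=512 skip=$((0x11fbb698)) count=8 2>/dev/null | hexdump -C | head -n 40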
> > 5) Metadata CRC error detected at block 0x0/0x200
> > but it is not a CRC enabled fs
>
> That's typically caused by junk in the superblock beyond the end
> of the v4 superblock structure. It should be followed by "zeroing
> junk ..."

Shouldn't repair fix the superblocks when it notices a v4 fs? I mean, 3.2.0
repair reports:

$ xfs_repair -v ./1t-image
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
        - block cache size set to 748144 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 2 tail block 2
        - scan filesystem freespace and inode maps...
Metadata CRC error detected at block 0x0/0x200
zeroing unused portion of primary superblock (AG #0)
        - 07:20:11: scanning filesystem freespace - 391 of 391 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 07:20:11: scanning agi unlinked lists - 391 of 391 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0
[...]

but if I run 3.1.11 after running 3.2.0 then the superblocks do get fixed:

$ ./xfsprogs/repair/xfs_repair -v ./1t-image
Phase 1 - find and verify superblock...
        - block cache size set to 748144 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 2 tail block 2
        - scan filesystem freespace and inode maps...
zeroing unused portion of primary superblock (AG #0)
zeroing unused portion of secondary superblock (AG #3)
zeroing unused portion of secondary superblock (AG #1)
zeroing unused portion of secondary superblock (AG #8)
zeroing unused portion of secondary superblock (AG #2)
zeroing unused portion of secondary superblock (AG #5)
zeroing unused portion of secondary superblock (AG #6)
zeroing unused portion of secondary superblock (AG #20)
zeroing unused portion of secondary superblock (AG #9)
zeroing unused portion of secondary superblock (AG #7)
zeroing unused portion of secondary superblock (AG #12)
zeroing unused portion of secondary superblock (AG #10)
zeroing unused portion of secondary superblock (AG #13)
zeroing unused portion of secondary superblock (AG #14)
[...]
zeroing unused portion of secondary superblock (AG #388)
zeroing unused portion of secondary superblock (AG #363)
        - found root inode chunk
Phase 3 - for each AG...

Shouldn't these be zeroed as "unused" by 3.2.0, too (since it is a v4 fs)?

> > Made an xfs metadump without file obfuscation and I'm able to reproduce
> > the problem reliably on the image (if some xfs developer wants the
> > metadump image then please mail me - I don't want to put it up for
> > everyone for obvious reasons).
> >
> > So there is an additional bug in xfs_metadump, where file obfuscation
> > "fixes" some issues. Does it obfuscate but keep invalid conditions (like
> > keeping "/" in a file name)? I guess it is not doing that.
>
> I doubt it handles a "/" in a file name properly - that's rather
> illegal, and the obfuscation code probably doesn't handle it at all.

It would be nice to keep these bad conditions. The obfuscated metadump
behaves differently from the non-obfuscated one with xfs_repair here (fewer
issues with the obfuscated image than with the non-obfuscated one), so
obfuscation simply hides problems. I assume that you are doing the testing
on the non-obfuscated dump I gave you on IRC?

> FWIW, xfs_repair will trash those files anyway:
>
> entry at block 22 offset 560 in directory inode 419558142 has illegal name
> "/_198.jpg": clearing entry
>
> So regardless of whether metadump handles them or not, it is not going to
> change the fact that filenames with "/" in them are broken....
>
> But the real question here is how did you get "/" characters in
> filenames?

No idea. It could have gotten corrupted many months/years ago. This fs had
not seen a repair for a very long time (since there were no visible issues
with it).
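Going back to the obfuscation point: this is more or less how I compare the
obfuscated and the non-obfuscated dump here (device and file names are
placeholders, and the fs is unmounted while the dumps are taken):

xfs_metadump /dev/sdXN obfuscated.md      # default run, file names get obfuscated
xfs_metadump -o /dev/sdXN plain.md        # -o keeps the original (possibly broken) names

xfs_mdrestore obfuscated.md obfuscated.img
xfs_mdrestore plain.md plain.img

# no-modify mode, just report what repair would want to do
xfs_repair -n obfuscated.img > repair-obfuscated.txt 2>&1
xfs_repair -n plain.img > repair-plain.txt 2>&1
diff -u repair-obfuscated.txt repair-plain.txt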
> > [3571367.717167] XFS (loop0): Mounting Filesystem
> > [3571367.883958] XFS (loop0): Ending clean mount
> > [3571367.900733] XFS (loop0): Failed to initialize disk quotas.
> >
> > Files are accessible etc. Just no quota. Unfortunately there is no
> > information on why initialization failed.
>
> I can't tell why that's happening yet. I'm not sure what the correct
> state is supposed to be yet (mount options will tell me)

noatime,nodiratime,nodev,nosuid,usrquota,prjquota

> so I'm not
> sure what went wrong. As it is, you probably should be upgrading to
> a more recent kernel....

I can try to mount the metadump image on a newer kernel - I will check and
report back.

> > So xfs_repair wasn't able to fix that either.
>
> xfs_repair isn't detecting there is a problem because the uquotino
> is not corrupt and the qflags is zero. Hence it doesn't do anything.
>
> More as I find it.
>
> Cheers,
>
> Dave.

-- 
Arkadiusz Miśkiewicz, arekm / maven.pl

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs