Re: XFS File system in trouble

On 7/28/2015 7:33 AM, Brian Foster wrote:
On Tue, Jul 28, 2015 at 02:46:45AM -0500, Leslie Rhorer wrote:
On 7/20/2015 6:17 AM, Brian Foster wrote:
On Sat, Jul 18, 2015 at 08:02:50PM -0500, Leslie Rhorer wrote:

	I found the problem with md5sum (and probably nfs, as well).  One of the
memory modules in the server was bad.  The problem with XFS persists.  Every
time tar tried to create the directory:

/RAID/Server-Main/Equipment/Drive Controllers/HighPoint Adapters/Rocket 2722/Driver/RR276x/Driver/Linux/openSUSE/rr276x-suse-11.2-i386/linux/suse/i386-11.1

	It would begin spitting out errors, starting with "Cannot mkdir: Structure
needs cleaning".  At that point, XFS had shut down.  I went into
/RAID/Server-Main/Equipment/Drive Controllers/HighPoint Adapters/Rocket
2722/Driver/RR276x/Driver/Linux/openSUSE/rr276x-suse-11.2-i386/linux/suse/
and created the i386-11.1 directory by hand, and tar no longer starts
spitting out errors at that point, but it does start up again at
RR2782/Windows/Vista-Win2008-Win7-legacy_single/x64.


So is this untar problem a reliable reproducer? If so, here's what I

	The processes I was running this weekend ran longer than expected, and in
fact were still running just a couple of hours ago.  I was doing an rsync
with CRC check from the backup system to the one with the problem.  There
were a few corrupt files, but not a huge number.  Although slower than I
hoped, everything was running fine until a short time ago, when rsync
encountered the very same issue I keep having with tar, which is to say it
tried to create a directory and the file system crashed with precisely the
same symptoms as when tar was failing.

would try to hopefully isolate a filesystem problem from something
underneath:

xfs_metadump -go /dev/md0 /somewhere/on/rootfs/md0.metadump
xfs_mdrestore -g /somewhere/on/rootfs/md0.metadump /.../fileonrootfs.img
mount /.../fileonrootfs.img /mnt/
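
If the restore and mount both work, it might also be worth running xfs_repair
in no-modify mode against the image first, just to capture what it thinks is
wrong without touching anything:

xfs_repair -n /.../fileonrootfs.img    # -n: report problems only, no modifications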

	I tried to do the xfs_mdrestore to the root file system, but it fails:

RAID-Server:/TEST# xfs_mdrestore -g md0.metadump RAIDfile.img
xfs_mdrestore: cannot set filesystem image size: File too large


Hmm, I guess the file size exceeds the capabilities of the root fs, even
if there might ultimately be enough space to restore the metadump.

I wouldn't think so, at least not fundamentally. It's ext4. It's certainly not big enough to hold an 18T file system, though, and perhaps that is what xfs_mdrestore is checking.
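
A quick way to tell whether it's the per-file size ceiling rather than free
space might be to try creating a sparse file of the same apparent size there,
e.g. (the path is arbitrary):

truncate -s 18T /sizetest.img    # sets apparent size only; allocates no blocks
rm /sizetest.img

If truncate fails with the same "File too large", it's ext4's maximum file
size (16TiB with 4K blocks) and not anything xfs_mdrestore is doing wrong.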


	So then I did the same thing to a directory on an NFS mount from another
machine.  That worked.  I then went to the other machine, mounted the image
on /media, copied the tarball to the location on the mount where it resides
on the real array, and ran the tar job.  It completed without errors.


That's interesting. It tells us the fs apparently isn't fundamentally
broken, but the separate machine potentially introduces a different
kernel. Is that the case here? What else is different between these
systems?

Not much. Both are running kernel 3.0.16-4. Both have 24T mdadm RAID arrays with similar properties (there may be some differences in chunk size, etc.). Right now both have the same motherboard and the same drive controllers. All 16 drives reside in a single RAID chassis.

	I then created the image on the array where the tasks are failing and
attempted to mount it to /media on the problematic machine.  That fails
with:

RAID-Server:/TEST# mount /RAID/TEST/RAIDfile.img /media/
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
        missing codepage or helper program, or other error

        In some cases useful info is found in syslog - try
        dmesg | tail or so.

	The problem is this (from syslog):
Jul 28 01:53:48 RAID-Server kernel: [431155.847523] loop: module loaded
Jul 28 01:53:48 RAID-Server kernel: [431155.927238] XFS (loop0): Filesystem
has duplicate UUID 228cfaa7-ae6b-44fc-b703-1c32385231c0 - can't mount
Jul 28 01:55:51 RAID-Server kernel: [431278.916490] XFS (loop0): Filesystem
has duplicate UUID 228cfaa7-ae6b-44fc-b703-1c32385231c0 - can't mount

	Presumably it has the same UUID as the RAID array because it is expected to
do so.  I can't mount it unless I umount the RAID array, but if I do that, I
can't get to the file to mount the dump image, since it is on the array.


Ok, somebody already replied with how to get around this. That said, it
sounds like you've restored the metadump to an image file on the
problematic fs.
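
Since it came up: the workaround is XFS's nouuid mount option, which skips the
duplicate UUID check. Something along the lines of:

mount -o loop,nouuid /RAID/TEST/RAIDfile.img /media/

should let the image mount alongside the live filesystem.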

I had no other option. I suppose I could attach an external drive and restore to it. I'll try that tonight, but if xfs_mdrestore refuses to write to a volume whose raw storage capacity is less than the putative size of the original image, then that is also likely to fail. I don't have a way to create another 24T storage system at hand.
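
If I do go the external-drive route, formatting it as XFS instead of ext4
should presumably sidestep the per-file size ceiling, since the restored image
is sparse and XFS allows much larger files. Roughly (the device name here is
hypothetical):

mkfs.xfs /dev/sdX1                 # format the external drive as XFS
mount /dev/sdX1 /mnt/external
xfs_mdrestore -g md0.metadump /mnt/external/RAIDfile.img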

I'm not sure how useful a test that is since we're
testing on the same hardware. I suppose it could be interesting if the
storage hardware is similar with the alternate machine referenced above.

Almost identical. The alternate machine serves as a backup system with the data on the arrays synchronized by rsync every morning. The ailing system runs more services than the backup, and the backup runs a couple the primary does not, but otherwise they are nearly mirrors of each other. The hardware is identical.

For example, if you restore here and the test does not fail, the test on
the separate machine is probably less informative.

	I then copied both the tarball and the image over to the root, and while
the system would not let me create the image on the root, it did let me copy
the image to the root.  I then umounted the RAID array, mounted the image,
and attempted to cd to the original directory in the image mount where the
tarball was saved.  That failed with an I/O error:


It sounds a bit strange for the mdrestore to fail on root but a cp of
the resulting image to work. Do the resulting images have the same file
size or is the rootfs copy truncated? If the latter, you could be
missing part of the fs and thus any of the following tests are probably
moot.

Well, it can't actually be as large as reported, let's put it that way, although both copies report the same file size. ls claims it is 16T, which cannot be the case on a 100G partition. I forgot to mention that cp does complain:

RAID-Server:/# cp /RAID/TEST/RAIDfile.img ./
cp: cannot lseek ‘./RAIDfile.img’: Invalid argument

But it does the same thing on the backup server, and it works there. I tried a cmp, and it seems to be hung. It just may be taking a long time, however.
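
For what it's worth, the apparent size and the blocks actually allocated can
be compared directly, and a sparse-aware copy should keep the holes intact
(these are standard coreutils flags):

ls -lhs /RAID/TEST/RAIDfile.img                 # first column: blocks actually allocated
du -h --apparent-size /RAID/TEST/RAIDfile.img   # vs. the apparent 16T
cp --sparse=always /RAID/TEST/RAIDfile.img ./   # re-create holes at the destination

As for cmp, it has to read through the full 16T of apparent size, holes and
all, so it may well just be slow rather than hung.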

Brian

RAID-Server:/# cd "/media/Server-Main/Equipment/Drive Controllers/HighPoint
Adapters/Rocket 2722/Driver/"
bash: cd: /media/Server-Main/Equipment/Drive Controllers/HighPoint
Adapters/Rocket 2722/Driver/: Input/output error

	I changed directories to a point two directories above the previous attempt
and did a long listing:

RAID-Server:/# cd "/media/Server-Main/Equipment/Drive Controllers/HighPoint
Adapters"
RAID-Server:/media/Server-Main/Equipment/Drive Controllers/HighPoint
Adapters# ll
ls: cannot access RocketRAID 2722: Input/output error
total 4
drwxr-xr-x 6 root lrhorer 4096 Jul 18 19:26 Rocket 2722
?????????? ? ?    ?          ?            ? RocketRAID 2722

	As you can see, Rocket 2722 is still there, but RocketRAID 2722 is very
sick.  Rocket 2722 is the parent of where the tarball was, however, so I did
a cd and an ll again:

RAID-Server:/media/Server-Main/Equipment/Drive Controllers/HighPoint
Adapters# cd "Rocket 2722"/
RAID-Server:/media/Server-Main/Equipment/Drive Controllers/HighPoint
Adapters/Rocket 2722# ll
ls: cannot access BIOS: Input/output error
ls: cannot access Driver: Input/output error
ls: cannot access HighPoint RAID Management Software: Input/output error
ls: cannot access Manual: Input/output error
total 248
-rwxr--r-- 1 root lrhorer 245760 Nov 20  2008 autorun.exe
-rwxr--r-- 1 root lrhorer     51 Mar 21  2001 autorun.inf
?????????? ? ?    ?            ?            ? BIOS
?????????? ? ?    ?            ?            ? Driver
?????????? ? ?    ?            ?            ? HighPoint RAID Management
Software
?????????? ? ?    ?            ?            ? Manual
-rwxr--r-- 1 root lrhorer   1134 Feb  5  2012 readme.txt

	So now what?

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs

