Re: filesystem shrinks after using xfs_repair

Eli Morris put forth on 7/11/2010 1:32 AM:
>> Eli Morris put forth on 7/9/2010 6:07 PM:
>>> Hi All,
>>>
>>> I've got this problem where if I run xfs_repair, my filesystem shrinks by 11 TB, from a volume size of 62 TB to 51 TB. I can grow the filesystem again with xfs_growfs, but then rerunning xfs_repair shrinks it back down again. The first time this happened was a few days ago, and running xfs_repair took about 7 TB of data with it. That is, out of the 11 TB of disk space that vanished, 7 TB had data on it, and 4 TB was empty space. XFS is running on top of an LVM volume. It's on an Intel/Linux system running CentOS 5 (2.6.18-128.1.14.el5). Does anyone have an idea what would cause such a thing, and what I might try to keep it from happening again? I could just never run xfs_repair again, but that doesn't seem like a good thing to count on. Major bonus points if anyone also has ideas on how to get my 7 TB of data back. It must be there somewhere and it would be very bad to lose.
>>>
>>> thanks for any help and ideas. I'm just stumped right now.
>>
>> It may be helpful if you can provide more history (how long has this been
>> happening, recent upgrade?), the exact xfs_repair command line used, why you
>> were running xfs_repair in the first place, hardware or software RAID, what
>> xfsprogs version, relevant log snippets, etc.
> 
> Hi Stan,
> 
> Thanks for responding. Sure, I'll try and give more information.
> 
> I got some automated emails this Sunday about I/O errors coming from the computer (a Dell PowerEdge 2950, hostname Nimbus, connected via an LSI Fusion SAS card to a 16-bay hardware RAID enclosure, which in turn has four 16-bay JBODs attached). It was Sunday, so I just logged in, rebooted, ran xfs_repair, then mounted the filesystem again. I did a quick write test to make sure I could write a file to it and read it back, and called it a day. When I came into work the next day, I looked at the volume more closely and noticed that the filesystem had shrunk as I described. Each of the RAID/JBOD enclosures is configured as a separate device and represents one physical volume in my LVM2 scheme; those physical volumes are combined into one logical volume, and the filesystem sits on top of that.
> 
> On one of the physical volumes (PVs), /dev/sdc1, pvdisplay showed that of the 12.75 TB comprising the volume, 12.00 TB was listed as 'not usable'. Usually this number is a couple of megabytes. So, after staring at this for a while, I ran pvresize on that PV. The volume then listed 12.75 TB as usable, with a couple of megabytes not usable, as one would expect. I then ran xfs_growfs on the filesystem and it was once again back to 62 TB. But it showed the recovered space as free space, instead of only 4.x TB free as before all this happened. I then ran xfs_repair again, thinking it might find the missing data. Instead the filesystem shrank back to 51 TB. I rebooted and tried again a couple of times, and the same thing happened. I'd really, really like to get that data back somehow, and also to get the filesystem to where we can start using it again.
> 
> xfsprogs is version 2.9.4. The xfs_repair command line used was 'xfs_repair /dev/vg1/vol5', vol5 being the LVM2 logical volume. I spoke with tech support from my RAID vendor, and he said he did not see any sign of errors with the RAID itself, for what that is worth.
> 
> Nimbus is the hostname of the computer that is connected to the RAID/JBODs unit. The other computers (compute-0-XX) are only connected via NFS to the RAID/JBODs.
> 
> I've tried to provide a lot here,  but if I can provide any more information, please let me know. Thanks very much,
> 
> Eli
> 
> I'm trying to post logs, but my emails keep getting bounced. I'll see if this one makes it.

We just need the snippets relating to the problem at hand, not an entire
syslog file.  I'm guessing you attempted to attach the entire log, which is
likely what caused the rejection by the list server.

You said you received an email alert when the first errors occurred.
Correlate the time stamp in that alert message to the syslog lines relating to
the LSI controller, LVM, XFS, etc.  Also grab the log entries for each
xfs_repair run you performed, along with those for the xfs_growfs operation.
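
Something along these lines should keep the post manageable (a rough sketch,
assuming the default CentOS 5 log location of /var/log/messages and that the
LSI Fusion card uses one of the mpt* drivers; adjust the date string and the
patterns to match your alert and hardware):

    # DATE must match the syslog timestamp format from the alert email,
    # e.g. "Jul  4" (note the two spaces before a single-digit day)
    DATE="Jul  4"
    grep "$DATE" /var/log/messages | \
        grep -iE 'mpt|scsi|sd[b-z]|xfs|lvm|device-mapper' > incident-snippets.txt

    wc -l incident-snippets.txt    # sanity-check the size before posting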

Log errors and related information are always critical to solving problems
like this.
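
If you end up running xfs_growfs or xfs_repair again, it would also help to
record the LVM and XFS geometry immediately before and after each run, so the
shrink can be pinned down.  A minimal sketch, using the device names from your
message and a hypothetical mount point /vol5:

    # while the filesystem is mounted:
    df -h /vol5              >  sizes-before.txt
    xfs_info /vol5           >> sizes-before.txt

    # LVM's view of the affected PV and the LV:
    pvdisplay /dev/sdc1      >> sizes-before.txt
    lvdisplay /dev/vg1/vol5  >> sizes-before.txt

    # ...unmount, run xfs_repair /dev/vg1/vol5, remount, repeat the same
    # commands into sizes-after.txt, then compare:
    diff sizes-before.txt sizes-after.txt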

-- 
Stan

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs

