Re: XFS File system in trouble

Hi Leslie,

My two cents here: it appears you are using an AMD FX CPU on an ASUS Sabertooth motherboard?

I would strongly suggest you use unbuffered ECC DIMMs in your system.  mcelog will warn of ECC errors in your DIMMs.  ECC will correct single-bit errors and at least detect multi-bit errors.

I have had AMD Opteron servers with registered ECC DIMMs that logged continuous correctable ECC errors yet kept running HPC jobs for up to a month without any crashes, until I could schedule downtime for DIMM replacement.  The errors are flagged either by the BMC (service processor) or by mcelog.

All my PCs and workstations, at work and at home, with consumer AMD Athlon 64 and AMD Phenom II CPUs, have had unbuffered ECC DIMMs on ASUS motherboards.  I have never had any memory errors, and I know that if there are any I will get notified.
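
As a rough illustration of where those notifications come from: besides mcelog, Linux exposes per-memory-controller ECC counters through the EDAC sysfs tree. The sketch below assumes the EDAC driver for the platform is loaded; edac_counts is a hypothetical helper name, and the optional directory argument exists only so the function can be pointed at a different tree.

```shell
#!/bin/sh
# Sketch: print corrected/uncorrected ECC error counts per memory controller.
# edac_counts is a hypothetical helper, not part of mcelog or any package.
# Assumption: the EDAC driver is loaded, so mc*/ce_count and mc*/ue_count exist.
edac_counts() {
    edac_dir="${1:-/sys/devices/system/edac/mc}"
    found=0
    for mc in "$edac_dir"/mc*; do
        # Skip anything that is not a memory controller directory.
        [ -r "$mc/ce_count" ] || continue
        found=1
        ce=$(cat "$mc/ce_count")
        ue=$(cat "$mc/ue_count" 2>/dev/null)
        echo "$(basename "$mc"): corrected=$ce uncorrected=${ue:-unknown}"
    done
    if [ "$found" -ne 1 ]; then
        echo "no EDAC memory controllers under $edac_dir" >&2
        return 1
    fi
}

# Example: edac_counts
```

A corrected count that keeps climbing on one controller is the sort of signal that would prompt a DIMM swap before anything crashes.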


Chin Gim Leong


From: Leslie Rhorer <lrhorer@xxxxxxxxxxxx>
To: Martin Papik <mp6058@xxxxxxxxx>
Cc: xfs@xxxxxxxxxxx
Sent: Monday, 20 July 2015, 16:35
Subject: Re: XFS File system in trouble

On 7/20/2015 3:05 AM, Martin Papik wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
>
> Since you've already found one HW related fault, would you consider
> booting into memtest for a couple of passes just to be on the safe
> side.

    I did that after confirming the one stick of memory was bad.  Twice.  I
got over 20,000 errors on the bad stick, and 0 on the good one.  I also
swapped the locations on the motherboard, and the bad stick still failed
while the good one passed 100%.

> And did you by any chance look at SMART if applicable and
> possibly running a test on the drives.

    Yes. SMART found no errors, but think about it.  Every time tar tries
to create a directory when untarring that file in that location, the
file system croaks.  Not when reading, and not when writing other than
when it creates a directory.  When I create the directory manually, the
process quits failing at that point and fails later on during a
different directory create.  The array
remains intact when reading, and dmesg shows no drive errors.  I've
re-synced the array, which reads every byte on all 8 drives without a
single mismatch - several times.  To my knowledge, no read has ever
failed except after the filesystem goes offline.  I thought reads were
failing during the CRC checks, but that was a red herring.

> Another test I sometimes do
> when I'm unsure about disks is "cat /dev/sda > /dev/null" (i.e. a
> whole disk read test)

echo repair > /sys/block/md0/md/sync_action reads not one drive, but
every byte on all 8 drives.

> and see (dmesg) if any errors show up, unless

    'Nary one, and no mismatches.
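
For anyone following along, a minimal sketch of how the result of such a pass can be read back: once the md scrub finishes, sync_action returns to idle and mismatch_cnt holds the count from the last check or repair. md_mismatches is a hypothetical helper name, and the md0 default is simply the array named above.

```shell
#!/bin/sh
# Sketch: show the current md sync state and the mismatch count.
# md_mismatches is a hypothetical helper; the argument is the md sysfs
# directory and defaults to /sys/block/md0/md (assumed array name).
md_mismatches() {
    md_dir="${1:-/sys/block/md0/md}"
    if [ ! -r "$md_dir/mismatch_cnt" ]; then
        echo "no mismatch_cnt under $md_dir" >&2
        return 1
    fi
    # sync_action reads "idle" once a check/repair/resync has finished.
    action=$(cat "$md_dir/sync_action" 2>/dev/null)
    count=$(cat "$md_dir/mismatch_cnt")
    echo "sync_action=${action:-unknown} mismatch_cnt=$count"
}

# Example: md_mismatches /sys/block/md0/md
```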




> you're willing to run badblocks in a read-write nondestructive mode.
> In my experience the read test or badblocks can be run simultaneously
> with smartctl -t long. But as a start I'd look at smartctl --all
> /dev/sd? and see if there are any bad signs. I hope this helps. Good luck
>
>
> On 07/20/2015 10:41 AM, Leslie Rhorer wrote:
>> On 7/19/2015 6:27 PM, Dave Chinner wrote:
>>> On Sat, Jul 18, 2015 at 08:02:50PM -0500, Leslie Rhorer wrote:
>>>>
>>>> I found the problem with md5sum (and probably nfs, as well).
>>>> One of the memory modules in the server was bad.  The problem
>>>> with XFS persists.  Every time tar tried to create the
>>>> directory:
>>>
>>> Now you need to run xfs_repair.
>>
>> I do that every time the array implodes.  It makes no difference.
>> It never mentions cleaning the structure tar says needs cleaning,
>> and the next time I run tar on that file, the filesystem craters.
>>
>> _______________________________________________ xfs mailing list
>> xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iQIcBAEBCgAGBQJVrKuzAAoJELsEaSRwbVYrdjoP/3n1W9YtcpdiDoylp6tDYcjF
> vEVz7IWLv2cOky8Lp+0WAZ4Z0WMhcutFzT571H1Vc+jT/UgO25pQHa3yLYTboPuZ
> +tBidVUycs7ZIr9QCZFs2uPQ/7YstamB+F7paCTMKtOJJr5CZLiYX4iyJ9sFmWVY
> UFPAIhyoqD5CFgoaAkwCmk50kNiT0aPM7egizIUVEt14cWuxZxMN0NIJ5b0WJfAk
> qtNQjstVI/xYDgsImm2ZAm19SfOG9ltm2G9zafRr6lR6rRtXjtZX8zEg0l/o9XUw
> OifghjoSup8OCzvX6+4+Soj/3mCKZv4rkBm3exf4YzfQ9eVG6Ktele2rLIs1sl3O
> hUrZUNEl8hYGJeb5gBHFV/TLWDMMwNde/6JiBVy0V8EbDF1lvR4jYpUwThOE0jyL
> ZbzZe4N/B0qvB1OpLDkHrMVm9NPtDkfXdTtM2kRmo5955xtkK09yHF/v64kz7IKc
> 2rM5pOwTR6HWE8RF2j9UujgPjw6nEUuY01TvIMGYzMfkJTI+sVjeDQfwnPG8tzIa
> x4uLa4vTrBD5IaICjAmQiY69qqmt5Vg42G4latZVTYQLelvWQ774mXZfgfT/GtbT
> RKzVwvYowWr/EBhtp7ix/1rWANTFiX0lxOPnRmUFvu8UJnyZhR0/EYbJYy1+jTt7
> O7hZMfAayQBsnVcSK1JC
> =3Ubd
> -----END PGP SIGNATURE-----
>

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs