On Tue, Dec 16, 2014 at 12:34:05PM +0100, Emmanuel Florac wrote:
> The RAID hardware is an Adaptec 71685 running the latest firmware
> (32033). This is a 16-drive RAID-6 array of 4 TB HGST drives. The
> problem occurs repeatedly with any combination of 7xx5 controllers
> and 3 or 4 TB HGST drives in RAID-6 arrays of various types, with XFS
> or JFS (it never occurs with either ext4 or reiserfs).

Do you have systems with any other type of 3/4TB drives in them?

> As I mentioned, when the disk drives' cache is on the corruption is
> serious. With the disk cache off, the corruption is minimal, but the
> filesystem shuts down.

That really sounds like a hardware problem - maybe with the disk drives
themselves, not necessarily the controller.

> The filesystem has been primed with a few (23) terabytes of mixed
> data: small (a few KB or less), medium, and big (a few gigabytes or
> more) files. Two simultaneous, long-running copies are made (cp -a
> somedir someotherdir), while three simultaneous, long-running read
> operations are run (md5sum -c mydir.md5 mydir), all while the array
> is busy rebuilding. Disk usage (as reported by iostat -mx 5) stays
> solidly at 100%, with a continuous throughput of a few hundred
> megabytes per second. The full test runs for about 12 hours (when not
> failing), and ends up copying 6 TB or so and md5summing 12 TB or so.

> > I'd start with upgrading the firmware on your RAID controller and
> > turning the XFS error level up to 11....
>
> The firmware is the latest available. How do I turn logging up to 11,
> please?

# echo 11 > /proc/sys/fs/xfs/error_level

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
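
[Editor's note: the stress workload Emmanuel describes above (prime data,
copy with cp -a, verify with md5sum -c) can be sketched at toy scale as
follows. The paths, file counts, and sizes here are illustrative stand-ins,
not taken from the original report.]

```shell
#!/bin/sh
# Toy-scale sketch of the reported workload: prime a directory with data,
# copy it with cp -a, and verify checksums with md5sum -c.
# All paths and sizes are illustrative, not from the original report.
set -e

work=$(mktemp -d)
mkdir "$work/somedir"

# Prime the tree with a few small files of random data.
for i in 1 2 3; do
    head -c 4096 /dev/urandom > "$work/somedir/file$i"
done

# Record checksums of the primed data.
( cd "$work" && md5sum somedir/* > somedir.md5 )

# The long-running copy pass (cp -a somedir someotherdir in the report).
cp -a "$work/somedir" "$work/someotherdir"

# The read/verify pass (md5sum -c in the report); prints nothing on
# success with --quiet, so append OK as a success marker.
result=$(cd "$work" && md5sum -c --quiet somedir.md5 && echo OK)

rm -rf "$work"
echo "$result"
```

At full scale the report runs two copy passes and three verify passes concurrently, for roughly 12 hours, while the array rebuilds.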