Re: easily reproducible filesystem crash on rebuilding array

Emmanuel Florac <eflorac@xxxxxxxxxxxxxx> · Wed, 17 Dec 2014 12:21:59 +0100

On Wed, 17 Dec 2014 06:58:15 +1100
Dave Chinner <david@xxxxxxxxxxxxx> wrote:

> On Tue, Dec 16, 2014 at 12:34:05PM +0100, Emmanuel Florac wrote:
> > The RAID hardware is an Adaptec 71685 running the latest firmware
> > (32033). This is a 16-drive RAID-6 array of 4 TB HGST drives. The
> > problem occurs repeatedly with any combination of 7xx5 controllers
> > and 3 or 4 TB HGST drives in RAID-6 of various types, with XFS or
> > JFS (it never occurs with either ext4 or reiserfs).
> 
> Do you have systems with any other type of 3/4TB drives in them?

No, only HGST drives.

> > As I mentioned, when the disk drives' cache is on, the corruption is
> > serious. With the disk cache off, the corruption is minimal; however,
> > the filesystem shuts down.
> 
> That really sounds like a hardware problem - maybe with the disk
> drives themselves, not necessarily the controller.

Actually, the problem occurs without any error in the controller log: no
IO error, no disk timeout, no bad block, nothing. Until now I was pretty
confident that the Adaptec firmware was the culprit, but I'm not so sure
anymore.

> > > I'd start with upgrading the firmware on your RAID controller and
> > > turning the XFS error level up to 11....
> > 
> > The firmware is the latest available. How do I turn logging up to
> > 11, please?
> 
> # echo 11 > /proc/sys/fs/xfs/error_level
> 

Thanks, done. I'm running the test again, but *without using LVM* this
time. I'm changing one parameter at a time...
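
For anyone following along, here is a minimal sketch of how the error
level could be raised and checked in one step. The helper name
set_xfs_error_level and its optional second argument are my own
invention for illustration; the only thing taken from this thread is the
/proc/sys/fs/xfs/error_level path and the value 11. The upper bound of
11 is assumed to be the maximum the kernel accepts.

```shell
#!/bin/sh
# Hypothetical helper (not part of any XFS tooling): write a new error
# level and print the old and new values so the change can be confirmed.
# Usage: set_xfs_error_level LEVEL [FILE]
# FILE defaults to the procfs path quoted in this thread; passing another
# path is only useful for dry runs against an ordinary file.
set_xfs_error_level() {
    level="$1"
    file="${2:-/proc/sys/fs/xfs/error_level}"

    # Accept only non-negative integers.
    case "$level" in
        ''|*[!0-9]*)
            echo "error level must be a non-negative integer" >&2
            return 1
            ;;
    esac

    # Assumption: 11 is the highest level the kernel accepts.
    if [ "$level" -gt 11 ]; then
        echo "error level must be between 0 and 11" >&2
        return 1
    fi

    old=$(cat "$file") || return 1
    echo "$level" > "$file" || return 1
    echo "error_level: $old -> $(cat "$file")"
}
```

On a real system this needs root, e.g. "set_xfs_error_level 11", and the
setting does not survive a reboot, so it has to be reapplied before each
test run.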

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@xxxxxxxxxxxxxx>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------
