Re: RAID6 check found different events, how should I proceed?

NeilBrown <neilb@xxxxxxx> · Tue, 9 Aug 2011 08:57:04 +1000

On Sat, 6 Aug 2011 17:02:48 +0100 Mathias Burén <mathias.buren@xxxxxxxxx>
wrote:

> On 6 August 2011 14:23, Mathias Burén <mathias.buren@xxxxxxxxx> wrote:
> > My RAID6 is currently degraded with one HDD (panic mail on the list),
> > and my weekly cron job kicked in doing the RAID6 check action. This is
> > the result:
> >
> > DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
> > sdb1  6239487  0      0     0       2    0    0
> > sdc1  6239487  0      0     0       0    0    0
> > sdd1  6239487  0      0     0       0    0    0
> > sde1  6239487  0      0     0       0    0    0
> > sdf1  6239490  0      0     0       0    49   6
> > sdg1  6239491  0      0     0       0    0    0
> > sdh1  (missing, on RMA trip)
> >
> (snip)
> > * Should I run a repair?
> > * Should I run a check again, to see if the event count changes?
> > * Is it likely I've 2 more bad harddrives that will die soon?
> > * Is it wise to run another smartctl -t long on all devices?
> >
> > Thanks,
> > Mathias
> >
> 
> A follow-up:
> 
> I ran smartctl -t long on all devices, and they all passed, SMART is
> fine. The number of events is also the same for all HDDs now:
> 
> DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
> sdb1  6244415  0      0     0       2    0    0
> sdc1  6244415  0      0     0       0    0    0
> sdd1  6244415  0      0     0       0    0    0
> sde1  6244415  0      0     0       0    0    0
> sdf1  6244415  0      0     0       0    49   6
> sdg1  6244415  0      0     0       0    0    0
> sdh1
> 
> This is without me running repair or anything like that.

The thing that you did which produced the change was that you let time pass.

Presumably there was a time delay (maybe small) between extracting the
'events' number from sde1 and sdf1, and again between sdf1 and sdg1.  During
those intervals the event count on every device in the array was updated.
This implies some thread was writing, but possibly not very heavily.

When you sampled them all the second time and got the same number there were
presumably no writes happening, so the event numbers didn't change.
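
To make the sampling effect concrete, here is a minimal sketch in Python.
It assumes the 'events' numbers come from the "Events :" line printed by
"mdadm --examine" for each member (the original script is not shown, so
that is a guess), and the helper names are made up for illustration.  It
collects the output for every device first and parses afterwards, which
shrinks the window between samples but does not close it.

```python
import re
import subprocess

# Matches the "Events : N" line in `mdadm --examine` output.
EVENTS_RE = re.compile(r"^\s*Events\s*:\s*(\S+)", re.MULTILINE)


def parse_events(examine_output):
    """Return the event count (as a string) from one device's output, or None."""
    match = EVENTS_RE.search(examine_output)
    return match.group(1) if match else None


def events_agree(outputs):
    """outputs maps device name -> examine text.

    Returns (all_equal, counts) where counts maps device -> parsed value.
    """
    counts = {dev: parse_events(text) for dev, text in outputs.items()}
    return len(set(counts.values())) == 1, counts


def examine_all(devices):
    """Hypothetical collector: run mdadm on each device (needs root).

    The reads still happen one after another, so a write landing between
    two of them can produce differing counts on a healthy array.
    """
    return {
        dev: subprocess.run(
            ["mdadm", "--examine", dev], capture_output=True, text=True
        ).stdout
        for dev in devices
    }
```

If events_agree() reports a mismatch on a busy array, sampling again once
the array is idle is a cheap way to tell sampling skew from a real
difference.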

When there are occasional writes, the array oscillates between 'clean' and
'active', and each change updates the 'events' number.
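
One rough way to watch that oscillation is to poll the array state that md
exposes in sysfs.  The sketch below assumes the array is md0 (the thread
does not name it) and uses made-up helper names.  Polling can miss
transitions that happen between two reads, so the result is a lower bound
on the number of state changes, not an exact events delta.

```python
import time


def read_array_state(md="md0"):
    """Read the current state string, e.g. 'clean' or 'active'.

    Assumption: the array is md0; adjust the name for your system.
    """
    with open(f"/sys/block/{md}/md/array_state") as f:
        return f.read().strip()


def count_transitions(states):
    """Count how many times consecutive samples differ."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def sample_states(md="md0", samples=20, interval=0.5):
    """Poll the state `samples` times, sleeping `interval` seconds between reads."""
    seen = []
    for _ in range(samples):
        seen.append(read_array_state(md))
        time.sleep(interval)
    return seen
```

A non-zero count_transitions(sample_states()) on an otherwise quiet box
would be consistent with light, occasional writes bumping the event count.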

NeilBrown
