I'd do _nothing_ until I got a replacement drive. Then plug that in
and let the array regain full redundancy.
After that you can start stressing the disks with the actions you
suggested if you like.
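For reference, once the replacement is in the machine, re-adding it and
watching the resync would look roughly like this (the device name is only
an assumption based on the output below, adjust to whatever the new disk
shows up as):

  # partition the new disk like the others, then add it as a member
  mdadm --manage /dev/md0 --add /dev/sdh1
  # watch the recovery until the array shows [7/7] [UUUUUUU] again
  cat /proc/mdstat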
Alex.
Quoting Mathias Burén <mathias.buren@xxxxxxxxx>:
First, thanks for this:
The primary purpose of data scrubbing a RAID is to detect & correct
read errors on any of the member devices; both check and repair
perform this function. Finding (and w/ repair correcting) mismatches
is only a secondary purpose - it is only if there are no read errors
but the data copy or parity blocks are found to be inconsistent that a
mismatch is reported. In order to repair a mismatch, MD needs to
restore consistency by overwriting the inconsistent data copy or
parity blocks w/ the correct data. But, because the underlying member
devices did not return any errors, MD has no way of knowing which
blocks are correct, and which are incorrect; when it is told to do a
repair, it makes the assumption that the first copy in a RAID1 or
RAID10, or the data (non-parity) blocks in RAID4/5/6 are correct, and
corrects the mismatch based on that assumption.
That assumption may or may not be correct, but MD has no way of
determining that reliably - but the user might be able to, by using
additional knowledge or tools, so MD gives the user the option to
perform data scrubbing either with (repair) or without (check) MD
correcting the mismatches using that assumption.
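(For what it's worth, a minimal sketch of driving those two actions by
hand, assuming the array is md0 as in the output below:)

  # scrub without rewriting mismatches; read errors are still corrected
  echo check > /sys/block/md0/md/sync_action
  # mismatches found by the last scrub
  cat /sys/block/md0/md/mismatch_cnt
  # scrub and also rewrite mismatches, using the assumption described above
  echo repair > /sys/block/md0/md/sync_action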
I hope that answers your question,
Beolach
My RAID6 is currently degraded, missing one HDD (see my panic mail on
the list), and my weekly cron job kicked in and ran the check action on
it. This is the result:
DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE END
sdb1  6239487  0      0     0       2    0    0
sdc1  6239487  0      0     0       0    0    0
sdd1  6239487  0      0     0       0    0    0
sde1  6239487  0      0     0       0    0    0
sdf1  6239490  0      0     0       0    49   6
sdg1  6239491  0      0     0       0    0    0
sdh1  (missing, on RMA trip)
(so the SMART is actually fine for all drives)
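(The per-drive counters above come from the SMART attribute table;
something along these lines per member gives the raw values, the
attribute names below being the usual smartctl ones rather than my
abbreviations:)

  # dump the attribute table for one member, e.g. sdb
  smartctl -A /dev/sdb
  # or just the counters of interest
  smartctl -A /dev/sdb | grep -E 'Reallocated|Pending|Uncorrect|CRC'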
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf1[5] sdg1[0] sdd1[4] sde1[7] sdc1[3] sdb1[1]
9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 [7/6] [UUUUU_U]
unused devices: <none>
/dev/md0:
Version : 1.2
Creation Time : Tue Oct 19 08:58:41 2010
Raid Level : raid6
Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
Raid Devices : 7
Total Devices : 6
Persistence : Superblock is persistent
Update Time : Sat Aug 6 14:13:08 2011
State : clean, degraded
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Name : ion:0 (local to host ion)
UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
Events : 6239491
Number Major Minor RaidDevice State
0 8 97 0 active sync /dev/sdg1
1 8 17 1 active sync /dev/sdb1
4 8 49 2 active sync /dev/sdd1
3 8 33 3 active sync /dev/sdc1
5 8 81 4 active sync /dev/sdf1
5 0 0 5 removed
7 8 65 6 active sync /dev/sde1
So sdf1 and sdg1 have different event counts from the rest. Does this
mean the HDDs have silently corrupted the data? I have no way of
checking whether the data itself is corrupt or not, except perhaps an
fsck of the filesystem? Does that make sense?
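(A read-only filesystem check would be possible without touching
anything; a rough sketch, assuming the filesystem sits directly on
/dev/md0 and is unmounted:)

  # -n answers "no" to every repair prompt, so nothing gets written
  fsck -n /dev/md0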
* Should I run a repair?
* Should I run a check again, to see if the event count changes?
* Is it likely I have 2 more bad hard drives that will die soon?
* Is it wise to run another smartctl -t long on all devices? (rough
sketch of what I mean below)
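Something like this is what I had in mind (just a sketch, same device
names as above):

  # long self-test on every present member; runs inside the drives, takes hours
  for d in /dev/sd[b-g]; do smartctl -t long "$d"; done
  # check the results later
  smartctl -l selftest /dev/sdf
  # and another scrub pass, to see whether anything changes
  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt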
Thanks,
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html