On 05/01/2010 11:44 PM, Janos Haar wrote:
But you are right, because the sync_min option does not work for
rebuilding disks, only for resyncing. (It is too smart to do the trick
for me.)
I think... unless bitmaps really do some magic here, flagging the
newly introduced disk as more recent than the parity data... but do they
really do this? People correct me if I'm wrong.
Bitmap manipulation should work.
I think I know how to do that, but the data is more important than
trying it on my own.
I want to wait until somebody supports this.
... or does somebody have another good idea?
Firstly: do you have any backup of your data? If not, before doing any
experiment I suggest that you back up the important stuff. This can be done
with rsync, reassembling the array every time it goes down. I suggest
putting the array in read-only mode (mdadm --readonly /dev/md3):
this should prevent resyncs from starting automatically, and AFAIR it even
prevents drives from being dropped because of read errors (but you can't use
it during resyncs or rebuilds). Resyncs are bad because they will
eventually bring down your array. Don't use DM when doing this.
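Roughly something like this (the mount point and backup destination are
just made-up paths for the example):

    mdadm --readonly /dev/md3                    # no automatic resync while copying
    rsync -a --partial /mnt/md3/ /backup/md3/    # example paths only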
Now, for the real thing: instead of experimenting with bitmaps, I
suggest you try and see if the normal MD resync works. If that works,
then you can do the normal rebuild.
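As a sketch, assuming the array is md3 and has been made writable again
(a read-only array won't resync):

    mdadm --readwrite /dev/md3                    # undo --readonly first
    echo repair > /sys/block/md3/md/sync_action   # "repair" rewrites, "check" only reads
    cat /proc/mdstat                              # watch the progress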
*Please note: DM should not be needed!* - I know that you have tried
resyncing with a DM COW under MD and that it doesn't work well in this
case, but in fact DM should not be needed.
We pointed you to DM around Apr 23rd because at that time we thought
that your drives were being dropped for uncorrectable read errors, but we
had guessed wrong.
The general MD philosophy is that if there is enough parity
information, drives are not dropped just for a read error. Upon a read
error, MD recomputes the value of the sector from the parity information
and then attempts to rewrite the block in place. During this rewrite
the drive performs a reallocation, moving the block to a hidden spare
region. If the rewrite fails, it means the drive is out of spare
sectors; this is considered a major failure by MD, and only at
that point is the drive dropped.
So we thought this was the reason in your case too, but we were wrong:
in your case it was an MD bug, the one for which I submitted the patch.
So it should work now (without DM), and I think this is the safest thing
you can try. Having a backup is always better, though.
So start the resync without DM and see if it goes through to the end
without dropping drives. You can use sync_min to cut the dead times.
For maximum safety you could first try resyncing only one chunk from the
region of the damaged sectors, so as to provoke only a minimal amount of
rewrites. Set sync_min to the location of the errors, and sync_max
to just one chunk above. See what happens...
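For example (the offsets below are invented; AFAIR sync_min/sync_max are
in 512-byte sectors and want to be aligned to a chunk boundary, and here
I'm assuming a 512k chunk = 1024 sectors; take the real sector number
from the read errors in dmesg):

    echo 2000000000 > /sys/block/md3/md/sync_min   # start of the bad region (made up)
    echo 2000001024 > /sys/block/md3/md/sync_max   # one chunk above
    echo repair > /sys/block/md3/md/sync_action    # rewrite only that window
    cat /proc/mdstat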
If it rewrites correctly and the drive is not dropped, then run "check"
again on the same region and see if "cat /sys/block/md3/md/mismatch_cnt"
still returns zero (or the value it had before the rewrite). If it is
zero (or at least has not changed) it means the block was really
rewritten with the correct value: recovery of one sector really works
for raid6 in singly-degraded state. Then the procedure is safe, as far
as I understand, and you can go ahead with the other chunks.
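Something like this, with the same invented offsets as above:

    echo 2000000000 > /sys/block/md3/md/sync_min
    echo 2000001024 > /sys/block/md3/md/sync_max
    echo check > /sys/block/md3/md/sync_action     # read-only verification pass
    # when it finishes:
    cat /sys/block/md3/md/mismatch_cnt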
When all damaged sectors have been reallocated, there are no more read
errors, and mismatch_cnt is still at zero, you can go ahead and replace
the defective drive.
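Something along these lines (sdX is the defective drive, sdY the new
one; both are placeholders):

    mdadm /dev/md3 --fail /dev/sdX --remove /dev/sdX
    mdadm /dev/md3 --add /dev/sdY
    cat /proc/mdstat        # the rebuild onto the new drive should start by itself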
There are a few reasons that could still make the resync fail if we are
really unlucky, but dmesg should point us in the right direction in that
case.
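E.g. something as simple as:

    dmesg | grep -i -E 'md3|ata|raid'    # look for errors around the time of the failure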
Also remember that the patch still needs testing... currently it is not
really tested, because DM drops the drive before MD does. We would need to
know whether raid6 is behaving like a raid6 now or is still behaving like a
raid5...
Thank you