Re: resync'ing - what is going on

On Thursday July 10, keld@xxxxxxxx wrote:
> I would like to know what is going on wrt resyncing, how it is done.
> This is because I have some ideas to speed up the process. 
> I have noted for a 4 drive raid10,f2 array that only about 25 % of the
> IO speed is used during the rebuild; I would like to have something like
> 90 % as a goal.

"resync" and "recovery" are handled very differently in raid10.
"check" and "repair" are special cases of "resync".

"recovery" walks addresses from the start to the end of the component
drives.
At each address, it considers each drive which is being recovered and
finds a place on a different device to read the block for the current
(drive,address) from.  It schedules a read and when the read request
completes it schedules the write.

On an f2 layout, this will read one drive from halfway to the end,
then from the start to halfway, and will write the other drive
sequentially.
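To make that concrete, here is a sketch (illustrative Python, not the md kernel code) of where recovery reads from for a 2-drive f2 array; `half` stands for half the component-device size, i.e. the far-copy offset, and the function name is invented for the example:

```python
# Illustrative sketch only -- not the md kernel code.
# For a 2-drive raid10,f2 array, each block in the first half of one
# drive is mirrored in the second half of the other drive, and vice
# versa.  `half` is half the component-device size (the far-copy offset).
def recovery_read_source(addr, half):
    """Return the address on the surviving drive holding the mirror
    of the block at device address `addr` on the failed drive."""
    if addr < half:
        return addr + half   # first half -> other drive's second half
    return addr - half       # second half -> other drive's first half
```

Walking `addr` from 0 to the end of the device therefore reads the surviving drive from halfway to the end, then from the start to halfway, while the writes to the rebuilt drive stay sequential.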

"resync" walks the addresses from the start to end of the array.
At each address it reads every device block which stores that
array block.  When all the reads complete the results are compared.
If they are not all the same, the "first" block is written out
to the others.  (I think I might have told you before that it reads
one block and writes the others; I checked the code, and that was
wrong.)

Here "first" means (I think) the block with the earliest device
address, and if there are several of those, the block with the least
device index.
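One resync step could be sketched like this (my own illustration of the rule above; the tuple layout is invented for the example):

```python
# Illustrative sketch of one resync step -- not the kernel code.
# `copies` holds (device_index, device_address, data) for every device
# block that stores the same array block.
def resync_block(copies):
    """Return the (device_index, data) writes needed to make all
    copies identical, propagating the "first" block."""
    # "first": earliest device address, ties broken by lowest device index
    first = min(copies, key=lambda c: (c[1], c[0]))
    return [(idx, first[2]) for idx, _, data in copies if data != first[2]]
```

If all the copies already match, the list of writes comes back empty and nothing is scheduled.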

So for f2, this will read from both the start and the middle of
both devices.  It will read 64K at a time, so you should get at least
a 32K read at each position before a seek (more with a larger chunk
size).

Clearly this won't be fast.

The reason this algorithm was chosen was that it makes sense for every
possible raid10 layout, even though it might not be optimal for some
of them.

> 
> This is especially for raid10,f2, where I think I can make it much
> better, but possibly also for other raid types, as input to an
> explanation on the wiki of what is really going on. 

Were I to try to make it fast for f2, I would probably shuffle the
addresses in each request so that it did all the 'even' chunks first,
then all the 'odd' chunks.
e.g. map
   0 1 2 3 4 5 6 7 8 ...
to
   0 1 4 5 8 9 .....  2 3 6 7 10 11 ....
(assuming a chunk size of '2').
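A sketch of that reordering (illustrative Python, not a patch):

```python
# Illustrative sketch of the proposed resync ordering for f2:
# all 'even' chunks first, then all 'odd' chunks.
def shuffled_order(n_addrs, chunk=2):
    """Reorder addresses 0..n_addrs-1 as in the mapping above."""
    even = [i for i in range(n_addrs) if (i // chunk) % 2 == 0]
    odd  = [i for i in range(n_addrs) if (i // chunk) % 2 == 1]
    return even + odd
```

With 12 addresses and a chunk size of 2 this yields 0 1 4 5 8 9 2 3 6 7 10 11, matching the mapping above.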

The problem with this is that if you shut down while partway through a
resync and then boot into a kernel which used a different sequence, it
would finish the resync checking the wrong blocks.
This is annoying but should not be insurmountable.

This way we leave the basic algorithm the same, but introduce
variations in the sequence for different specific layouts.

> 
> Are there references on the net? I tried to look but did not really find
> anything.

Just the source, sorry.

> 
> I don't really understand why resync is going on for raid10,f2.
> But maybe it checks all of the array, and checks that the two copies are
> identical. Is that so? I got some communication with Neil that some
> writing is involved in the resync; I don't understand why.

raid1 does resync simply by reading one device and writing all the
others, and this is conceptually easiest.  I had mistakenly thought
that I had used the same approach in raid10.
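That raid1 approach, conceptually (devices modelled as plain lists of blocks, purely for illustration):

```python
# Conceptual raid1 resync -- devices modelled as plain lists of blocks.
def raid1_resync(devices):
    """Read each block from the first device and write it to all
    the others."""
    src, rest = devices[0], devices[1:]
    for addr, data in enumerate(src):   # one read per block...
        for dev in rest:                # ...one write per other mirror
            dev[addr] = data
```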

> 
> And what happens if a discrepancy is found? Which of the 2 copies is the
> good one? Maybe one could check whether there are any CRC errors or disk read
> retries going on. I could understand if it was a raid10,f3 - then if one
> was different from the 2 other copies - you could correct the odd copy.

There is no "good" block - if they are different, then all are wrong.
md/raid just tries to return a consistent value, and leave it up to
the filesystem to find and correct any errors.

> 
> For raid5 and raid6 I could imagine that the parity blocks were checked.

If any inconsistency is found during a resync of raid4/5/6 the parity
blocks are changed to remove the inconsistency.  This may not be
"right", but it is least likely to be "wrong".
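For raid5 that means recomputing the XOR parity from the data blocks; a sketch (illustrative only, with toy single-byte blocks):

```python
from functools import reduce

# Illustrative sketch: recompute the raid5 XOR parity block from the
# data blocks, so the stripe becomes consistent again (the data is
# left alone; only the parity is rewritten).
def recompute_parity(data_blocks):
    """XOR all data blocks together to get the parity block."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  data_blocks)
```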

> 
> I could of course read the code, but I would like an overview before
> delving into that part.

Sensible :-)
Enjoy your reading.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
