Re: Hot-replace for RAID5

BTW, thank you very much for the fix for layout=preserve. As soon as the
current reshape finishes, I am going to move on to the other arrays.

Are the regressions in 2.3.4 serious, and to which version should I
apply the patch? Or, from your look at the code, should
layout=left-symmetric-6 already work in 2.3.2?

Regarding reshaping speed, an estimate for doing things a lot more
sequentially gives much higher speeds. Let's say a 48 MB backup and 6
drives with 80 MB/s sequential speed. If you do the reshape like this:
- Read 8 MB sequentially from each drive in parallel: 0.1 s
- Write it to the backup file: 48/80 = 0.6 s
- Calculate Q for something like 48 MB (guessing 0.05 s) and write it
back to the different drives in parallel in 0.1 s. Because the data is
in the cache and you are only writing in this phase (?), there is no
back-and-forth seeking, and rotational latency applies only a couple
of times altogether, let's say 0.02 s.
- Update the superblock and move the heads back: two worst-case seeks,
0.03 s (I don't know how often you update the superblocks?)

you process 8 MB per drive in roughly 0.9 s, so the speed in this
scenario should be roughly 9 MB/s.
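
Just to make the arithmetic explicit, here is the same estimate as a
quick Python sketch (all the timing constants are the assumptions
above, not measurements):

    # Back-of-envelope estimate for a more sequential reshape pass.
    # Assumptions: 6 drives at 80 MB/s sequential, 48 MB backup window
    # (8 MB per drive), one superblock update per pass.
    drives = 6
    seq_mb_s = 80.0                      # sequential throughput per drive
    per_drive_mb = 8.0                   # MB read from each drive in parallel
    total_mb = drives * per_drive_mb     # 48 MB backup window

    t_read   = per_drive_mb / seq_mb_s   # 0.1 s, reads run in parallel
    t_backup = total_mb / seq_mb_s       # 0.6 s, write the window to backup
    t_q      = 0.05                      # guess: Q calculation
    t_write  = per_drive_mb / seq_mb_s   # 0.1 s, parallel writes back
    t_rot    = 0.02                      # a couple of rotational latencies
    t_super  = 0.03                      # superblock update, two worst seeks

    t_total = t_read + t_backup + t_q + t_write + t_rot + t_super      # ~0.9 s
    print("per-drive speed: %.1f MB/s" % (per_drive_mb / t_total))     # ~8.9 MB/s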

I guess the main real difference when you logically do it stripe by
stripe can be that while you wait for the chunk writes to complete
(are you waiting for real completion of the writes?), the gap between
the first and the last drive is often long enough that you need to
wait one or more rotations before writing the next stripe. If that is
the case, you need to add roughly 128 * (let's say 1.5) * 0.005 s =
0.96 s, and so we are down to roughly 4.3 MB/s theoretically.
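
The same penalty worked out in the sketch's terms (128 chunks per 8 MB
per-drive window, ~1.5 extra rotations per stripe and 5 ms per rotation
are my guesses, not measurements):

    # Extra cost if every stripe write has to wait for the platter to come around.
    chunks     = 128      # assumed chunks in an 8 MB per-drive window
    extra_rot  = 1.5      # assumed average extra rotations per stripe
    t_rotation = 0.005    # assumed 5 ms per rotation
    t_penalty  = chunks * extra_rot * t_rotation          # 0.96 s
    t_total    = 0.9 + t_penalty                          # ~1.86 s per 8 MB window
    print("per-drive speed: %.1f MB/s" % (8.0 / t_total)) # ~4.3 MB/s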

Patrik

On Tue, May 15, 2012 at 2:13 PM, NeilBrown <neilb@xxxxxxx> wrote:
> On Tue, 15 May 2012 13:56:58 +0200 Patrik Horník <patrik@xxxxxx> wrote:
>
>> Anyway, increasing it to 5K did not help, and the drives don't seem
>> to be fully utilized.
>>
>> Does the reshape work something like this:
>> - Read about X = (50M / N - 1 / stripe size) stripes from the drives
>> and write them to the backup file
>> - Reshape those X stripes one after another sequentially
>> - Reshape each stripe by reading chunks from all drives, calculating
>> Q, writing all chunks back, and doing I/O for the next stripe only
>> after finishing the previous one?
>>
>> So after increasing stripe_cache_size, the cache should hold the
>> stripes after backing them up, and reshaping should not need to read
>> them from the drives again?
>>
>> Can't the slow speed be caused by some synchronization issue? How
>> are the stripes read before being written to the backup file? Is it
>> done one by one, so that I/Os for the next stripe are issued only
>> after the previous stripe has been read completely? Or are they
>> issued in the most parallel way possible?
>
> There is as much parallelism as I could manage.
> The backup file is divided into 2 sections.
> Write to one, then the other, then invalidate the first and write to it, etc.
> So while one half is being written, the data in the other half is being
> reshaped in the array.
> Also the stripe reads are scheduled asynchronously and as soon as a stripe is
> fully available, the Q is calculated and they are scheduled for write.
>
> The slowness is due to continually having to seek back a little way to
> overwrite what has just been read, and also having to update the metadata each time
> to record where we are up to.
>
> NeilBrown
>
>
>>
>> Patrik
>>
>>
>> On Tue, May 15, 2012 at 1:28 PM, NeilBrown <neilb@xxxxxxx> wrote:
>> > On Tue, 15 May 2012 13:16:42 +0200 Patrik Horník <patrik@xxxxxx> wrote:
>> >
>> >> Can I increase it during reshape by echo N >
>> >> /sys/block/mdX/md/stripe_cache_size?
>> >
>> > Yes.
>> >
>> >
>> >>
>> How is the size determined? I have only 1027 while I have 8 GB of system memory...
>> >
>> > Not very well.
>> >
> It is set to 256, or the minimum size needed to allow the reshape to proceed
> (which means about 4 chunks' worth). I should probably add some auto-sizing
>> > but that sort of stuff is hard :-(
>> >
>> > NeilBrown
>> >
>