Re: unbelievably bad performance: 2.6.27.37 and raid6

On Tue, Nov 3, 2009 at 5:07 AM, Goswin von Brederlow <goswin-v-b@xxxxxx> wrote:
> "NeilBrown" <neilb@xxxxxxx> writes:
>
>> A reshape is a fundamentally slow operation.  Each block needs to
>> be read and then written somewhere else so there is little opportunity
>> for streaming.
>> An in-place reshape (i.e. the array doesn't get bigger or smaller) is
>> even slower as we have to take a backup copy of each range of blocks
>> before writing them back out.  This limits streaming even more.
>>
>> It is possible to make it faster than it currently is by increasing the
>> array's stripe_cache_size and also increasing the 'backup' size
>> that mdadm uses.  mdadm-3.1.1 will try to do better in this respect.
>> However it will still be significantly slower than e.g. a resync.
>>
>> So reshape will always be slow.  It is a completely different issue
>> to filesystem activity on a RAID array being slow.  Recent reports of
>> slowness are, I think, not directly related to md/raid.  It is either
>> the filesystem or the VM or a combination of the two that causes
>> these slowdowns.
>>
>>
>> NeilBrown
>
> Now why is that? Let's leave out the case of an in-place
> reshape. Nothing can be done to avoid making a backup of blocks
> there, which severely limits the speed.
>
> But the most common case should be growing an array. Let's look at the
> first few steps of a 3->4 disk raid5 reshape. Each step denotes a point
> where a sync is required:
>
> Step 0         Step 1         Step 2         Step 3         Step 4
>  A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D
> 00 01  p  x    00 01 02  p    00 01 02  p    00 01 02  p    00 01 02  p
> 02  p 03  x    02  p 03  x    03 04  p 05    03 04  p 05    03 04  p 05
>  p 04 05  x     p 04 05  x     x  x  x  x    06  p 07 08    06  p 07 08
> 06 07  p  x    06 07  p  x    06 07  p  x     x  x  x  x     p 09 10 11
> 08  p 09  x    08  p 09  x    08  p 09  x    08  p 09  x     x  x  x  x
>  p 10 11  x     p 10 11  x     p 10 11  x     p 10 11  x     x  x  x  x
> 12 13  p  x    12 13  p  x    12 13  p  x    12 13  p  x    12 13  p  x
> 14  p 15  x    14  p 15  x    14  p 15  x    14  p 15  x    14  p 15  x
>  p 16 17  x     p 16 17  x     p 16 17  x     p 16 17  x     p 16 17  x
> 18 19  p  x    18 19  p  x    18 19  p  x    18 19  p  x    18 19  p  x
> 20  p 21  x    20  p 21  x    20  p 21  x    20  p 21  x    20  p 21  x
>  p 22 23  x     p 22 23  x     p 22 23  x     p 22 23  x     p 22 23  x
> 24 25  p  x    24 25  p  x    24 25  p  x    24 25  p  x    24 25  p  x
> 26  p 27  x    26  p 27  x    26  p 27  x    26  p 27  x    26  p 27  x
>  p 28 29  x     p 28 29  x     p 28 29  x     p 28 29  x     p 28 29  x
>
> Step 5         Step 6         Step 7         Step 8         Step 9
>  A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D
> 00 01 02  p    00 01 02  p    00 01 02  p    00 01 02  p    00 01 02  p
> 03 04  p 05    03 04  p 05    03 04  p 05    03 04  p 05    03 04  p 05
> 06  p 07 08    06  p 07 08    06  p 07 08    06  p 07 08    06  p 07 08
>  p 09 10 11     p 09 10 11     p 09 10 11     p 09 10 11     p 09 10 11
>  x  x  x  x    12 13 14  p    12 13 14  p    12 13 14  p    12 13 14  p
>  x  x  x  x    15 16  p 17    15 16  p 17    15 16  p 17    15 16  p 17
> 12 13  p  x     x  x  x  x    18  p 19 20    18  p 19 20    18  p 19 20
> 14  p 15  x     x  x  x  x     p 21 22 23     p 21 22 23     p 21 22 23
>  p 16 17  x     x  x  x  x    24 25 26  p    24 25 26  p    24 25 26  p
> 18 19  p  x    18 19  p  x     x  x  x  x    27 28  p 29    27 28  p 29
> 20  p 21  x    20  p 21  x     x  x  x  x    30  p 31 32    30  p 31 32
>  p 22 23  x     p 22 23  x     x  x  x  x     p 33 34 35     p 33 34 35
> 24 25  p  x    24 25  p  x     x  x  x  x    36 37 38  p    36 37 38  p
> 26  p 27  x    26  p 27  x    26  p 27  x     x  x  x  x    39 40  p 41
>  p 28 29  x     p 28 29  x     p 28 29  x     x  x  x  x    42  p 43 44
>
>
> In Step 0 and Step 1 the source and destination stripes overlap, so a
> backup is required. But at Step 2 you have a full stripe to work with
> safely, at Step 4 two stripes are safe, at Step 6 three stripes and at
> Step 7 four stripes. As you go, the safe region gets larger and larger,
> requiring fewer and fewer sync points.
>
> Ideally the raid reshape should read as much data from the source
> stripes as possible in one go and then write it all out in one
> go. Then rinse and repeat. For a simple implementation, why not do
> this (a rough sketch of the loop follows below):
>
> 1) read reshape-sync-size from /proc/sys, defaulting to 10% of RAM size
> 2) sync-size = min(reshape-sync-size, size of safe region)
> 3) setup internal mirror between old (read-write) and new stripes (write only)
> 4) read source blocks into stripe cache
> 5) compute new parity
> 6) put stripe into write cache
> 7) goto 3 until sync-size is reached
> 8) sync blocks to disk
> 9) record progress and remove internal mirror
> 10) goto 1
>
> Optionally, in step 9 you can skip recording the progress if the safe
> region is big enough for another read/write pass.
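
A very rough sketch of that loop (placeholder names throughout; this is
not the existing md reshape code, just the shape of the idea):

def reshape_loop(read_stripe, compute_parity, queue_write, flush_writes,
                 record_progress, reshape_sync_size, safe_region, stripe_size):
    """Batch reads and writes so each pass is one large linear read
    followed by one large linear write, bounded by the vacated gap."""
    while safe_region >= stripe_size:
        # steps 1-2: bound the pass by the tunable and the safe region
        sync_size = min(reshape_sync_size, safe_region)
        copied = 0
        while copied + stripe_size <= sync_size:
            data = read_stripe()                     # step 4: linear read
            queue_write(data, compute_parity(data))  # steps 5-6: new parity,
            copied += stripe_size                    #   queued in the cache
        flush_writes()                               # step 8: linear write burst
        safe_region = record_progress(copied)        # step 9: checkpoint and
                                                     #   report the new gap
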
>
> The important idea behind this would be that, given enough free RAM,
> large linear reads and large linear writes alternate. Also, since the
> normal cache is used instead of the static stripe cache, if there is
> not enough RAM then writes will be flushed out prematurely. This will
> degrade performance, but that is better than running out of memory.
>
> I have 4GB on my desktop, with at least 3GB free if I'm not doing
> anything expensive. A raid reshape should be able to do 3GB linear
> reads and writes alternately. But I would already be happy if it did
> 256MB. There is lots of opportunity for streaming. It might just be
> hard to get the kernel IO system to cooperate.
>
> MfG
>        Goswin
>

Skimming your message, I agree with the major points; however, you're
only considering the best-case scenario (which is probably how it
should run for performance).  There is also the worst-case scenario,
where a device, driver, OS, or even the power (supply, let's say)
fails mid-operation.

If the reshape doesn't create a gap (when it does, the gap obviously
keeps growing as the reshape proceeds), then it's still an 'in place'
operation, which I argue should still be done in the largest block
that fits in memory, but with the data backed up on a device.

Growing operations obviously have free space on the new device, and,
as the operation proceeds, there will be a growing gap between the
rewritten data and the old copy of the data.

Shrinking operations, counter-intuitively, also have a growing area of
free space, at the end of the device.  Working backwards, after a
given number of stripes the operation should be just as safe as a
normal grow, only in reverse.

In any of the three cases, the largest possible write window per
device should be used to take advantage of the usual gains in speed.
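
To make that concrete, here is a toy sketch (my own names, nothing
taken from md) of the write window each of the three cases would allow
per pass:

def write_window(case, gap_bytes, max_window, backup_size):
    """An in-place reshape never gains a vacated gap, so it stays bounded
    by its backup area; grow and shrink passes can use whatever gap
    currently separates the rewritten data from the still-live old copy,
    up to the configured maximum."""
    if case == "in_place":
        return min(max_window, backup_size)
    return min(max_window, gap_bytes)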
