Re: unbelievably bad performance: 2.6.27.37 and raid6

Michael Evans <mjevans1983@xxxxxxxxx> writes:

> On Tue, Nov 3, 2009 at 5:07 AM, Goswin von Brederlow <goswin-v-b@xxxxxx> wrote:
>> "NeilBrown" <neilb@xxxxxxx> writes:
>>
>>> A reshape is a fundamentally slow operation.  Each block needs to
>>> be read and then written somewhere else so there is little opportunity
>>> for streaming.
>>> An in-place reshape (i.e. the array doesn't get bigger or smaller) is
>>> even slower as we have to take a backup copy of each range of blocks
>>> before writing them back out.  This limits streaming even more.
>>>
>>> It is possible to make it faster than it currently is by increasing the
>>> array's stripe_cache_size and also increasing the 'backup' size
>>> that mdadm uses.  mdadm-3.1.1 will try to do better in this respect.
>>> However it will still be significantly slower than e.g. a resync.
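
For reference, the stripe_cache_size knob is a sysfs attribute of raid4/5/6
arrays and can simply be written to. A minimal sketch of raising it; the
array name "md0" and the value 8192 below are just examples, and the unit is
cache entries per device, so memory use is roughly the value times the page
size times the number of disks:

MD_DEVICE = "md0"    # example array name
CACHE_SIZE = 8192    # stripe cache entries per device; the default is 256

path = f"/sys/block/{MD_DEVICE}/md/stripe_cache_size"
with open(path, "w") as f:    # needs root; present only for raid4/5/6 arrays
    f.write(str(CACHE_SIZE))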
>>>
>>> So reshape will always be slow.  It is a completely different issue
>>> to filesystem activity on a RAID array being slow.  Recent reports of
>>> slowness are, I think, not directly related to md/raid.  It is either
>>> the filesystem or the VM or a combination of the two that causes
>>> these slowdowns.
>>>
>>>
>>> NeilBrown
>>
>> Now why is that? Let's leave out the case of an in-place
>> reshape. Nothing can be done to avoid making a backup of blocks
>> there, which severely limits the speed.
>>
>> But the most common case should be growing an array. Let's look at the
>> first few steps of a 3->4 disk raid5 reshape. Each step denotes a point
>> where a sync is required:
>>
>> Step 0         Step 1         Step 2         Step 3         Step 4
>>  A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D
>> 00 01  p  x    00 01 02  p    00 01 02  p    00 01 02  p    00 01 02  p
>> 02  p 03  x    02  p 03  x    03 04  p 05    03 04  p 05    03 04  p 05
>>  p 04 05  x     p 04 05  x     x  x  x  x    06  p 07 08    06  p 07 08
>> 06 07  p  x    06 07  p  x    06 07  p  x     x  x  x  x     p 09 10 11
>> 08  p 09  x    08  p 09  x    08  p 09  x    08  p 09  x     x  x  x  x
>>  p 10 11  x     p 10 11  x     p 10 11  x     p 10 11  x     x  x  x  x
>> 12 13  p  x    12 13  p  x    12 13  p  x    12 13  p  x    12 13  p  x
>> 14  p 15  x    14  p 15  x    14  p 15  x    14  p 15  x    14  p 15  x
>>  p 16 17  x     p 16 17  x     p 16 17  x     p 16 17  x     p 16 17  x
>> 18 19  p  x    18 19  p  x    18 19  p  x    18 19  p  x    18 19  p  x
>> 20  p 21  x    20  p 21  x    20  p 21  x    20  p 21  x    20  p 21  x
>>  p 22 23  x     p 22 23  x     p 22 23  x     p 22 23  x     p 22 23  x
>> 24 25  p  x    24 25  p  x    24 25  p  x    24 25  p  x    24 25  p  x
>> 26  p 27  x    26  p 27  x    26  p 27  x    26  p 27  x    26  p 27  x
>>  p 28 29  x     p 28 29  x     p 28 29  x     p 28 29  x     p 28 29  x
>>
>> Step 5         Step 6         Step 7         Step 8         Step 9
>>  A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D
>> 00 01 02  p    00 01 02  p    00 01 02  p    00 01 02  p    00 01 02  p
>> 03 04  p 05    03 04  p 05    03 04  p 05    03 04  p 05    03 04  p 05
>> 06  p 07 08    06  p 07 08    06  p 07 08    06  p 07 08    06  p 07 08
>>  p 09 10 11     p 09 10 11     p 09 10 11     p 09 10 11     p 09 10 11
>>  x  x  x  x    12 13 14  p    12 13 14  p    12 13 14  p    12 13 14  p
>>  x  x  x  x    15 16  p 17    15 16  p 17    15 16  p 17    15 16  p 17
>> 12 13  p  x     x  x  x  x    18  p 19 20    18  p 19 20    18  p 19 20
>> 14  p 15  x     x  x  x  x     p 21 22 23     p 21 22 23     p 21 22 23
>>  p 16 17  x     x  x  x  x    24 25 26  p    24 25 26  p    24 25 26  p
>> 18 19  p  x    18 19  p  x     x  x  x  x    27 28  p 29    27 28  p 29
>> 20  p 21  x    20  p 21  x     x  x  x  x    30  p 31 32    30  p 31 32
>>  p 22 23  x     p 22 23  x     x  x  x  x     p 33 34 35     p 33 34 35
>> 24 25  p  x    24 25  p  x     x  x  x  x    36 37 38  p    36 37 38  p
>> 26  p 27  x    26  p 27  x    26  p 27  x     x  x  x  x    39 40  p 41
>>  p 28 29  x     p 28 29  x     p 28 29  x     x  x  x  x    42  p 43 44
>>
>>
>> In Step 0 and Step 1 the source and destination stripes overlap so a
>> backup is required. But at Step 2 you have a full stripe to work with
>> safely; at Step 4 two stripes are safe, at Step 6 three stripes, and at
>> Step 7 four stripes. As you go, the safe region gets larger and larger,
>> requiring fewer and fewer sync points.
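
For reference, the size of that safe region is easy to compute. A rough
sketch, assuming the 3->4 disk grow above with one chunk per cell:

OLD_DATA = 2   # data chunks per stripe in the old 3-disk layout
NEW_DATA = 3   # data chunks per stripe in the new 4-disk layout

def safe_stripes(written):
    # After `written` new-layout stripes, written * NEW_DATA chunks have
    # been read, so (written * NEW_DATA) // OLD_DATA old stripes are fully
    # consumed; everything between the read and write frontiers is free.
    consumed = (written * NEW_DATA) // OLD_DATA
    return consumed - written

for w in (2, 4, 6, 9):
    print(f"{w} new stripes written -> {safe_stripes(w)} stripe(s) safe")
# prints 1, 2, 3 and 4 safe stripes, matching the columns above
# (the Step 7 column has 9 new-layout stripes already written)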
>>
>> Ideally the raid reshape should read as much data from the source
>> stripes as possible in one go and then write it all out in one
>> go. Then rinse and repeat. For a simple implementation, why not do
>> this (a rough sketch in code follows below):
>>
>> 1) read reshape-sync-size from proc/sys, default to 10% ram size
>> 2) sync-size = min(reshape-sync-size, size of safe region)
>> 3) setup internal mirror between old (read-write) and new stripes (write only)
>> 4) read source blocks into stripe cache
>> 5) compute new parity
>> 6) put stripe into write cache
>> 7) goto 3 until sync-size is reached
>> 8) sync blocks to disk
>> 9) record progress and remove internal mirror
>> 10) goto 1
>>
>> Optionally in 9 you can skip recording the progress if the safe region
>> is big enough for another read/write pass.
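
A rough, self-contained sketch of that loop, not md code; the helper
functions are hypothetical stand-ins that only print what a real
implementation would back up, read, write and checkpoint:

OLD_DATA = 2         # data chunks per stripe before the grow (3-disk raid5)
NEW_DATA = 3         # data chunks per stripe after the grow (4-disk raid5)
TOTAL_CHUNKS = 300   # total data chunks on a toy array
MAX_BATCH = 8        # "reshape-sync-size" in new-layout stripes

def safe_stripes(written):
    # old stripes fully read minus new stripes already written
    return (written * NEW_DATA) // OLD_DATA - written

def backup(stripe):               # hypothetical: copy to a backup area/file
    print(f"  backup old data around stripe {stripe}")

def reshape_batch(start, count):  # hypothetical: read, re-stripe, write
    print(f"  read/recompute/write stripes {start}..{start + count - 1}")

def checkpoint(written):          # hypothetical: record progress on disk
    print(f"  checkpoint at {written} new stripes")

total_new = -(-TOTAL_CHUNKS // NEW_DATA)   # ceiling division
written = 0
while written < total_new:
    safe = safe_stripes(written)
    if safe == 0:
        backup(written)           # steps 0/1 above: source and target overlap
        batch = 1
    else:
        batch = min(MAX_BATCH, safe, total_new - written)
    reshape_batch(written, batch) # large linear read, then large linear write
    written += batch
    checkpoint(written)           # could be skipped while the region stays big

The batch size ramps up just as in the tables: a couple of backed-up stripes,
a few single-stripe passes, then ever larger linear passes bounded only by
the configured reshape-sync-size.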
>>
>> The important idea behind this would be that, given enough free ram,
>> there is a large linear read and large linear write alternating. Also,
>> since the normal cache is used instead of the static stripe cache, if
>> there is not enough ram then writes will be flushed out prematurely.
>> This will lead to a degradation of performance but that is better than
>> running out of memory.
>>
>> I have 4GB on my desktop with at least 3GB free if I'm not doing
>> anything expensive. A raid reshape should be able to alternate 3GB
>> linear reads and writes. But I would already be happy if it did
>> 256MB. There is lots of opportunity for streaming. It might just be
>> hard to get the kernel IO system to cooperate.
>>
>> MfG
>>        Goswin
>>
>
> Skimming your message I agree with the major points; however, you're
> only considering the best-case scenario (which is how it probably
> should run for performance).  There is also the worst-case scenario
> where a device, driver, OS, or even the power supply, say, fails
> mid-operation.
>
> If there isn't a gap created by the reshape (obviously it would
> continue to grow as the reshape proceeds) then it's still an 'in
> place' operation (which I argue should be done in the largest block
> possible within memory, but with data backed up on a device).

That is considered:
2) sync-size = min(reshape-sync-size, size of safe region)

At first the safe region is 0 and you need to back up some data. Then
the safe region is one stripe and things will go slowly. But as you
can see above, and as you say, the region quickly grows. I think the
region grows quickly enough that only a minimum of data needs to be
backed up, followed by a few slow iterations. It gets faster quickly
enough. But yeah, you can back up more at the start to get a larger
initial safe region.
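
A back-of-the-envelope check; the 64 KiB chunk size and the 256-stripe
batch limit below are assumptions, not figures from this thread:

CHUNK = 64 * 1024   # bytes per chunk (assumed)
MAX_BATCH = 256     # desired batch size in new-layout stripes (assumed)

# For the 3->4 disk grow the safe region after w written stripes is w // 2,
# so full-sized batches become possible once w reaches 2 * MAX_BATCH.
w = 2 * MAX_BATCH
data = w * 3 * CHUNK    # three data chunks per new-layout stripe
print(f"{data / 2**20:.0f} MiB reshaped before batches reach {MAX_BATCH} stripes")
# -> 96 MiB, i.e. the slow ramp-up covers only a tiny fraction of the array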

> Growing operations obviously have free space on the new device, and
> further as the operation proceeds there will be a growing gap between
> the re-written data and the old copy of the data.
>
> Shrinking operations, counter-intuitively, also have a growing area of
> free space at the end of the device.  Working backwards, after a
> given number of stripes, the operation should be just as safe, if in
> reverse, as a normal grow.
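
A rough sketch of that point, again not md code: for a 4->3 disk shrink run
from the end of the array backwards, the writable gap starts out as the
whole free tail and only closes near the start of the array, so the backups
are needed at the end of the operation instead of the beginning (this
assumes the array was already shrunk to fit the new layout):

from math import ceil

OLD_DATA = 3                    # data chunks per stripe before (4-disk raid5)
NEW_DATA = 2                    # data chunks per stripe after (3-disk raid5)
NEW_ROWS = 30                   # stripes in the new layout (toy size)
CHUNKS = NEW_ROWS * NEW_DATA    # total data chunks

def safe_rows(done):
    # `done` new-layout stripes already written, counted from the end
    next_write = NEW_ROWS - done                          # next row to write
    unread = ceil((CHUNKS - done * NEW_DATA) / OLD_DATA)  # old rows still unread
    return next_write - unread

for w in (0, 10, 20, 27, 29):
    print(f"{w} stripes done -> {safe_rows(w)} row(s) writable without backup")
# prints 10, 6, 3, 1, 0: the mirror image of the grow above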
>
> In any of the three cases, the largest possible write window per
> device should be used to take advantage of the usual gains in speed.

MfG
        Goswin
