"NeilBrown" <neilb@xxxxxxx> writes: > A reshape is a fundamentally slow operation. Each block needs to > be read and then written somewhere else so there is little opportunity > for streaming. > An in-place reshape (i.e the array doesn't get bigger or smaller) is > even slower as we have to take a backup copy of each range of blocks > before writing them back out. This limits streaming even more. > > It is possible to get it fast than it is by increasing the > array's stripe_cache_size and also increasing the 'backup' size > that mdadm uses. mdadm-3.1.1 will try to do better in this respect. > However it will still be significantly slower than e.g. a resync. > > So reshape will always be slow. It is a completely different issue > to filesystem activity on a RAID array being slow. Recent reports of > slowness are, I think, not directly related to md/raid. It is either > the filesystem or the VM or a combination of the two that causes > these slowdowns. > > > NeilBrown Now why is that? Lets leave out the case of an in-place reshape. Nothing can be done to avoid making a backup of blocks there, which severly limits the speed. But the most common case should be growing an array. Lets look at the first few steps or 3->4 disk raid5 reshape. 
Each step denotes a point where a sync is required:

  Step 0        Step 1        Step 2        Step 3        Step 4
 A  B  C  D    A  B  C  D    A  B  C  D    A  B  C  D    A  B  C  D
00 01  p  x   00 01 02  p   00 01 02  p   00 01 02  p   00 01 02  p
02  p 03  x   02  p 03  x   03 04  p 05   03 04  p 05   03 04  p 05
 p 04 05  x    p 04 05  x    x  x  x  x   06  p 07 08   06  p 07 08
06 07  p  x   06 07  p  x   06 07  p  x    x  x  x  x    p 09 10 11
08  p 09  x   08  p 09  x   08  p 09  x   08  p 09  x    x  x  x  x
 p 10 11  x    p 10 11  x    p 10 11  x    p 10 11  x    x  x  x  x
12 13  p  x   12 13  p  x   12 13  p  x   12 13  p  x   12 13  p  x
14  p 15  x   14  p 15  x   14  p 15  x   14  p 15  x   14  p 15  x
 p 16 17  x    p 16 17  x    p 16 17  x    p 16 17  x    p 16 17  x
18 19  p  x   18 19  p  x   18 19  p  x   18 19  p  x   18 19  p  x
20  p 21  x   20  p 21  x   20  p 21  x   20  p 21  x   20  p 21  x
 p 22 23  x    p 22 23  x    p 22 23  x    p 22 23  x    p 22 23  x
24 25  p  x   24 25  p  x   24 25  p  x   24 25  p  x   24 25  p  x
26  p 27  x   26  p 27  x   26  p 27  x   26  p 27  x   26  p 27  x
 p 28 29  x    p 28 29  x    p 28 29  x    p 28 29  x    p 28 29  x

  Step 5        Step 6        Step 7        Step 8        Step 9
 A  B  C  D    A  B  C  D    A  B  C  D    A  B  C  D    A  B  C  D
00 01 02  p   00 01 02  p   00 01 02  p   00 01 02  p   00 01 02  p
03 04  p 05   03 04  p 05   03 04  p 05   03 04  p 05   03 04  p 05
06  p 07 08   06  p 07 08   06  p 07 08   06  p 07 08   06  p 07 08
 p 09 10 11    p 09 10 11    p 09 10 11    p 09 10 11    p 09 10 11
 x  x  x  x   12 13 14  p   12 13 14  p   12 13 14  p   12 13 14  p
 x  x  x  x   15 16  p 17   15 16  p 17   15 16  p 17   15 16  p 17
12 13  p  x    x  x  x  x   18  p 19 20   18  p 19 20   18  p 19 20
14  p 15  x    x  x  x  x    p 21 22 23    p 21 22 23    p 21 22 23
 p 16 17  x    x  x  x  x   24 25 26  p   24 25 26  p   24 25 26  p
18 19  p  x   18 19  p  x    x  x  x  x   27 28  p 29   27 28  p 29
20  p 21  x   20  p 21  x    x  x  x  x   30  p 31 32   30  p 31 32
 p 22 23  x    p 22 23  x    x  x  x  x    p 33 34 35    p 33 34 35
24 25  p  x   24 25  p  x    x  x  x  x   36 37 38  p   36 37 38  p
26  p 27  x   26  p 27  x   26  p 27  x    x  x  x  x   39 40  p 41
 p 28 29  x    p 28 29  x    p 28 29  x    x  x  x  x   42  p 43 44

In Step 0 and Step 1 the source and destination stripes overlap, so a
backup is required. But at Step 2 you have a full stripe to work with
safely; at Step 4 two stripes are safe, at Step 6 three stripes and at
Step 7 four stripes.
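The growth of the safe region follows from simple arithmetic: every new
4-disk stripe consumes 3 data blocks, while every old 3-disk stripe only
held 2, so the source pointer pulls ahead of the destination pointer.
A toy calculation (my own sketch, not kernel code; names are made up):

```python
# Model of how the safe region grows during a 3-disk -> 4-disk RAID5 grow.
# Old stripes hold 2 data blocks, new stripes hold 3.

OLD_DATA, NEW_DATA = 2, 3  # data blocks per stripe before/after the grow

def safe_stripes(written):
    """Stripes that can be written next without touching unread old data.

    After `written` new stripes, written * NEW_DATA blocks have been
    consumed from the old layout, fully freeing
    (written * NEW_DATA) // OLD_DATA old stripes; the difference between
    freed source stripes and occupied destination stripes is the gap.
    """
    freed = (written * NEW_DATA) // OLD_DATA
    return freed - written

for w in range(10):
    print(w, safe_stripes(w))
```

With floor division this gives a gap of 1 stripe once 2 new stripes are
written, 2 after 4, 3 after 6 and 4 after 8 -- matching the widening
regions of x in the diagram above.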
As you go, the safe region gets larger and larger, requiring fewer and
fewer sync points. Ideally the RAID reshape should read as much data
from the source stripes as possible in one go and then write it all out
in one go. Then rinse and repeat.

For a simple implementation, why not do this:

 1) read reshape-sync-size from /proc/sys, default to 10% of RAM size
 2) sync-size = min(reshape-sync-size, size of safe region)
 3) set up an internal mirror between the old stripes (read-write) and
    the new stripes (write-only)
 4) read source blocks into the stripe cache
 5) compute the new parity
 6) put the stripe into the write cache
 7) goto 3 until sync-size is reached
 8) sync blocks to disk
 9) record progress and remove the internal mirror
10) goto 1

Optionally, in 9 you can skip recording the progress if the safe region
is big enough for another read/write pass.

The important idea behind this is that, given enough free RAM, a large
linear read and a large linear write alternate. Also, since the normal
cache is used instead of the static stripe cache, if there is not
enough RAM then writes will be flushed out prematurely. This will
degrade performance, but that is better than running out of memory.

I have 4GB on my desktop, with at least 3GB free if I'm not doing
anything expensive. A RAID reshape should be able to do a 3GB linear
read and write alternately. But I would already be happy if it did
256MB. There is lots of opportunity for streaming. It might just be
hard to get the kernel IO system to cooperate.

MfG
        Goswin
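The alternating read/write scheme proposed above can be sketched as a
toy simulation. This is my own illustration, not mdadm or kernel code:
per-pass work is bounded by the current safe region and a tunable
(RESHAPE_SYNC_SIZE here stands in for the proposed /proc knob), and
parity and the internal mirror are left out for brevity:

```python
# Sketch of the proposed reshape loop: read as much as the safe region
# allows in one linear pass, write it all out, then record progress.
# Passes where the safe region is 0 are the ones that need a backup.

OLD_DATA, NEW_DATA = 2, 3     # data blocks per stripe, 3->4 disk grow
RESHAPE_SYNC_SIZE = 4         # stand-in for the proposed /proc tunable

def reshape(blocks):
    """Relocate `blocks` from OLD_DATA- to NEW_DATA-wide stripes in passes."""
    written = 0                             # new stripes already on disk
    total = -(-len(blocks) // NEW_DATA)     # new stripes needed, rounded up
    passes = []
    while written < total:
        freed = (written * NEW_DATA) // OLD_DATA
        safe = freed - written              # gap between source and dest
        step = max(1, min(RESHAPE_SYNC_SIZE, safe, total - written))
        # one large linear read ... one large linear write ... sync point
        chunk = blocks[written * NEW_DATA : (written + step) * NEW_DATA]
        passes.append(len(chunk))           # blocks moved in this pass
        written += step                     # record progress (checkpoint)
    return passes

print(reshape(list(range(45))))   # 15 new stripes of 3 data blocks each
```

For 45 data blocks this does eight passes whose sizes grow as the safe
region widens, instead of one sync per stripe throughout.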