On Tue, Nov 3, 2009 at 5:07 AM, Goswin von Brederlow <goswin-v-b@xxxxxx> wrote:
> "NeilBrown" <neilb@xxxxxxx> writes:
>
>> A reshape is a fundamentally slow operation.  Each block needs to
>> be read and then written somewhere else so there is little opportunity
>> for streaming.
>> An in-place reshape (i.e. the array doesn't get bigger or smaller) is
>> even slower as we have to take a backup copy of each range of blocks
>> before writing them back out.  This limits streaming even more.
>>
>> It is possible to get it faster than it is by increasing the
>> array's stripe_cache_size and also increasing the 'backup' size
>> that mdadm uses.  mdadm-3.1.1 will try to do better in this respect.
>> However it will still be significantly slower than e.g. a resync.
>>
>> So reshape will always be slow.  It is a completely different issue
>> to filesystem activity on a RAID array being slow.  Recent reports of
>> slowness are, I think, not directly related to md/raid.  It is either
>> the filesystem or the VM or a combination of the two that causes
>> these slowdowns.
>>
>> NeilBrown
>
> Now why is that?  Let's leave out the case of an in-place
> reshape.  Nothing can be done there to avoid making a backup of blocks,
> which severely limits the speed.
>
> But the most common case should be growing an array.  Let's look at the
> first few steps of a 3->4 disk raid5 reshape.  Each step denotes a point
> where a sync is required:
>
>   Step 0         Step 1         Step 2         Step 3         Step 4
>  A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D
> 00 01  p  x    00 01 02  p    00 01 02  p    00 01 02  p    00 01 02  p
> 02  p 03  x    02  p 03  x    03 04  p 05    03 04  p 05    03 04  p 05
>  p 04 05  x     p 04 05  x     x  x  x  x    06  p 07 08    06  p 07 08
> 06 07  p  x    06 07  p  x    06 07  p  x     x  x  x  x     p 09 10 11
> 08  p 09  x    08  p 09  x    08  p 09  x    08  p 09  x     x  x  x  x
>  p 10 11  x     p 10 11  x     p 10 11  x     p 10 11  x     x  x  x  x
> 12 13  p  x    12 13  p  x    12 13  p  x    12 13  p  x    12 13  p  x
> 14  p 15  x    14  p 15  x    14  p 15  x    14  p 15  x    14  p 15  x
>  p 16 17  x     p 16 17  x     p 16 17  x     p 16 17  x     p 16 17  x
> 18 19  p  x    18 19  p  x    18 19  p  x    18 19  p  x    18 19  p  x
> 20  p 21  x    20  p 21  x    20  p 21  x    20  p 21  x    20  p 21  x
>  p 22 23  x     p 22 23  x     p 22 23  x     p 22 23  x     p 22 23  x
> 24 25  p  x    24 25  p  x    24 25  p  x    24 25  p  x    24 25  p  x
> 26  p 27  x    26  p 27  x    26  p 27  x    26  p 27  x    26  p 27  x
>  p 28 29  x     p 28 29  x     p 28 29  x     p 28 29  x     p 28 29  x
>
>   Step 5         Step 6         Step 7         Step 8         Step 9
>  A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D     A  B  C  D
> 00 01 02  p    00 01 02  p    00 01 02  p    00 01 02  p    00 01 02  p
> 03 04  p 05    03 04  p 05    03 04  p 05    03 04  p 05    03 04  p 05
> 06  p 07 08    06  p 07 08    06  p 07 08    06  p 07 08    06  p 07 08
>  p 09 10 11     p 09 10 11     p 09 10 11     p 09 10 11     p 09 10 11
>  x  x  x  x    12 13 14  p    12 13 14  p    12 13 14  p    12 13 14  p
>  x  x  x  x    15 16  p 17    15 16  p 17    15 16  p 17    15 16  p 17
> 12 13  p  x     x  x  x  x    18  p 19 20    18  p 19 20    18  p 19 20
> 14  p 15  x     x  x  x  x     p 21 22 23     p 21 22 23     p 21 22 23
>  p 16 17  x     x  x  x  x    24 25 26  p    24 25 26  p    24 25 26  p
> 18 19  p  x    18 19  p  x     x  x  x  x    27 28  p 29    27 28  p 29
> 20  p 21  x    20  p 21  x     x  x  x  x    30  p 31 32    30  p 31 32
>  p 22 23  x     p 22 23  x     x  x  x  x     p 33 34 35     p 33 34 35
> 24 25  p  x    24 25  p  x     x  x  x  x    36 37 38  p    36 37 38  p
> 26  p 27  x    26  p 27  x    26  p 27  x     x  x  x  x    39 40  p 41
>  p 28 29  x     p 28 29  x     p 28 29  x     x  x  x  x    42  p 43 44
>
> In Step 0 and Step 1 the source and destination stripes overlap, so a
> backup is required.  But at Step 2 you have a full stripe to work with
> safely, at Step 4 two stripes are safe, at Step 6 three stripes and at
> Step 7 four stripes.  As you go, the safe region gets larger and larger,
> requiring fewer and fewer sync points.
>
> Ideally the raid reshape should read as much data from the source
> stripes as possible in one go and then write it all out in one
> go.  Then rinse and repeat.  For a simple implementation, why not do
> this:
>
>  1) read reshape-sync-size from proc/sys, default to 10% of RAM size
>  2) sync-size = min(reshape-sync-size, size of safe region)
>  3) set up internal mirror between old (read-write) and new stripes (write only)
>  4) read source blocks into stripe cache
>  5) compute new parity
>  6) put stripe into write cache
>  7) goto 3 until sync-size is reached
>  8) sync blocks to disk
>  9) record progress and remove internal mirror
> 10) goto 1
>
> Optionally, in 9 you can skip recording the progress if the safe region
> is big enough for another read/write pass.
>
> The important idea behind this is that, given enough free RAM, a large
> linear read and a large linear write alternate.  Also, since the normal
> cache is used instead of the static stripe cache, writes will be flushed
> out prematurely if there is not enough RAM.  That degrades performance,
> but it is better than running out of memory.
>
> I have 4GB on my desktop, with at least 3GB free if I'm not doing
> anything expensive.  A raid reshape should be able to alternate 3GB of
> linear reads and writes.  But I would already be happy if it did 256MB.
> There is lots of opportunity for streaming.  It might just be hard to
> get the kernel IO system to cooperate.
>
> MfG
>         Goswin

Skimming your message, I agree with the major points; however, you're only
considering the best-case scenario (which is probably how it should run for
performance).  There is also the worst case, where a device, driver, OS, or
even power (supply, let's say) fails mid-operation.

If the reshape hasn't yet opened up a gap (obviously the gap keeps growing
the further the reshape proceeds), then it is still effectively an 'in-place'
operation, which I argue should be done in the largest block that fits in
memory, but with the data backed up on a device.

Growing operations obviously have free space on the new device, and as the
operation proceeds there is a growing gap between the re-written data and the
old copy of the data.  Shrinking operations, counter-intuitively, also have a
growing area of free space, at the end of the device: working backwards,
after a given number of stripes the operation should be just as safe as a
normal grow, only in reverse.  In any of the three cases, the largest
possible write window per device should be used to take advantage of the
usual gains in speed.
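
To put rough numbers on that growing window for the simple grow case, here is
a small back-of-the-envelope sketch.  It is plain userspace C, not md code;
the names (old_disks, new_disks, safe_stripes) and the formula are just read
off Goswin's 3->4 table above, so treat it as an illustration of the
arithmetic rather than of what the kernel actually does:

/*
 * safe_window.c - toy model of the grow case in the 3->4 table above.
 *
 * With 'done' destination stripes completed, data blocks
 * 0 .. done*(new_disks-1)-1 have been relocated.  The first old stripe
 * that still holds unrelocated data is done*(new_disks-1)/(old_disks-1),
 * and the gap between it and the write frontier is the number of stripes
 * that can be reshaped before the next sync point without a backup.
 */
#include <stdio.h>

static long safe_stripes(long done, int old_disks, int new_disks)
{
        long relocated = done * (new_disks - 1);         /* data blocks copied      */
        long first_unread = relocated / (old_disks - 1); /* old stripe still needed */
        return first_unread - done;                      /* gap = writable stripes  */
}

int main(void)
{
        const int old_disks = 3, new_disks = 4;  /* the 3->4 raid5 grow above   */
        long done = 0;                           /* destination stripes written */
        int sync;

        for (sync = 0; sync < 10; sync++) {
                long safe = safe_stripes(done, old_disks, new_disks);

                if (safe < 1) {
                        /* source and destination still overlap: back up one stripe */
                        printf("sync %2d: %3ld done, overlap -> backup 1 stripe\n",
                               sync, done);
                        done += 1;
                } else {
                        printf("sync %2d: %3ld done, safe window %ld stripe(s)\n",
                               sync, done, safe);
                        done += safe;
                }
        }
        return 0;
}

Compiled and run, that prints a safe window of 1, 1, 2, 3, 4, 6, 9, 14, ...
stripes at successive sync points, which lines up with the 1/2/3/4 stripes
Goswin counted in his table.  The window grows by roughly half per pass (the
ratio of new to old data disks), so the cost of the sync points should fade
quickly once a grow is a little way in; it is only the first handful of
stripes, and the in-place case, that are stuck with per-stripe backups.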