----- Message from pg_lxra@xxxxxxxxxxxxxxxxxxx ---------
    Date: Mon, 25 Feb 2008 00:10:07 +0000
    From: Peter Grandi <pg_lxra@xxxxxxxxxxxxxxxxxxx>
Reply-To: Peter Grandi <pg_lxra@xxxxxxxxxxxxxxxxxxx>
 Subject: Re: RAID5 to RAID6 reshape?
      To: Linux RAID <linux-raid@xxxxxxxxxxxxxxx>

On Sat, 23 Feb 2008 21:40:08 +0100, Nagilum <nagilum@xxxxxxxxxxx> said:

[ ... ]

* Doing unaligned writes on a 13+1 or 12+2 is catastrophically
  slow because of the RMW cycle. This is of course independent
  of how one got to something like a 13+1 or a 12+2.

nagilum> Changing a single byte in a 2+1 raid5 or a 13+1 raid5
nagilum> requires exactly two 512byte blocks to be read and
nagilum> written from two different disks. Changing two bytes
nagilum> which are unaligned (the last and first byte of two
nagilum> consecutive stripes) doubles those figures, but more
nagilum> disks are involved.

Here you are using the astute misdirection of talking about unaligned *byte* *updates* when the issue is unaligned *stripe* *writes*.

Which are (imho) much less likely to occur than minor changes in a block. (think touch, mv, chown, chmod, etc.)
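
As an illustration of the single-byte update case, here is a minimal Python sketch of the RAID5 read-modify-write parity update. It only shows the XOR arithmetic on in-memory byte strings; the names `rmw_update` and `change_one_byte` and the 512-byte block size are illustrative assumptions, not anything taken from the md driver.

```python
import os

BLOCK = 512  # bytes per block, as in the quoted example (assumption for the sketch)


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(data_blocks) -> bytes:
    """Parity computed from scratch: XOR of every data block in the stripe."""
    parity = bytes(BLOCK)
    for blk in data_blocks:
        parity = xor_blocks(parity, blk)
    return parity


def rmw_update(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Read-modify-write: derive the new parity from the old data block and old parity only."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


def change_one_byte(n_data_disks: int):
    """Flip one byte in one block of a stripe; return (blocks read, blocks written, parity_ok)."""
    stripe = [os.urandom(BLOCK) for _ in range(n_data_disks)]
    parity = full_parity(stripe)
    target = 0
    new_block = bytes([stripe[target][0] ^ 0xFF]) + stripe[target][1:]
    # Two blocks are read (old data, old parity) ...
    new_parity = rmw_update(stripe[target], parity, new_block)
    # ... and two blocks are written (new data, new parity).
    stripe[target] = new_block
    return 2, 2, new_parity == full_parity(stripe)


for width in (2, 13):
    print(width, change_one_byte(width))
```

In this model the cost of a one-block update is two blocks read and two written whether the stripe has 2 or 13 data disks, which is the point made in the quoted text; it says nothing about multi-block or full-stripe writes.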
If one used your scheme to write a 13+1 stripe one block at a time, it would take 26R+26W operations (about half of which could be cached) instead of the 14W which are required when doing aligned stripe writes, which is what good file systems try to achieve.

....

But enough of talking about absurd cases, let's do a good clear example of why a 13+1 is bad bad bad when doing unaligned writes.

Consider writing to a 2+1 and a 13+1 just 15 blocks in 4+4+4+3 bunches, starting with block 0 (so aligned start, unaligned bunch length, unaligned total length), a random case but quite illustrative:

  2+1:
        00 01 P1 03 04 P2 06 07 P3 09 10 P4
        00 01    02 03    04 05    06 07
        ------**-------** ------**-------**
        12 13 P5 15 16 P6 18 19 P7 21 22 P8
        08 09    10 11    12 13    14
        ------**-------** ------**---    **

        write D00 D01 DP1
        write D03 D04 DP2
        write D06 D07 DP3
        write D09 D10 DP4
        write D12 D13 DP5
        write D15 D16 DP6
        write D18 D19 DP7
        read  D21 DP8
        write D21 DP8

        Total:
          IOP: 01 reads, 08 writes
          BLK: 02 reads, 23 writes
          XOR: 28 reads, 15 writes

  13+1:
        00 01 02 03 04 05 06 07 08 09 10 11 12 P1
        00 01 02 03 04 05 06 07 08 09 10 11 12
        ----------- ----------- ----------- -- **
        14 15 16 17 18 19 20 21 22 23 24 25 26 P2
        13 14
        -----                                  **

        read  D00 D01 D02 D03 DP1
        write D00 D01 D02 D03 DP1
        read  D04 D05 D06 D07 DP1
        write D04 D05 D06 D07 DP1
        read  D08 D09 D10 D11 DP1
        write D08 D09 D10 D11 DP1
        read  D12 DP1 D14 D15 DP2
        write D12 DP1 D14 D15 DP2

        Total:
          IOP: 04 reads, 04 writes
          BLK: 20 reads, 20 writes
          XOR: 34 reads, 10 writes

and now the same with cache:

        write D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 D13 DP1
        read  D14 D15 DP2
        write D14 D15 DP2

        Total:
          IOP: 01 reads, 02 writes
          BLK: 03 reads, 18 writes
          XOR: not sure what you're calculating here, but it's mostly irrelevant anyway, even my old Athlon 500MHz can XOR >2.6GB/s iirc.

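
The uncached IOP and BLK totals quoted above for the 2+1 and 13+1 cases can be reproduced with a short Python sketch. The function name `count_io` and the counting convention are assumptions chosen to match the quoted figures: a stripe fully covered by a bunch costs one write of all its blocks, and all partially covered stripes of one bunch are served by a single read and a single write. The XOR figures are not modelled.

```python
def count_io(n_data: int, bunches, start: int = 0):
    """Count IOPs and blocks for RAID5 writes issued in bunches, without a write cache.

    Assumed convention (hypothetical, picked to match the quoted totals):
      - a stripe fully covered by a bunch is one write of n_data + 1 blocks;
      - all partially covered stripes of a bunch share one read and one write,
        each moving the touched data blocks plus one parity block per stripe.
    """
    iop_r = iop_w = blk_r = blk_w = 0
    pos = start
    for length in bunches:
        end = pos + length
        partial_blocks = 0
        stripe = pos // n_data
        while stripe * n_data < end:
            lo = max(pos, stripe * n_data)
            hi = min(end, (stripe + 1) * n_data)
            if hi - lo == n_data:
                # Full-stripe write: no read needed, data plus parity written.
                iop_w += 1
                blk_w += n_data + 1
            else:
                # Partial stripe: touched data blocks plus this stripe's parity.
                partial_blocks += (hi - lo) + 1
            stripe += 1
        if partial_blocks:
            # One read-modify-write cycle for the partial part of the bunch.
            iop_r += 1
            iop_w += 1
            blk_r += partial_blocks
            blk_w += partial_blocks
        pos = end
    return {"iop": (iop_r, iop_w), "blk": (blk_r, blk_w)}


for n in (2, 13):
    print(f"{n}+1:", count_io(n, [4, 4, 4, 3]))
```

Each tuple is (reads, writes). Changing the bunch list or the number of data disks shows how quickly the read-modify-write share grows once bunches stop covering whole stripes.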
The short stripe size means that one does not need to RMW in many cases, just W; and this despite the much higher redundancy of 2+1. It also means that there are lots of parity blocks to compute and write. With a 4 block operation length a 3+1 or even more a 4+1 would be flattered here, but I wanted to exemplify two extremes.
With a write cache the picture looks a bit better. If the writes happen close enough together in time they will be joined. If they are further apart, chances are the write speed is not that critical anyway.
The narrow parallelism, and thus short stripe length, of 2+1 means that a lot fewer blocks get transferred because there is almost no RMW, but it does 9 IOPs and the 13+1 does one less at 8 (wider parallelism); but then the 2+1 IOPs are mostly back-to-back write pairs, while the 13+1 ones are read-rewrite pairs, which is a significant disadvantage (often greatly underestimated). Never mind that the number of IOPs is almost the same despite the large difference in width, and that with the same disks as a 13+1 one can build something like four 2+1/3+1 arrays, thus gaining a lot of parallelism across threads, if there is such to be obtained. And if one really wants to write long stripes, one should use RAID10 of course, not long stripes with a single (or two) parity blocks.
Never mind that the chances of finding in the IO request stream a set of back-to-back logical writes to 13 contiguous blocks, aligned to start on a 13 block multiple, are bound to be lower than those of getting a set of 2 or 3 blocks, and even worse with a filesystem mostly built for the wrong stripe alignment.
I have yet to be convinced this difference is that significant. I think most changes are updates of file attributes (e.g. atime). File reads will perform better when spread over more disks. File writes usually write the whole file, so it directly depends on your file sizes, most of which are usually <1k. If this is for a digital attic the media files will be in the many MB range. Both are equally good or bad for the described scenarios.
The advantage is limited to a certain window of file writes. The size of that window depends on the number of disks just as much as it depends on the chunk size. Depending on the individual usage scenario one or the other window is better suited.
* Unfortunately the frequency of unaligned writes *does* usually depend on how dementedly one got to the 13+1 or 12+2 case: because a filesystem that lays out files so that misalignment is minimised with a 2+1 stripe just about guarantees that when one switches to a 3+1 stripe all previously written data is misaligned, and so on -- and never mind that every time one adds a disk a reshape is done that shuffles stuff around.

In general large chunksizes are not such a brilliant idea, even if ill-considered benchmarks may show some small advantage with somewhat larger chunksizes.
Yeah.
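
To put a rough number on the misalignment point quoted above, here is a small Python sketch. It is a simplified model under stated assumptions: extents are identified only by their starting logical block, the filesystem placed every extent on a multiple of the old stripe width, and `aligned_fraction` is a hypothetical helper, not a filesystem or md interface.

```python
def aligned_fraction(old_width: int, new_width: int, n_extents: int = 600) -> float:
    """Fraction of extents that started on an old stripe boundary and still
    start on a stripe boundary after the array is reshaped to new_width data disks."""
    starts = [i * old_width for i in range(n_extents)]
    still_aligned = [s for s in starts if s % new_width == 0]
    return len(still_aligned) / len(starts)


# Going from 2+1 to 3+1: only starts that are multiples of 6 stay aligned.
print(aligned_fraction(2, 3))
# Going from 2+1 to 4+1: every second extent start stays aligned.
print(aligned_fraction(2, 4))
```

The sketch looks at start alignment only. An extent sized to fill exactly one old stripe is also too short to fill a wider new stripe, so in this model writes to it go through read-modify-write even when its start happens to remain aligned.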
My general conclusion is that reshapes are a risky, bad for performance, expensive operation that is available, like RAID5 in general (and especially RAID5 above 2+1 or in a pinch 3+1) only for special cases when one cannot do otherwise and knows exactly what the downside is (which seems somewhat rare).
Agreed, but performance is still acceptable albeit not optimal.
I think that defending the concept of growing a 2+1 into a 13+1 via as many as 11 successive reshapes is quite ridiculous, even more so when using fatuous arguments about 1 or 2 byte updates.
I don't know why you don't like the example. How many bytes change for an atime update?
It is even worse than coming up with that idea itself, which is itself worse than that of building a 13+1 to start with.
The advantage is economic. One buys a few disks now and continues to stack up over the course of the years as storage needs increase.
But I wouldn't voluntarily do a raid5 with more than 8 disks either.
Kind regards,

----- End message from pg_lxra@xxxxxxxxxxxxxxxxxxx -----

========================================================================
# _ __ _ __ http://www.nagilum.org/ \n icq://69646724 #
# / |/ /__ ____ _(_) /_ ____ _ nagilum@xxxxxxxxxxx \n +491776461165 #
# / / _ `/ _ `/ / / // / ' \ Amiga (68k/PPC): AOS/NetBSD/Linux #
# /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/ Mac (PPC): MacOS-X / NetBSD /Linux #
# /___/ x86: FreeBSD/Linux/Solaris/Win2k ARM9: EPOC EV6 #
========================================================================
----------------------------------------------------------------
cakebox.homeunix.net - all the machine one needs..