Re: RAID5 to RAID6 reshape?

----- Message from pg_lxra@xxxxxxxxxxxxxxxxxxx ---------
    Date: Mon, 25 Feb 2008 00:10:07 +0000
    From: Peter Grandi <pg_lxra@xxxxxxxxxxxxxxxxxxx>
Reply-To: Peter Grandi <pg_lxra@xxxxxxxxxxxxxxxxxxx>
 Subject: Re: RAID5 to RAID6 reshape?
      To: Linux RAID <linux-raid@xxxxxxxxxxxxxxx>


On Sat, 23 Feb 2008 21:40:08 +0100, Nagilum
<nagilum@xxxxxxxxxxx> said:

[ ... ]

* Doing unaligned writes on a 13+1 or 12+2 is catastrophically
slow because of the RMW cycle. This is of course independent
of how one got to something like a 13+1 or a 12+2.

nagilum> Changing a single byte in a 2+1 raid5 or a 13+1 raid5
nagilum> requires exactly two 512byte blocks to be read and
nagilum> written from two different disks. Changing two bytes
nagilum> which are unaligned (the last and first byte of two
nagilum> consecutive stripes) doubles those figures, but more
nagilum> disks are involved.
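
For reference, the single-block update described above can be sketched roughly as follows. This is only an illustration assuming plain XOR parity; the helper names (xor_blocks, full_parity, rmw_parity) are made up for this example and it is not the md driver's actual code path. It shows why only the old data block and the old parity block need to be read, whatever the width of the array.

```python
BLOCK = 512  # bytes per block, as in the example above


def xor_blocks(a, b):
    """XOR two equal-sized blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(blocks):
    """Parity computed from scratch over all data blocks of a stripe."""
    parity = bytes(BLOCK)
    for block in blocks:
        parity = xor_blocks(parity, block)
    return parity


def rmw_parity(old_data, old_parity, new_data):
    """New parity from the old data block and old parity only (the RMW path)."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


# A 3+1 stripe with deterministic contents (illustrative values only).
data = [bytes([i + 1]) * BLOCK for i in range(3)]
parity = full_parity(data)

# Change a single byte in data block 1.
changed = bytearray(data[1])
changed[0] ^= 0xFF
changed = bytes(changed)

# Read 2 blocks (old data, old parity), write 2 blocks (new data, new parity).
new_parity = rmw_parity(data[1], parity, changed)
data[1] = changed
```

Recomputing the parity over all data blocks gives the same result, so the other data disks never have to be touched for this kind of update.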

Here you are using the astute misdirection of talking about
unaligned *byte* *updates* when the issue is unaligned
*stripe* *writes*.

Which are (IMHO) much less likely to occur than minor changes in a block (think touch, mv, chown, chmod, etc.).

If one used your scheme to write a 13+1 stripe one block at a
time, it would take 26R+26W operations (about half of which could be
cached) instead of 14W which are what is required when doing
aligned stripe writes, which is what good file systems try to
achieve.
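
The arithmetic behind those figures can be written down as a small sketch. It assumes that every single-block update reads the old data and old parity and writes the new data and new parity, and that a full-stripe write needs no reads at all; the function names are invented for the illustration.

```python
def per_block_rmw_ops(n_data):
    """(reads, writes) when a stripe is written one block at a time via RMW."""
    # Each block: read old data + old parity, write new data + new parity.
    return 2 * n_data, 2 * n_data


def full_stripe_ops(n_data):
    """(reads, writes) for one aligned full-stripe write."""
    # All data blocks plus one parity block, nothing to read first.
    return 0, n_data + 1


print(per_block_rmw_ops(13))  # (26, 26)
print(full_stripe_ops(13))    # (0, 14)
```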
....
But enough of talking about absurd cases, let's do a good clear
example of why a 13+1 is bad bad bad when doing unaligned writes.

Consider writing to a 2+1 and a 13+1 just 15 blocks in 4+4+4+3
bunches, starting with block 0 (so aligned start, unaligned
bunch length, unaligned total length), a random case but quite
illustrative:

  2+1:
	00 01 P1 03 04 P2 06 07 P3 09 10 P4
        00 01    02 03    04 05    06 07
        ------**-------** ------**-------**
        12 13 P5 15 16 P6 18 19 P7 21 22 P8
        08 09    10 11    12 13    14
        ------**-------** ------**---    **

	write D00 D01 DP1
	write D03 D04 DP2

	write D06 D07 DP3
	write D09 D10 DP4

	write D12 D13 DP5
	write D15 D16 DP6

	write D18 D19 DP7
	read  D21 DP8
	write D21 DP8

        Total:
	  IOP: 01 reads, 08 writes
	  BLK: 02 reads, 23 writes
	  XOR: 28 reads, 15 writes

 13+1:
	00 01 02 03 04 05 06 07 08 09 10 11 12 P1
        00 01 02 03 04 05 06 07 08 09 10 11 12
        ----------- ----------- ----------- -- **

        14 15 16 17 18 19 20 21 22 23 24 25 26 P2
	13 14
	-----                                  **

	read  D00 D01 D02 D03 DP1
	write D00 D01 D02 D03 DP1

	read  D04 D05 D06 D07 DP1
	write D04 D05 D06 D07 DP1

	read  D08 D09 D10 D11 DP1
	write D08 D09 D10 D11 DP1

	read  D12 DP1 D14 D15 DP2
	write D12 DP1 D14 D15 DP2

        Total:
	  IOP: 04 reads, 04 writes
	  BLK: 20 reads, 20 writes
	  XOR: 34 reads, 10 writes

and now the same with cache:

	write D00 D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 DP1
	read  D14 D15 DP2
	write D14 D15 DP2
        Total:
	  IOP: 01 reads, 02 writes
	  BLK: 03 reads, 17 writes
	  XOR: not sure what you're calculating here, but it's mostly irrelevant anyway; even my old 500 MHz Athlon can XOR at over 2.6 GB/s, IIRC.
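
The BLK totals in this example can be reproduced with a short sketch. It models only what the listings above do, under these assumptions: requests start at logical block 0 and follow each other without gaps; within one request, a stripe that is completely covered is written whole (data plus parity, no reads), and a partly covered stripe is handled by RMW on the touched blocks plus parity. A real implementation may choose differently (for example reading the untouched blocks instead), and IOP counts depend on how one groups the transfers, so only blocks are counted here. The function name is made up for the illustration.

```python
def count_blocks(n_data, bunches):
    """Return (block reads, block writes) for back-to-back write requests.

    n_data  -- data blocks per stripe (2 for a 2+1, 13 for a 13+1)
    bunches -- request lengths in blocks, each assumed to be at least 1
    """
    reads = writes = 0
    start = 0
    for length in bunches:
        end = start + length
        # Visit every stripe this request touches.
        for stripe in range(start // n_data, (end - 1) // n_data + 1):
            lo = max(start, stripe * n_data)
            hi = min(end, (stripe + 1) * n_data)
            touched = hi - lo
            if touched == n_data:
                # Full stripe: write all data blocks and the parity block.
                writes += n_data + 1
            else:
                # Partial stripe: RMW of the touched blocks and the parity.
                reads += touched + 1
                writes += touched + 1
        start = end
    return reads, writes


bunches = [4, 4, 4, 3]
print(count_blocks(2, bunches))          # (2, 23)   2+1
print(count_blocks(13, bunches))         # (20, 20)  13+1, no cache
print(count_blocks(13, [sum(bunches)]))  # (3, 17)   13+1, requests merged
```

The last call treats the cache as merging the four requests into a single 15-block write: one full-stripe write of 14 blocks plus a 3-block RMW, i.e. 3 block reads and 17 block writes.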

The short stripe size means that one does not need to RMW in
many cases, just W; and this despite the much higher redundancy
of 2+1. It also means that there are lots of parity blocks to
compute and write. With a 4 block operation length a 3+1 or,
even more so, a 4+1 would be flattered here, but I wanted to
exemplify two extremes.

With a write cache the picture looks a bit better. If the writes happen close enough together in time, they will be merged. If they are further apart, chances are the write speed is not that critical anyway.

The narrow parallelism, and thus short stripe length, of 2+1 means
that far fewer blocks get transferred because of almost no RMW,
but it does 9 IOPs and 13+1 does one less at 8 (wider
parallelism); but then the 2+1 IOPs are mostly in back-to-back
write pairs, while the 13+1 are in read-rewrite pairs, which is
a significant disadvantage (often greatly underestimated).

Never mind that the number of IOPs is almost the same despite
the large difference in width, and that one can build with the
same disks as a 13+1 something like four 2+1/3+1 arrays, thus
gaining a lot of parallelism across threads, if there is such to
be obtained. And if one really wants to write long stripes, one
should use RAID10 of course, not long stripes with a single (or
two) parity blocks.


Never mind that the chances of finding in the IO request stream
a set of back-to-back logical writes to 13 contiguous blocks,
aligned to start on a 13 block multiple, are bound to be lower
than those of getting a set of 2 or 3 blocks, and even worse
with a filesystem mostly built for the wrong stripe alignment.

I have yet to be convinced this difference is that significant.
I think most changes are updates of file attributes (e.g. atime).
File reads will perform better when spread over more disks.
File writes usually write the whole file, so this depends directly on your file sizes, most of which are usually under 1k. If this is for a digital attic, the media files will be in the many-MB range. Both are equally good or bad for the described scenarios.
The advantage is limited to a certain window of file writes.
The size of that window depends on the number of disks just as much as it depends on the chunk size. Depending on the individual usage scenario one or the other window is better suited.

* Unfortunately the frequency of unaligned writes *does*
  usually depend on how dementedly one got to the 13+1 or
  12+2 case: because a filesystem that lays out files so that
  misalignment is minimised with a 2+1 stripe just about
  guarantees that when one switches to a 3+1 stripe all
  previously written data is misaligned, and so on -- and
  never mind that every time one adds a disk a reshape is
  done that shuffles stuff around.
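
A toy illustration of that point, assuming a hypothetical filesystem that has laid out extents of exactly two blocks, back to back, so that each one is a full-stripe write on a 2+1. The extent list and the function name are invented for the example, and the check is deliberately crude: an extent counts as stripe-aligned only if it starts on a stripe boundary and covers whole stripes.

```python
def stripe_aligned_extents(extents, n_data):
    """Count extents that start on a stripe boundary and cover whole stripes."""
    return sum(
        1
        for start, length in extents
        if start % n_data == 0 and length % n_data == 0
    )


# Hypothetical layout tuned for 2 data disks: (start block, length) pairs.
extents = [(2 * i, 2) for i in range(12)]

print(stripe_aligned_extents(extents, 2))  # 12: every rewrite is a full-stripe write
print(stripe_aligned_extents(extents, 3))  # 0: after growing to 3+1 none of them is
```

The logical block addresses stay the same across the reshape; only the stripe boundaries move from multiples of 2 to multiples of 3, so every rewrite of such an extent now needs an RMW.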

In general large chunksizes are not such a brilliant idea, even
if ill-considered benchmarks may show some small advantage with
somewhat larger chunksizes.

Yeah.

My general conclusion is that reshapes are a risky, bad for
performance, expensive operation that is available, like RAID5
in general (and especially RAID5 above 2+1 or in a pinch 3+1)
only for special cases when one cannot do otherwise and knows
exactly what the downside is (which seems somewhat rare).

Agreed, but performance is still acceptable albeit not optimal.

I think that defending the concept of growing a 2+1 into a 13+1
via as many as 11 successive reshapes is quite ridiculous, even
more so when using fatuous arguments about 1 or 2 byte updates.

I don't know why you don't like the example. How many bytes change for an atime update?

It is even worse than coming up with that idea itself, which is
itself worse than that of building a 13+1 to start with.

The advantage is economic. One buys a few disks now and continues to add more over the years as storage needs increase.
But I wouldn't voluntarily do a raid5 with more than 8 disks either.
Kind regards,

----- End message from pg_lxra@xxxxxxxxxxxxxxxxxxx -----



========================================================================
#    _  __          _ __     http://www.nagilum.org/ \n icq://69646724 #
#   / |/ /__ ____ _(_) /_ ____ _  nagilum@xxxxxxxxxxx \n +491776461165 #
#  /    / _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
# /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
#           /___/     x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #
========================================================================


----------------------------------------------------------------
cakebox.homeunix.net - all the machine one needs..


