[ ... RAID6 reading suffering from "skipping" over parity blocks ... ]

> I was indeed very surprised to find out that the skipping is
> *not* free.

Uhmmm :-).

> I am planning to do some research on whether it is possible to
> use specific chunksizes so that when laid out on top of the
> physical media the "skip penalty" is minimized. [ ... ]

Oh no, that does not address the issue. The issue is that whatever
happens in an N+M RAID[56...], a fraction M/(N+M) of the blocks on
disk is not relevant for reading, and that with a simple mapping M
of the N+M drives won't be "active" at any one point. Since usually
(and unfortunately) N>>M (that is, very wide RAID[56...] sets) not
many people worry about that.

The only "solution" that seems practical to me is a "far" layout as
in MD RAID10, generalized: laying data and "parity" blocks not
across the drives, but along them; in the trivial and not so clever
case, putting the "parity" blocks on the same drive(s) as the
stripe they relate to. But obviously that has no redundancy, so as
in the MD RAID10 "far" layout the idea is to stagger/diagonalize
them onto the next drive(s).

For example, in a 2+2 layout like yours, each drive is divided into
two regions, the top one for data, the bottom one for "parity", and
the first few stripes are laid out like this:

  A      B      C      D
  ----------------------------
  [0:0   0:1]   [1:0   1:1]
  [2:0   2:1]   [3:0   3:1]
  ....   ....   ....   ....
  ....   ....   ....   ....
  ----------------------------
  ....   ....   ....   ....
  ....   ....   ....   ....
  [3:P   3:Q]   [2:P   2:Q]
  [1:P   1:Q]   [0:P   0:Q]
  ----------------------------

and the 4+1 case would be:

  A      B      C      D      E
  ------------------------------------
  [0:0   0:1    0:2    0:3]   [1:0
   1:1   1:2    1:3]   [2:0    2:1
   2:2   2:3]   [3:0    3:1    3:2
   3:3]  [4:0    4:1    4:2    4:3]
  ....   ....   ....   ....   ....
  ....   ....   ....   ....   ....
  ....   ....   ....   ....   ....
  ....   ....   ....   ....   ....
  ------------------------------------
  ....   ....   ....   ....   ....
  [4:P]  [3:P]  [2:P]  [1:P]  [0:P]
  ------------------------------------

and for "fun" this is the 3+3 case:

  A      B      C      D      E      F
  --------------------------------------------
  [0:0   0:1    0:2]   [1:0   1:1    1:2]
  [2:0   2:1    2:2]   [3:0   3:1    3:2]
  ....   ....   ....   ....   ....   ....
  ....   ....   ....   ....   ....   ....
  --------------------------------------------
  ....   ....   ....   ....   ....   ....
  ....   ....   ....   ....   ....   ....
  [3:P   3:Q    3:R]   [2:P   2:Q    2:R]
  [1:P   1:Q    1:R]   [0:P   0:Q    0:R]
  --------------------------------------------

More generally, given N data blocks and M "parity" blocks per
stripe, each drive is divided into two areas, a data one taking
N/(N+M) of the disk capacity and a "parity" one taking M/(N+M) of
the disk capacity, and:

- The N blocks long data parts of each stripe are written
  consecutively in the data areas across the N+M RAID set members.

- The M blocks long "parity" parts of each stripe are written
  consecutively *backwards* from the *end* of the "parity" areas
  across the N+M RAID set members.

This ensures that the N and M blocks long parts of each stripe are
written on different disks, staggered neatly; the ordinary layout
is just the degenerate case in which each disk has a single area
and the stripes are not split. One can generalize with other stripe
and block-within-stripe distribution functions too, including ones
that subdivide a stripe (and thus each disk in the RAID set) into
more than 2 parts, but I don't see much point in that, except for
RAID1 1+N where for example each of the N mirrors can be considered
an independent "parity".
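To make the rule concrete, here is a minimal sketch (Python, purely
illustrative; the function name, the block-granularity addressing
and the toy sizes are my own assumptions, nothing that exists in
MD) of where the data and "parity" parts of one stripe land under
this layout:

  # A minimal sketch, not MD code: map one stripe of an N+M "far"
  # RAID[56...] set onto D = N+M member disks, each split into a data
  # area (the first N/(N+M) of its blocks) and a "parity" area (the
  # last M/(N+M) of them).  Addressing is in whole blocks, no chunking.

  def far_layout(stripe, n, m, disk_blocks):
      """Return ([(disk, offset), ...] for the N data blocks,
                 [(disk, offset), ...] for the M "parity" blocks)."""
      d = n + m                          # members in the RAID set
      assert disk_blocks % d == 0        # keep the arithmetic exact
      assert 0 <= stripe < disk_blocks   # stripes that fit on the set
      data_rows = disk_blocks * n // d   # blocks in each data area
      parity_rows = disk_blocks * m // d # blocks in each "parity" area

      # Data parts: written consecutively, forwards, across the members.
      data = [((stripe * n + k) % d, (stripe * n + k) // d)
              for k in range(n)]

      # "Parity" parts: written consecutively *backwards* from the end
      # of the "parity" areas, each M-block part itself kept in order.
      start = parity_rows * d - (stripe + 1) * m
      parity = [((start + j) % d, data_rows + (start + j) // d)
                for j in range(m)]

      return data, parity

  # The 2+2 example above on toy 8-block disks: stripe 0's data lands
  # on disks A and B at offset 0, its P and Q on disks C and D in the
  # very last row; stripe 1 gets the complementary placement.
  for s in range(4):
      print(s, far_layout(s, n=2, m=2, disk_blocks=8))

Running that for stripes 0-3 reproduces the 2+2 diagram above: data
in the first two rows of the data areas, "parity" filling the last
two rows of the "parity" areas backwards.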
Note: RAID10 "far" should also write the mirror from the end, and
arguably RAID1 should mirror backwards too, as that would give
nicely uniform average IO rates despite the speed difference
between inner and outer tracks (a small sketch of this is at the
end of this message).

Of course one could instead write the "parity" section of each
stripe backwards within each row, starting with the first row, but
I like the fully inverted layout better... It may not be practical
though with disk scheduling algorithms, which probably prefer
forward seeking.

But just as "far" RAID10 pays for better single threaded reading
with slower writing, "far" RAID[56...] (or any other scheme, as the
"far" layout is easy to generalize) pays for greater reading speed
in the optimal case with even more terrible writing and incomplete
(degraded) or resync reading, because reading two consecutive
stripes requires seeking across some of the disks.
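As to mirroring backwards, a tiny sketch (again only an
illustration with made-up names and sizes, not what the MD RAID10
"far" code actually does) of a 2-disk mirror where the second copy
is written backwards from the end of the other disk, so that every
block has one copy on the fast outer tracks and one on the slow
inner ones:

  # A tiny sketch, not MD code: 2-disk "far"-style mirror where the
  # second copy of each block is written backwards from the end of
  # the other disk.  Sizes are made up; addressing is in whole blocks.

  def mirrored_copies(block, disks=2, disk_blocks=1_000_000):
      """Return (disk, offset) for the primary copy and its mirror."""
      assert block < disks * disk_blocks // 2    # usable capacity is half
      primary = (block % disks, block // disks)  # striped forwards from the start
      # Mirror: one disk over, filled backwards from the end of the disk.
      mirror = ((block + 1) % disks, disk_blocks - 1 - block // disks)
      return primary, mirror

  # The mean offset of the two copies is (disk_blocks - 1) / 2 for
  # every block, which is the "uniform average IO rate" point above.
  for b in (0, 1, 500_000, 999_999):
      (pd, po), (md, mo) = mirrored_copies(b)
      print(b, (pd, po), (md, mo), "mean offset:", (po + mo) / 2)

The cost mentioned above shows up here too: the two copies of any
block are always about half a disk apart, so writing (or degraded
reading) pays with long seeks.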