Re: XFS on top RAID10 with odd drives count and 2 near copies

On 17/02/2012 14:16, Stan Hoeppner wrote:
> On 2/15/2012 9:40 AM, David Brown wrote:
>
>> Like Robin said, and like I said in my earlier post, the second copy is
>> on a different disk.

> We've ended up too deep in the mud here.  Keld's explanation didn't make
> sense resulting in my "huh" reply.  Let's move on from there back to the
> real question.
>
> You guys seem to assume that since I asked a question about the near,far
> layouts that I'm ignorant of them.  These layouts are the SNIA
> integrated adjacent stripe and offset stripe mirroring.  They are well
> known.  This is not what I asked about.


As far as I can see (from the SNIA DDF Technical Position v. 2.0), md raid10,n2 is roughly SNIA RAID-1E "integrated adjacent stripe mirroring", while raid10,o2 (offset layout) is roughly SNIA RAID-1E "integrated offset stripe mirroring". I say roughly because I don't know if SNIA covers raid10 with only 2 disks, and I am not 100% sure whether the choice of which disk mirrors which other disk is the same.

I can't see any SNIA level that remotely matches md raid10,far layout.

>> As far as I can see, you are the only one in this thread who doesn't
>> understand this.  I'm not sure where the problem lies, as several people
>> (including me) have given you explanations that seem pretty clear to me.
>> But maybe there is some fundamental point that we are assuming is
>> obvious, but you don't get - hopefully it will suddenly click in place
>> for you.

> Again, the problem is you're assuming I'm ignorant of the subject, and
> are simply repeating the boilerplate.

>> Forget writes for a moment. [snip]

> This saga is all about writes.  The fact you're running away from writes
> may be part of the problem.


The whole point of raid10,far is to improve read speed compared to other layouts - even though it is slower for writes. Obviously you /can/ do writes, and obviously they are safe and mirrored - but for this read-heavy application the speed of writes should not be the main issue. The point is that raid10,far will give faster /reads/ than other layouts. No one is "running away" from writes - I am just putting them aside to help the explanation.

> Back to the original issue.  Coolcold and I were trying to figure out
> what the XFS write stripe alignment should be for a 7 disk mdraid10 near
> layout array.


That is certainly one issue - and it's something you know a lot more about than me. So I am not getting involved in that (but I am listening in and learning).

But I can't sit idly by while you discuss the details of XFS striping over raid10,near when I believe a change to raid10,far will make a much bigger difference to this read-heavy application.

> After multiple posts from David, Robin, and Keld attempting to 'educate'
> me WRT the mdraid driver read tricks which yield an "effective RAID0
> stripe", nobody has yet answered my question:
>
> What is the stripe spindle width of a 7 drive mdraid near array?

With "near" layout, it is basically 3.5 spindles. raid10,n2 is the same layout as normal raid10 if the number of disks is a multiple of 2. (See later before you react to the "3.5 spindles".)

With "far" or "offset" layout it is clearly 7 spindles.

As you say, md raid10 gives an "effective raid0 stripe" for offset and far layouts.

The difference with raid10,far compared to raid10,offset is that each of these raid0 stripe reads comes from the fastest half of the disk, with minimal head movement (while reading), and with better use of disk read-ahead.


> Do note that stripe width is specific to writes.  It has nothing to do
> with reads, from the filesystem perspective anyway.  For internal array
> operations it will.


I don't understand that at all.

To my mind, stripe width applies to both reads and writes. For reads, it is the number of spindles that are used in parallel when reading larger blocks of data. For writes, it additionally determines the width of a parity stripe for raid5 or raid6.

Normally, the filesystem does not care about stripe widths, either for reading or writing, just as it does not care whether you have one disk, an array, local disks, iSCSI disks, or whatever. Some filesystems care a /little/ about stripe width in that they align certain structures to stripe boundaries to make accesses more efficient.
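
To make that concrete, here is a tiny Python sketch of the kind of alignment check I mean. The function name and the numbers are just for illustration - the chunk size and spindle count are example values, not a recommendation for this array:

CHUNK_BYTES = 512 * 1024       # example md chunk size ("stripe unit")
DATA_SPINDLES = 7              # example spindle count for the stripe width
STRIPE_WIDTH_BYTES = CHUNK_BYTES * DATA_SPINDLES

def is_stripe_aligned(offset_bytes):
    # True if a write starting at this byte offset begins on a full stripe
    # boundary, so it can hit all spindles without a partial leading chunk.
    return offset_bytes % STRIPE_WIDTH_BYTES == 0

print(is_stripe_aligned(0))                       # True
print(is_stripe_aligned(3 * STRIPE_WIDTH_BYTES))  # True
print(is_stripe_aligned(CHUNK_BYTES))             # False - starts mid-stripe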

> So let's take a look at two 4 drive RAIDs, a standard RAID10 and a
> RAID10,n/f.  The standard RAID10 array has a stripe across two drives.
> Each drive has a mirror.  Stripe writes are two devices wide.  There are
> a total of 4 write operations to the drives, 2 data and 2 mirror data.
> Stripe width concerns only data.


Fine so far.  In pictures, we have this:

Given data blocks 0, 1, 2, 3, ...., with copies "a" and "b", you have:

Standard raid10:

disk0 = 0a 2a 4a 6a 8a
disk1 = 0b 2b 4b 6b 8b
disk2 = 1a 3a 5a 7a 9a
disk3 = 1b 3b 5b 7b 9b

The stripe width is 2 - if you try to do a large read, you will get data from two drives in parallel.

Small writes (a single chunk) will involve 2 write operations - one to the "a" copy and one to the "b" copy of the block - which are done in parallel as they are on different disks. Large writes are also written as two copies, and will go to all four disks in parallel.

"raid10,n2" layout is exactly the same as standard "raid10" - i.e., a stripe of mirrors - when there is a multiple of 2 disks. For seven disks, the layout would be:

disk0 = 0a 3b 7a
disk1 = 0b 4a 7b
disk2 = 1a 4b 8a
disk3 = 1b 5a 8b
disk4 = 2a 5b 9a
disk5 = 2b 6a 9b
disk6 = 3a 6b 10a
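
If you want to check that table, here is a short Python sketch of the placement rule as I understand it (my own reading of the near-2 layout, not taken from the md source, and the function name is just mine): the two copies of each chunk go into consecutive device slots, filled row by row across the disks.

def near2_layout(num_disks, num_chunks):
    # raid10,n2 as I understand it: copies "a" and "b" of each chunk occupy
    # consecutive device slots, and slots are filled row by row.
    disks = [[] for _ in range(num_disks)]
    for chunk in range(num_chunks):
        for copy, label in enumerate("ab"):
            slot = 2 * chunk + copy
            disks[slot % num_disks].append(f"{chunk}{label}")
    return disks

# Reproduces the 7 disk table above (plus the wrapped copy 10b on disk0).
for d, chunks in enumerate(near2_layout(7, 11)):
    print(f"disk{d} = " + " ".join(chunks))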


> The n,f rotate the data and mirror data writes around the 4 drives.  So
> it is possible, and I assume this is the case, to write data and mirror
> data 4 times, making the stripe width 4, even though this takes twice as
> many RAID IOs compared to the standard RAID10 layout.  If this is the
> case, this is what we'd tell mkfs.xfs.  So in the 7 drive case it would
> be seven.  This is the only thing I'm unclear about WRT the near/far
> layouts, thus my original question.  I believe Neil will be definitively
> answering this shortly.


I think you are probably right here - it doesn't make sense to talk about a "3.5" spindle width. If you call it 7, then it should work well even though each write takes two operations.
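
To put numbers on that, here is a rough sketch of the resulting mkfs.xfs geometry under the "call it 7" assumption. The 512 KiB chunk size is only an example - plug in the array's real chunk size:

chunk_kib = 512          # example chunk size, not necessarily this array's
data_spindles = 7        # the "call it 7" assumption discussed above

su_bytes = chunk_kib * 1024
swidth_bytes = su_bytes * data_spindles

# Roughly what we'd pass to mkfs.xfs if we settle on a width of 7:
print(f"-d su={chunk_kib}k,sw={data_spindles}")
print(f"full stripe = {swidth_bytes // 1024} KiB")   # 3584 KiB with these numbers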


Let me draw the pictures of the 4 and 7 disk layouts for raid10,o2 (offset) and raid10,f2 (far) to show what is going on:


Raid10,offset:

With 4 disks:

disk0 = 0a 3b 4a 7b 8a  11b
disk1 = 1a 0b 5a 4b 9a  8b
disk2 = 2a 1b 6a 5b 10a 9b
disk3 = 3a 2b 7a 6b 11a 10b

With 7 disks:

disk0 = 0a 6b 7a  13b
disk1 = 1a 0b 8a  7b
disk2 = 2a 1b 9a  8b
disk3 = 3a 2b 10a 9b
disk4 = 4a 3b 11a 10b
disk5 = 5a 4b 12a 11b
disk6 = 6a 5b 13a 12b

As you can guess, this gives good read speeds (7 spindles in parallel, though not ideal read-ahead usage), and write speeds are also good (again, all 7 spindles can be used in parallel, and head movement between the two copies is minimal). This layout is faster than standard raid10 or raid10,n2 in most use cases, though for lots of small parallel accesses (where striped reads don't occur) there will be no difference.
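
In the same spirit as the near-2 sketch earlier, this is my reading of the offset-2 placement rule (an illustration in Python, not lifted from the md code): each row of data chunks is followed immediately by a copy row rotated one device to the right.

def offset2_layout(num_disks, num_rows):
    # raid10,o2 as I understand it: a data row, then the same chunks again
    # as a copy row rotated one device to the right.
    disks = [[] for _ in range(num_disks)]
    for row in range(num_rows):
        base = row * num_disks
        for d in range(num_disks):
            disks[d].append(f"{base + d}a")                    # data row
        for d in range(num_disks):
            disks[(d + 1) % num_disks].append(f"{base + d}b")  # rotated copies
    return disks

# Reproduces the 7 disk offset table above.
for d, chunks in enumerate(offset2_layout(7, 2)):
    print(f"disk{d} = " + " ".join(chunks))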


Raid10,far:

With 4 disks:

disk0 = 0a 4a 8a  ... 3b 7b 11b ...
disk1 = 1a 5a 9a  ... 0b 4b 8b  ...
disk2 = 2a 6a 10a ... 1b 5b 9b  ...
disk3 = 3a 7a 11a ... 2b 6b 10b ...

With 7 disks:

disk0 = 0a 7a  ... 6b 13b ...
disk1 = 1a 8a  ... 0b 7b  ...
disk2 = 2a 9a  ... 1b 8b  ...
disk3 = 3a 10a ... 2b 9b  ...
disk4 = 4a 11a ... 3b 10b ...
disk5 = 5a 12a ... 4b 11b ...
disk6 = 6a 13a ... 5b 12b ...

This gives optimal read speeds (7 spindles in parallel, ideal read-ahead usage, and all data taken from the faster half of the disks). Write speeds are not bad (again, all 7 spindles can be used in parallel, but you have large head movements between writing each copy of the data). For reads, this layout is faster than standard raid10, raid10,n2, raid10,o2, and even standard raid0 (since the average bandwidth is higher on the outer halves, and the average head movement during read seeks is lower). But writes have longer latencies.
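
The far-2 picture comes out of the same kind of rule: the first half of each disk holds one copy striped normally, and the second half holds the other copy rotated one device to the right. Again this is a sketch of my understanding rather than the md code, with the "..." in the table standing for the rest of each half:

def far2_layout(num_disks, num_rows):
    # raid10,f2 as I understand it: striped data in the first half of each
    # disk, the rotated copies in the second half.
    first_half = [[] for _ in range(num_disks)]
    second_half = [[] for _ in range(num_disks)]
    for row in range(num_rows):
        base = row * num_disks
        for d in range(num_disks):
            first_half[d].append(f"{base + d}a")
            second_half[(d + 1) % num_disks].append(f"{base + d}b")
    return first_half, second_half

# Reproduces the 7 disk far table above.
first, second = far2_layout(7, 2)
for d in range(7):
    print(f"disk{d} = " + " ".join(first[d]) + " ... " + " ".join(second[d]) + " ...")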


When you are dealing with multiple parallel small reads, much of the difference here disappears. But there is still nothing to lose by using raid10,far if you have read-heavy applications - and the shorter head movements will still make it faster. If the longer write operations are a concern, raid10,offset may be a better compromise - it is certainly still better than raid10,near.


> There is a potential problem with this though, if my assumption about
> write behavior of n/f is correct.  We've now done 8 RAID IOs to the 4
> drives in a single RAID operation.  There should only be 4 RAID IOs in
> this case, one to each disk.  This tends to violate some long accepted
> standards/behavior WRT RAID IO write patterns.  Traditionally, one RAID
> IO meant only one set of sector operations per disk, dictated by the
> chunk/strip size.  Here we'll have twice as many, but should
> theoretically also be able to push twice as much data per RAID write
> operation since our stripe width would be doubled, negating the double
> write IOs.  I've not tested these head to head myself.  Such results
> with a high IOPS random write workload would be interesting.


Most of my comments here are based on understanding the theory, rather than the practice - it's been a while since I did any benchmarking with different layouts and that was not very scientific testing. I certainly agree it would be interesting to see test results.

I can't say if the extra writes will be an issue - it may conceivably affect speeds if the filesystem is optimised on the assumption that a write to 7 spindles means only 7 head movements and 7 write operations. But this is the same issue as you always get with layered raid - logically speaking, Linux raid10 (regardless of layout) appears as a stripe of mirrors just like traditional layered raid10.
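
As a back-of-the-envelope comparison (just counting chunk-sized device writes, ignoring seeks and request merging), this is the arithmetic I have in mind for a full-width write on the 7 disk array; the width of 7 is the assumption from earlier in this mail, not a measured result:

# Just the arithmetic, not a benchmark: chunk-sized device writes needed
# for one full-width write on 7 disks with 2 copies, whatever the layout.
disks = 7
copies = 2

data_chunks = disks                       # counting the width as 7, as above
device_writes = data_chunks * copies      # every chunk is written twice

print(f"{data_chunks} data chunks -> {device_writes} device writes "
      f"({device_writes // disks} per disk)")
# 7 data chunks -> 14 device writes (2 per disk)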

Best regards,

David
--

