On Tue, Jul 08, 2008 at 09:29:08AM +1000, Neil Brown wrote:
>
> (Adding linux-raid - I hope that's OK Keld?)

Yeah, that is fine :-)

> On Wednesday July 2, keld@xxxxxxxx wrote:
> >
> >     When 'offset' replicas are chosen, the multiple copies of a given
> >     chunk are laid out on consecutive drives and at consecutive
> >     offsets. Effectively each stripe is duplicated and the copies are
> >     offset by one device. This should give similar read
> >     characteristics to 'far' if a suitably large chunk size is used,
> >     but without as much seeking for writes.
> >
> > A number of benchmarks have shown that the 'offset' layout does not
> > have read characteristics similar to the 'far' layout. A number of
> > benchmarks have also shown that seeking is similar in the 'far' and
> > 'offset' layouts. So I suggest removing the last sentence.
>
> If I have done any such benchmarks, it was too long ago to remember,
> so I decided to do some simple tests and graph them. I like graphs
> and I like this one so I've decided to share it.

I like graphs too! May I use your graph on the wiki?

> The X axis is chunk size, ranging from 4k to 4096k - it is logarithmic.
> The Y axis is throughput in MB/s measured by 'dd' to the raw device -
> average of 5 runs.
> This was with a 2-drive raid with each of the possible layouts: n2, f2,
> o2.
>
> f2-read is strikingly faster than anything else. It is clearly
> reading from both drives at once, as you would expect it to.
> f2-write is slower than anything else (except at 4K chunk size, which
> is an extreme case).

Yes, in your test. Is this done with dd on the raw array? My tests
indicate that writing is almost the same for raid10,n2 and raid10,f2
when using the ext3 fs. I think the elevator comes into play here.
And I actually think this is important. You do not use an array without
a fs on top of it, and for the user it is really the resulting
performance of the raid plus the fs that is interesting. The raw array
is not that interesting.

> o2-read is fairly steady for most of the chunk sizes, but peaks up at
> 2M and only drops a little at 4M. This seems to suggest that it is
> around 2M that the time to seek over a chunk drops well below the time
> to read one chunk. Possibly at smaller chunk sizes, it just reads
> through to skip N sectors. Maybe the cylinder size is about 2Meg -
> there is no real gain from the offset layout until you can seek over
> whole cylinders.
> So the sentence:
>     This should give similar read characteristics to 'far' if a
>     suitably large chunk size is used
> seems somewhat justified if the chunk size used is 2M.

Your graph indicates that raid10,o2 is something like 20% under the
performance of raid10,f2 in the best case. In the worst case it is
about 30% under. To me this is not "similar". To me it is better
described as performance 20-30% below that of raid10,f2.

> It might be interesting to implement non-power-of-2 chunk sizes and try
> a range of sizes between 1M and 4M to see what the graph looks like...
> maybe we could find the actual cylinder size.
>
> o2-write is very close to n2-write and is measurably (8%-14%) higher
> than f2-write. This seems to support the sentence
>     but without as much seeking for writes.
>
> It is not that there are fewer seeks, but that the seeks are shorter.

This is most likely compensated for by the elevator, as described above.
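For anyone who wants to repeat this kind of measurement, a rough sketch
of the commands could look like the one below. This is only my guess at
a comparable setup; the device names, sizes, mount point and the use of
direct I/O are assumptions, not Neil's actual test commands:

  # Create a 2-drive raid10 with the 'far' layout and a 2 MiB chunk
  # (repeat with --layout=n2 and --layout=o2 for the other layouts):
  mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=2048 \
        --raid-devices=2 /dev/sdX1 /dev/sdY1

  # Sequential read throughput on the raw array (non-destructive):
  dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct

  # Sequential write throughput on the raw array
  # (WARNING: this destroys any data on the array):
  dd if=/dev/zero of=/dev/md0 bs=1M count=4096 oflag=direct

  # For the filesystem-level comparison, write through ext3 instead:
  mkfs.ext3 /dev/md0
  mount /dev/md0 /mnt/test
  dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=4096 conv=fdatasync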
> So while I don't want to just remove that last sentence, I agree that
> it could be improved, possibly by giving a ball-park figure for what a
> "suitably large chunk size" is. Also the second half could be
> "but without the long seeks being required for sequential writes".
>
> It would probably be good to do some measurements with random IO as
> well to see how they compare.
>
> Anyone else have some measurements they would like to share?

There are more than a handful in the wiki at
http://linux-raid.osdl.org/index.php/Performance
This includes some tests for random IO. (A rough sketch of one way to
run such a random-IO test is appended at the end of this mail.)

> Thanks for your suggestions.

You are welcome!

In my quest for updated documentation for Linux raid, I find that the
mdadm documentation is also very outdated. The mdadm man page that
Google turns up, and the one referenced on Wikipedia for mdadm, do not
include any info on raid10! Is there a page we could reference which
has the current mdadm man page, and which is maintained?

I note that our raid wiki is now number 3 on Google. That is a lot
better than number 121, which was its place about half a year ago :-)
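P.S. For the random-IO measurements mentioned above, a simple test could
look roughly like the following. fio is only one possible tool and is my
own suggestion here; the device name, file name and options are
assumptions, and the tests on the wiki may have been run differently:

  # Random 4k reads directly against the raw array (non-destructive):
  fio --name=randread-raw --filename=/dev/md0 --direct=1 --rw=randread \
      --bs=4k --ioengine=libaio --iodepth=16 --runtime=60 --time_based

  # The same against a file on the mounted filesystem, so that the fs
  # and the elevator are included in the picture:
  fio --name=randread-fs --filename=/mnt/test/testfile --size=4G \
      --direct=1 --rw=randread --bs=4k --runtime=60 --time_based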