Re: Optimal settings for RAID controller - optimized for writes

Hi,

(2014/02/20 9:13), Tomas Vondra wrote:
Hi,

On 19.2.2014 03:45, KONDO Mitsumasa wrote:
(2014/02/19 5:41), Tomas Vondra wrote:
On 18.2.2014 02:23, KONDO Mitsumasa wrote:
Hi,

I don't have a PERC H710 RAID controller, but I think he would like to
know which RAID striping/chunk size and which read/write cache ratio in
the writeback cache setting are best. I'd like to know that, too :)
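(On HP controllers, for example, the cache split can be set from the OS.
A minimal sketch, assuming hpacucli and a controller in slot 0 - the
exact parameter name is from memory, so treat it as an assumption:)

    # hypothetical example: 25% of controller cache for reads,
    # 75% for writes
    hpacucli ctrl slot=0 modify cacheratio=25/75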

The stripe size is actually a very good question. On spinning drives it
usually does not matter too much - unless you have a very specialized
workload, the 'medium size' is the right choice (AFAIK we're using 64kB
on H710, which is the default).
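(If anyone wants to verify this on their own H710, the strip size is
reported by the LSI tools. A minimal sketch, assuming a megacli binary
and default adapter numbering:)

    # H710 is LSI-based; "Strip Size" is reported per logical drive
    megacli -LDInfo -Lall -aALL | grep -i strip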

It's interesting that the RAID stripe size of the PERC H710 is 64kB. On
HP RAID cards, the default chunk size is 256kB. If we use two disks in
RAID 0, the full stripe size will be 512kB. I thought that might be too
big, but it might be optimized inside the RAID card... In practice, it
isn't bad with those settings.

With HP controllers this depends on RAID level (and maybe even
controller). Which HP controller are you talking about? I have some
basic experience with P400/P800, and those have 16kB (RAID6), 64kB
(RAID5) or 128kB (RAID10) defaults. None of them has 256kB.
See http://bit.ly/1bN3gIs (P800) and http://bit.ly/MdsEKN (P400).
I use the P410 and P420, which are shipped in the DL360 gen7 and DL360
gen8. They seem relatively recent. I checked the RAID stripe size (RAID1+0)
using the hpacucli tool, and it is indeed a 256kB chunk size. The P420 also
allows setting larger or smaller chunk sizes, in a range of 8kB - 1024kB.
But I don't know the best parameter for postgres :(
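In case it helps others reproduce this, roughly what I ran (the slot
number is an assumption - adjust it for your controller):

    # show per-logical-drive details, including "Strip Size"
    # and "Full Stripe Size"
    hpacucli ctrl slot=0 ld all show detail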

I'm interested in the RAID card's internal behavior. Fortunately, the
Linux RAID card driver is open source, so we could take a look at the
source code when we have time.

What do you mean by "linux raid card driver"? Afaik the admin tools may
be available, but the interesting stuff happens inside the controller,
and that's still proprietary.
I meant the open source driver. The HP drivers are under the following URL.
http://cciss.sourceforge.net/

However, from a rough read of the driver source code, the core part of the
RAID card programming is in the firmware, as you say. The driver seems to
just drive the card from the OS.
I was interested in the elevator algorithm when I read the driver source
code, but the detailed algorithm is probably in the firmware.
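At least the elevator on the OS side is visible and tunable; for example,
assuming a device named sda:

    # show available I/O schedulers; brackets mark the active one,
    # e.g. "[cfq] deadline noop"
    cat /sys/block/sda/queue/scheduler
    # behind a caching RAID controller, noop or deadline is often tried
    echo noop > /sys/block/sda/queue/scheduler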

With SSDs this might actually matter much more, as SSDs work with
"erase blocks" (mostly 512kB), and I suspect using a small stripe might
result in repeated writes to the same block - overwriting one block
repeatedly and thus increasing wearout. But maybe the controller will
handle that just fine, e.g. by coalescing the writes and sending them to
the drive as a single write. Or maybe the drive can do that in its local
write cache (all SSDs have that).

I have heard that a genuine RAID card with genuine SSDs is optimized for
those SSDs. Using SSDs that are compatible with the card is important for
performance. In the worst case, the lifetime of the SSD will be short,
and performance will be bad.

Well, that's the main question here, right? Because if the "worst case"
actually happens to be true, then what's the point of SSDs?
Sorry, this thread's topic is SSD striping size tuning. I'm especially interested in magnetic disks, but I'm also interested in SSDs.

You have a
disk that does not provide the performance you expected, died much
sooner than you expected, and maybe so suddenly that it interrupted operation.
So instead of paying more for higher performance, you paid more for bad
performance and a much shorter life of the disk.
I'm interested in the idea that changing the RAID chunk size may shorten the drive's life. I had not considered this point. It might be true. I'd like to test it using a SMART checker if we have time.
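Something like smartctl should be enough for a first look; a minimal
sketch, assuming smartmontools and a drive at /dev/sda (the exact wear
attribute name varies by SSD vendor):

    # dump SMART attributes; look for wear indicators such as
    # "Wear_Leveling_Count" or "Media_Wearout_Indicator"
    smartctl -A /dev/sda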


Coincidentally we're currently trying to find the answer to this
question too. That is - how long will the SSD endure in that particular
RAID level? Does that pay off?

BTW what you mean by "genuine raid card" and "genuine ssds"?
By "genuine" I mean that the card and the SSDs are from the same manufacturer or vendor.

I'm wondering about the effectiveness of readahead in the OS versus the
RAID card. In general, data read ahead by the RAID card is stored in the
RAID cache, and not in the OS cache. Data read ahead by the OS is stored
in the OS cache. I'd like to use the whole RAID cache as write cache only,
because fsync() becomes faster. But then the RAID card cannot do much
readahead.. If we want to use the cache more effectively, we would have
to clear it, but that seems difficult :(
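On LSI-based cards that split seems at least partly controllable; a
minimal sketch, assuming MegaCli (I have not verified this on the H710
myself):

    # disable controller read-ahead and enable write-back, effectively
    # dedicating the controller cache to writes
    megacli -LDSetProp NORA -Lall -aALL
    megacli -LDSetProp WB -Lall -aALL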

I've done a lot of testing of this on H710 in 2012 (~18 months ago),
measuring combinations of

    * read-ahead on controller (adaptive, enabled, disabled)
    * read-ahead in kernel (with various sizes)
    * scheduler

The test was the simplest and most suitable workload for this - just
"dd" with 1MB block size (AFAIK, would have to check the scripts).

In short, my findings are that:

    * read-ahead in kernel matters - tweak this
    * read-ahead on controller sucks - either makes no difference, or
      actually harms performance (adaptive with small values set for
      kernel read-ahead)
    * scheduler made no difference (at least for this workload)

So we disable readahead on the controller, use 24576 for kernel and it
works fine.

I've done the same test with fusionio iodrive (attached to PCIe, not
through controller) - absolutely no difference.
I'd like to know the random access (8kB) performance; dd does not show that..
But this is interesting data. What command did you use to set the kernel readahead parameter?
With blockdev, a value of 256 means 256 * 512B (sector size) = 128kB of readahead.
So your setting of 24576 means 24576 * 512B = 12MB of readahead.
I think that is quite big, but it may be optimal in your environment.
At the end of the day, is a very large readahead better than a small readahead or none at all? If we have big RAM, it seems true. But in other situations, is it? It is a difficult problem.
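For reference, a minimal sketch of setting it with blockdev (whether that
is the exact mechanism used here is my assumption; the device name is an
example):

    # readahead is specified in 512-byte sectors: 24576 * 512B = 12MB
    blockdev --setra 24576 /dev/sda
    # verify the current value
    blockdev --getra /dev/sda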

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

