Re: Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)


 



Disclaimer: I’ve done extensive testing (fio and Postgres) with a few different RAID controllers and HW RAID vs mdadm. We (Micron) own the Crucial brand, but I don’t personally work with the consumer drives.

 

Verify whether you have your disk write cache enabled or disabled. If it’s disabled, that will have a large impact on write performance.
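
If you need to check or flip it, something along these lines should work (device names are placeholders; hdparm for SATA, sdparm for SAS):

  # Show the volatile write cache state on a SATA drive (1 = enabled, 0 = disabled)
  hdparm -W /dev/sdX

  # Enable it (only if you accept losing in-flight writes on power failure,
  # or the drive has power-loss protection)
  hdparm -W1 /dev/sdX

  # SAS drives expose the same setting through the WCE bit
  sdparm --get=WCE /dev/sdX
  sdparm --set=WCE=1 /dev/sdX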

 

Is this the *exact* string you used? `fio --filename=/dev/sdx --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest`

 

With fio, you need to multiply iodepth by numjobs to get the total queue depth it’s pushing (in this case, 16 x 16 = 256). Make sure you’re comparing numbers measured at the same effective queue depth.
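
For example, these two runs both push 256 outstanding IOs (device name and runtime are placeholders):

  # 16 jobs x 16 deep = 256 outstanding IOs
  fio --filename=/dev/sdX --direct=1 --rw=randread --ioengine=libaio \
      --bs=4k --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=qd256

  # Same total queue depth from a single job
  fio --filename=/dev/sdX --direct=1 --rw=randread --ioengine=libaio \
      --bs=4k --iodepth=256 --numjobs=1 --runtime=60 --group_reporting --name=qd256-single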

 

Few other things:

- mdadm will give better performance than HW RAID for specific benchmarks.

- Performance is NOT linear with drive count for synthetic benchmarks.

- It is often nearly linear for application performance.

- HW RAID can give better performance if your drives do not have a capacitor-backed cache (like the MX300) AND the controller has a battery-backed cache. *Consumer drives can often get better performance from HW RAID.* (Otherwise mdadm has been faster in all of my testing.)

- mdadm RAID10 has a bug where reads are not properly distributed between the mirror pairs. (It uses the head position calculated from the last IO to decide which drive in a mirror pair gets the next read. The result is really weird behavior where most read IO goes to half of your drives instead of being evenly split, as it should be for SSDs.) You can see this by running iostat while you’ve got a load running; the IO distribution will be uneven. FYI, the RAID1 implementation has an exception where it does NOT use head position for SSDs. I have yet to test this, but you should be able to get better performance by manually striping a RAID0 across multiple RAID1s instead of using the default RAID10 implementation (see the sketch after this list).

- Don’t focus on 4k random read. Do something more similar to a PG workload (64k 70/30 R/W @ QD=4 is *reasonably* close to what I see for heavy OLTP; a sample fio invocation follows below). I’ve tested multiple controllers based on the LSI 3108 and found that default settings from one vendor to another provide drastically different performance profiles. Vendor A had much better benchmark performance (2x the IOPS of B), while vendor B gave better application performance (20% better OLTP performance in Postgres). (I got equivalent performance from A and B when using the same settings.)
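
Roughly, the manual RAID1+0 layout would look like this (device names and chunk size are placeholders; an untested sketch, not a recipe):

  # Build three RAID1 mirror pairs (hypothetical device names)
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdf

  # Stripe a RAID0 across the mirrors instead of using --level=10
  mdadm --create /dev/md0 --level=0 --chunk=64 --raid-devices=3 /dev/md1 /dev/md2 /dev/md3

  # Watch per-device IO distribution while a load is running
  iostat -xm 1

And an fio invocation along the lines of that OLTP-ish workload (target device and runtime are placeholders):

  # 64k blocks, 70% read / 30% write, total queue depth 4 (iodepth x numjobs = 4 x 1)
  fio --filename=/dev/sdX --direct=1 --rw=randrw --rwmixread=70 \
      --ioengine=libaio --bs=64k --iodepth=4 --numjobs=1 \
      --runtime=60 --time_based --norandommap --randrepeat=0 \
      --group_reporting --name=oltp-like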

 

 

Wes Vaske

Senior Storage Solutions Engineer

Micron Technology

 

From: pgsql-performance-owner@xxxxxxxxxxxxxx [mailto:pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of Merlin Moncure
Sent: Tuesday, February 21, 2017 9:05 AM
To: Pietro Pugni <pietro.pugni@xxxxxxxxx>
Cc: pgsql-performance@xxxxxxxxxxxxxx
Subject: Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

 

On Tue, Feb 21, 2017 at 7:49 AM, Pietro Pugni <pietro.pugni@xxxxxxxxx> wrote:

Hi there,

I configured an IBM X3650 M4 for development and testing purposes. It’s composed of:

 - 2 x Intel Xeon E5-2690 @ 2.90Ghz (2 x 8 physical Cores + HT)

 - 96GB RAM DDR3 1333MHz (12 x 8GB)

 - 2 x 146GB SAS HDDs @ 15k rpm configured in RAID1 (mdadm)

 - 6 x 525GB SATA SSDs (over-provisioned at 25%, so 393GB available)

 

I’ve done a lot of testing focusing on 4k and 8k workloads and found that the IOPS of those SSDs are half of what’s expected. On serverfault.com someone suggested to me that the bottleneck is probably the embedded RAID controller, an IBM ServeRAID M5110e, which is based on an LSI 2008 chip.

 

I’m using the disks in JBOD mode with mdadm software RAID, which is blazing fast. The CPU is also very fast, so I don’t mind having a little overhead due to software RAID.

 

My typical workload is Postgres run as a DWH with 1 to 2 billion rows, big indexes, partitions and so on, but also intensive statistical computations.

 

 

Here’s a graph of those six SSDs evaluated using fio as stand-alone disks (outside of the RAID):

 

https://i.stack.imgur.com/ZMhUJ.png

 

All those IOPS should be double what they are if everything were working correctly. The curve trend is as expected for increasing IO depths.

 

 

Anyway, I would like to buy an HBA controller that leverages those 6 SSDs. Each SSD should deliver about 80k to 90k IOPS, so in RAID10 I should get ~240k IOPS (6 x 80k / 2) and in RAID0 ~480k IOPS (6 x 80k). I’ve seen that mdadm effectively scales performance, but the controller limits the overall IOPS to ~120k (exactly half of the expected IOPS).

 

What HBA controller able to handle 500k IOPS would you suggest?

 

 

My server can take 8 more SSDs, for a total of 14 SSDs and 1,260k theoretical IOPS. If we imagine adding only 2 more disks, I would achieve 720k theoretical IOPS in RAID0 (8 x 90k).

 

What HBA controller able to handle more than 700k IOPS would you suggest?

 

Do you have any advice about using mdadm software RAID on SATA III SSDs with a plain HBA?

 

Random points/suggestions:

*) mdadm is the way to go. I think you'll get bandwidth-constrained on most modern HBAs unless they are really crappy. On reasonably modern hardware, storage is rarely the bottleneck anymore (which is a great place to be). Fancy RAID controllers may actually hurt performance -- they are obsolete IMNSHO.

 

*) Small point, but you'll want to crank effective_io_concurrency (see: https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH%3DHWh1WbLRsioe%3DmzRJTHwtr%3D2azsTdQ%40mail.gmail.com). It only affects certain kinds of queries, but when it works it really works. Those benchmarks were done on my crapbox Dell workstation!
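
For example (database name is a placeholder; the value 256 is purely illustrative, and ALTER SYSTEM needs PostgreSQL 9.4+):

  # Raise it cluster-wide and reload; no restart needed for this parameter
  psql -d mydb -c "ALTER SYSTEM SET effective_io_concurrency = 256;"
  psql -d mydb -c "SELECT pg_reload_conf();"
  psql -d mydb -c "SHOW effective_io_concurrency;"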

 

*) For very high transaction rates, you can get a lot of benefit from disabling synchronous_commit if you are willing to accommodate the risk.  I do not recommend disabling fsync unless you are prepared to regenerate the entire database at any time.
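
Something like this (database name is a placeholder); with asynchronous commit you risk losing only the most recent commits on a crash, not database consistency:

  # Disable synchronous commit globally (it can also be set per session or per transaction)
  psql -d mydb -c "ALTER SYSTEM SET synchronous_commit = off;"
  psql -d mydb -c "SELECT pg_reload_conf();"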

 

*) Don't assume indefinite linear scaling as you increase storage capacity -- the database itself can become the bottleneck, especially for writing.   To improve write performance, classic optimization strategies of trying to intelligently bundle writes around units of work still apply.  If you are expecting high rates of write activity your engineering focus needs to be here for sure (read scaling is comparatively pretty easy).
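
As one simple illustration of bundling writes (table, column, and file names are hypothetical), many rows per transaction or a COPY is far cheaper than one commit per row:

  # Batch many rows into a single transaction instead of committing each INSERT
  psql -d mydb -c "BEGIN; INSERT INTO events (payload) SELECT md5(g::text) FROM generate_series(1, 10000) g; COMMIT;"

  # Or bulk-load with COPY
  psql -d mydb -c "\copy events (payload) FROM 'events.csv' WITH (FORMAT csv)"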

 

*) I would start doing your benchmarking with pgbench since that is going to most closely reflect measured production performance.  
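
For instance (scale factor, client count, and run length are placeholders to size to your hardware; scale 1000 is roughly 15 GB):

  # Initialize a pgbench database at scale factor 1000
  pgbench -i -s 1000 mydb

  # TPC-B-like read/write run: 16 clients, 4 worker threads, 5 minutes
  pgbench -c 16 -j 4 -T 300 mydb

  # Read-only variant for comparison
  pgbench -c 16 -j 4 -T 300 -S mydb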

 

> My typical workload is Postgres run as a DWH with 1 to 2 billions of rows, big indexes, partitions and so on, but also intensive statistical computations.

 

If this is the case, your stack's performance is going to be driven by data structure design. Make liberal use of the following (a few illustrative snippets appear after the list):

*) natural keys 

*) constraint exclusion for partition selection

*) BRIN indexes are amazing (if you can work within their limitations)

*) partial indexing

*) covering indexes. If you use them, don't forget to vacuum your partitions before you make them live (index-only scans need an up-to-date visibility map)
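
A few illustrative snippets of the above (table and column names are hypothetical, and the partition example uses the inheritance-style partitioning current as of 9.6):

  # BRIN index on a naturally ordered column of a big append-only table
  psql -d mydb -c "CREATE INDEX ON measurements USING brin (recorded_at);"

  # Partial index covering only the rows a hot query actually touches
  psql -d mydb -c "CREATE INDEX ON orders (customer_id) WHERE status = 'open';"

  # Partition with a CHECK constraint so constraint exclusion can skip it
  psql -d mydb -c "CREATE TABLE measurements_2017_02 (CHECK (recorded_at >= '2017-02-01' AND recorded_at < '2017-03-01')) INHERITS (measurements);"

  # Vacuum a freshly loaded partition before taking it live
  psql -d mydb -c "VACUUM ANALYZE measurements_2017_02;"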

 

If your data is going to get really big and/or query activity is expected to be high, keep an eye on your scale-out strategy. Going monolithic to bootstrap your app is the right choice IMO, but start thinking about the longer term if you are expecting growth. I'm starting to come around to the perspective that a lift-and-shift scale-out using postgres_fdw, without an insane amount of app retooling, could be a viable option by Postgres 11/12 or so. For my part I scaled out over asynchronous dblink, which is a much more maintenance-heavy strategy (but works fabulously, although I wish you could connect asynchronously). A rough sketch of the asynchronous dblink pattern follows below.
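
The asynchronous dblink pattern looks roughly like this (connection string, table, and shard names are placeholders; requires the dblink extension):

  psql -d mydb <<'SQL'
  -- requires: CREATE EXTENSION dblink;
  SELECT dblink_connect('shard1', 'host=shard1.example.com dbname=mydb');
  -- fire the remote query without blocking the local session
  SELECT dblink_send_query('shard1', 'SELECT count(*) FROM big_table');
  -- ... do other work, polling with dblink_is_busy('shard1') ...
  SELECT * FROM dblink_get_result('shard1') AS t(n bigint);
  SELECT dblink_disconnect('shard1');
  SQL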

 

merlin

 

