>From: Mikael Carneholm <Mikael.Carneholm@xxxxxxxxxxxxxxx>
>Sent: Jul 16, 2006 6:52 PM
>To: pgsql-performance@xxxxxxxxxxxxxx
>Subject: [PERFORM] RAID stripe size question
>
>I have finally gotten my hands on the MSA1500 that we ordered some time
>ago. It has 28 x 10K 146Gb drives,

Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 15K, not 10K (unless they are old?). I'm not just being pedantic: the correct, let alone optimal, answer to your question depends on your exact HW characteristics as well as your SW config and your usage pattern.

15Krpm HDs will have average access times of 5-6ms; 10Krpm ones, 7-8ms. Most modern HDs in this class will do ~60MB/s on inner tracks, ~75MB/s on average, and ~90MB/s on outer tracks.

If you are doing OLTP-like things, you are more sensitive to latency than most and should use the absolute lowest-latency HDs available within your budget. The current best case for latency is 15Krpm FC HDs.

>currently grouped as 10 (for wal) + 18 (for data). There's only one
>controller (an emulex), but I hope performance won't suffer too much
>from that. Raid level is 0+1, filesystem is ext3.

I strongly suspect that having only 1 controller is an I/O choke point with 28 HDs. The 28 HDs set up as above, as two RAID 10 sets, can stream ~75MB/s * 5 = ~375MB/s and ~75MB/s * 9 = ~675MB/s respectively. If both sets are to run at peak average speed, the Emulex would have to handle ~1050MB/s on average, and it is doubtful that 1 Emulex can do this. To handle this level of bandwidth, a RAID controller must aggregate multiple FC, SCSI, or SATA streams as well as do any RAID 5 checksumming etc. that is required. Very, very few RAID controllers can do >= 1GB/s.

One thing that helps greatly with bursty IO patterns is to up your battery-backed RAID cache as high as you possibly can. Even multiple GBs of BBC can be worth it. Another reason to have multiple controllers ;-)

Then there is the question of the bandwidth of the bus that the controller is plugged into. ~800MB/s is the real-world max to be gotten from a 64b 133MHz PCI-X channel. PCI-E channels are usually good for 1/10 their rated speed in bps as Bps, so a PCI-E x4 bus rated at 10Gbps can be counted on for ~1GB/s, a PCI-E x8 for ~2GB/s, etc. At present I know of no RAID controller that can singly saturate a PCI-E x4 or wider bus.

...and we haven't even touched on OS, SW, and usage pattern issues. Bottom line: the IO chain is only as fast as its slowest component.

>Now to the interesting part: would it make sense to use different stripe
>sizes on the separate disk arrays?

The short answer is yes. WALs are basically appends, written in bursts of your chosen log chunk size and almost never read afterwards, so big DB pages and big RAID stripes make sense for WALs. Tables with OLTP-like characteristics need smaller DB pages and stripes to minimize latency issues (although locality of reference can make the optimum stripe size larger). Tables with data-mining-like characteristics usually work best with larger DB page sizes and RAID stripe sizes. OS and FS overhead can make things more complicated, and so can DB layout and access pattern issues.

Side note: a 10-HD RAID 10 seems a bit much for WAL. Do you really need ~375MB/s of IO on average to your WAL more than you need IO capacity for other tables? If WAL IO needs to be very high, I'd suggest getting an SSD or SSD-like device that fits your budget and having said device async-mirror to HD.

Bottom line: optimize your RAID stripe sizes =after= you optimize your OS, FS, and pg design for best IO for your usage pattern(s).

I've appended a few quick Python sketches of the arithmetic above, in case they are useful.
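First, the access times. This is a minimal sketch; the average seek
figures are my assumed ballpark numbers for enterprise FC/SCSI drives
of this class, not vendor specs:

# Average access time = average seek + average rotational latency,
# where average rotational latency is half a revolution.
ASSUMED_AVG_SEEK_MS = {15000: 3.5, 10000: 4.7}  # assumptions, not specs

def avg_access_ms(rpm):
    rotational_ms = (60000.0 / rpm) / 2
    return rotational_ms + ASSUMED_AVG_SEEK_MS[rpm]

for rpm in (15000, 10000):
    print(f"{rpm}rpm: ~{avg_access_ms(rpm):.1f}ms average access")
# 15000rpm: ~5.5ms, 10000rpm: ~7.7ms -- i.e. the 5-6ms and 7-8ms ranges above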
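Next, the aggregate bandwidth of your two RAID 10 sets, using the
~75MB/s average per-drive rate assumed above. In a RAID 10, each mirror
pair contributes one striped member to sequential throughput:

AVG_DRIVE_MBPS = 75  # assumed average transfer rate per drive

def raid10_mbps(n_drives):
    # n_drives/2 mirror pairs, all striped together
    return (n_drives // 2) * AVG_DRIVE_MBPS

wal_set = raid10_mbps(10)     # 5 pairs -> 375 MB/s
data_set = raid10_mbps(18)    # 9 pairs -> 675 MB/s
print(wal_set, data_set, wal_set + data_set)   # 375 675 1050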
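Then the host bus. The 1/10 rule of thumb above folds 8b/10b encoding
and protocol overhead into one factor; 2.5Gbps per lane is
first-generation PCI-E:

PCIX_THEORETICAL_MBPS = 8 * 133   # 64b @ 133MHz ~= 1064 MB/s on paper
PCIX_REALWORLD_MBPS = 800         # ~800 MB/s achievable, as noted above

def pcie_usable_mbps(lanes, gbps_per_lane=2.5):
    # rule of thumb: 1/10 of the raw bit rate is usable as bytes/s
    return lanes * gbps_per_lane * 1000 / 10

print(pcie_usable_mbps(4))   # x4: ~1000 MB/s
print(pcie_usable_mbps(8))   # x8: ~2000 MB/s

Note that both RAID sets running flat out (~1050MB/s) would saturate
even the PCI-X channel -- another argument for a second controller.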
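Finally, stripe geometry. The chunk sizes here are purely illustrative,
not recommendations; the point is that full-stripe width scales with
the number of striped members, so the WAL set can take big chunks for
its sequential appends while the data set keeps chunks small enough
that a random 8KB page read lands on a single spindle:

def full_stripe_kb(chunk_kb, striped_members):
    # one full stripe = per-disk chunk size * number of striped members
    return chunk_kb * striped_members

print(full_stripe_kb(256, 5))   # WAL set (5 pairs): 1280KB per full stripe
print(full_stripe_kb(64, 9))    # data set (9 pairs): 576KB full stripe;
                                # an 8KB page read still touches one disk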
Hope this helps, Ron