On 2/7/2013 12:48 AM, Adam Goryachev wrote:

> I'm trying to resolve a significant performance issue (not arbitrary dd
> tests, etc but real users complaining, real workload performance).

It's difficult to analyze your situation without even a basic description
of the workload(s). What is the file access pattern? What types of files?

> I'm currently using 5 x 480GB SSD's in a RAID5 as follows:
> md1 : active raid5 sdf1[0] sdc1[4] sdb1[5] sdd1[3] sde1[1]
>       1863535104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5]
...
> Each drive is set to the deadline scheduler.

Switching to noop may help a little, as may disabling NCQ, i.e. putting
the driver in native IDE mode, or setting the queue depth to 1.

> Drives are:
> Intel 520s MLC 480G SATA3
> Supposedly Read 550M/Write 520M
> I think the workload being generated is simply too much for the
> underlying drives.

Not possible. With an effective spindle width of 4, these SSDs can do
~80K random read/write IOPS sustained. To put that into perspective, you
would need a ~$150,000 high-end FC SAN array controller with 270 15K SAS
drives in RAID0 to get the same IOPS. The problem is not the SSDs.
Probably not the controller either.

> I've been collecting the information from
> /sys/block/<drive>/stat every 10 seconds for each drive. What makes me
> think the drives are overworked is that the backlog value gets very high
> at the same time the users complain about performance.

What is "very high"? Since you mention "backlog" I'll assume you're
referring to field #11. If so, note that on my idle server (field #9 is
0), it is currently showing 434045280 for field #11. That's apparently a
weighted value in milliseconds, and apparently it's not reliable as a
diagnostic value. What you should be looking at is field #9, which simply
tells you how many IOs are in progress. But even if this number is high,
which it can be with SSDs, it doesn't tell you whether the drive is
performing properly or not. What you should be using is iotop or
something similar, though even that isn't going to be all that
informative.

> The load is a bunch of windows VM's, which were working fine until
> recently when I migrated the main fileserver/domain controller on
> (previously it was a single SCSI Ultra320 disk on a standalone machine).
> Hence, this also seems to indicate a lack of performance.

You just typed 4 lines and told us nothing of how this relates to the
problem you wish us to help you solve. Please be detailed.

> Currently the SSD's are connected to the onboard SATA ports (only SATA II):
> 00:1f.2 SATA controller: Intel Corporation Cougar Point 6 port SATA AHCI
> Controller (rev 05)

Unless this Southbridge has a bug (I don't have time to research it),
this isn't the problem.

> There is one additional SSD which is just the OS drive also connected,
> but it is mostly idle (all it does is log the stats/etc).

Irrelevant.

> Assuming the issue is underlying hardware

It's not.

> 1) Get a battery backed RAID controller card (which should improve
> latency because the OS can pretend it is written while the card deals
> with writing it to disk).

[BB/FB]WC is basically useless with SSDs. LSI has the best boards, and
the "FastPath" option for SSDs basically disables the onboard cache to
get it out of the way. Enterprise SSDs have extra capacitance allowing
for cache flushing on power loss, so battery/flash protection on the RAID
card isn't necessary. The write cache on the SSDs themselves is faster in
aggregate than the RAID card's ASIC and cache RAM interface, so having
BBWC enabled on the card with SSDs actually slows you down. So, in short,
this isn't the answer to your problem, either.
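Going back to the elevator and the in-flight counter for a moment, this
is roughly what I'd try and watch, as root. I'm assuming sdb through sdf
are the five array members per your mdstat; adjust the device names to
suit, and note none of this survives a reboot, so it's cheap to
experiment with:

  # switch each member to the noop elevator and drop the queue depth to 1
  # (queue_depth=1 effectively takes NCQ out of the picture)
  for d in sdb sdc sdd sde sdf; do
      echo noop > /sys/block/$d/queue/scheduler
      echo 1    > /sys/block/$d/device/queue_depth
  done

  # print field #9 of each member's stat file (IOs currently in flight)
  # once a second while the users are complaining
  while sleep 1; do
      for d in sdb sdc sdd sde sdf; do
          echo "$d: $(awk '{print $9}' /sys/block/$d/stat)"
      done
      echo
  done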
> 2) Move from a 5 disk RAID5 to a 8 disk RAID10, giving better data
> protection (can lose up to four drives) and hopefully better performance
> (main concern right now), and same capacity as current.

You've got plenty of hardware performance. Moving to RAID10 will simply
cost more money with no performance gain. Here's why: md/RAID5 and
md/RAID10 both rely on a single write thread. If you've been paying
attention on this list you know that patches are in the works to fix
this, but they are not, AFAIK, in mainline yet, and a long way from being
in distro kernels. So, you've got maximum possible read performance now,
but your *write performance is limited to a single CPU core* with both of
these RAID drivers.

If your problem is write performance, your only solution at this time
with md is to use a layered RAID, such as RAID0 over RAID1 pairs, or
linear over RAID1 pairs. This puts all of your cores in play for writes
(there's a rough sketch of that layout in the P.S. at the bottom of this
mail). The reason this is an issue is that even a small number of SSDs
can overwhelm a single md thread, which is limited to one core of
throughput. This has also been discussed thoroughly here recently.

> The real questions are:
> 1) Is this data enough to say that the performance issue is due to
> underlying hardware as opposed to a mis-configuration?

No, it's not. We really need more specific workload data.

> 2) If so, any suggestions on specific hardware which would help?

It's not a hardware problem. Given that it's a VM consolidation host, I'd
guess it's a hypervisor configuration problem.

> 3) Would removing the bitmap make an improvement to the performance?

I can't say this any more emphatically: you have 5 of Intel's best
consumer SSDs and an Intel mainboard. The problem is not your hardware.

> Motherboard is Intel S1200BTLR Serverboard - 6xSATAII / Raid 0,1,10,5
>
> It is possible to wipe the array and re-create it, if that would help.......

Unless you're write-IOPS starved due to md/RAID5 as I described above,
blowing away the array and creating a new one isn't going to help. You
simply need to investigate further. And if you would like continued
assistance, you'd need to provide much greater detail of the hardware and
workload. You didn't mention your CPU(s) model/freq, which matters
greatly with RAID5 and SSDs. Nor RAM type/capacity, network topology, nor
number of users and what applications they're running when they report
the performance problem. Nor did you mention which hypervisor and
kernel/distro you're using, how many Windows VMs you're running, the
primary workload of each, etc, etc, etc.

> Any comments, suggestions, advice greatly received.

More information, please.

-- 
Stan
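P.S. Purely as an illustration of the layered layout I mentioned (RAID0
over RAID1 pairs), it looks something like the below. The md numbers and
the eight member devices (sdb1-sdi1) are made up for the example, so
don't treat this as a migration recipe for your box:

  # four md/RAID1 pairs, each with its own write thread
  mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
  mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/sdd1 /dev/sde1
  mdadm --create /dev/md13 --level=1 --raid-devices=2 /dev/sdf1 /dev/sdg1
  mdadm --create /dev/md14 --level=1 --raid-devices=2 /dev/sdh1 /dev/sdi1

  # a RAID0 stripe across the pairs; writes now spread over four threads
  mdadm --create /dev/md10 --level=0 --raid-devices=4 --chunk=64 \
      /dev/md11 /dev/md12 /dev/md13 /dev/md14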