Re: RAID performance - new kernel results - 5x SSD RAID5

On 21/02/13 17:04, Stan Hoeppner wrote:
> Simply reading 'man top' tells you that hitting 'w' writes the change.
> As you didn't have the per CPU top layout previously, I can only assume
> you don't use top very often, if at all.  top is a fantastic diagnostic
> tool when used properly.  Learn it, live it, love it. ;)

haha, yes, I do use top a lot, but I guess I've never learned it very
well. Everything I know about Linux has been self-taught, and until I
have a problem or a need I don't tend to dig into a tool properly. I've
mostly worked for ISPs as a Linux sysadmin for the past 16 years or
so....
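
(For anyone digging this out of the archives later: on the procps top
here it's '1' that toggles the per-CPU rows and a capital 'W' that
writes the current layout to ~/.toprc, i.e. roughly:

# inside top: press '1' to show per-CPU rows, arrange things, then 'W'
ls -l ~/.toprc

At least that's how it behaves on the version here.)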

>> Output is as follows:
> With HT, this output covers 8 "cpus", and with the line wrapping it's
> hard to make heads or tails of it.  I see in your header you use
> Thunderbird 17 as I do.  Did you notice my formatting of top output
> wasn't wrapped?  To fix the wrapping, after you paste it into the
> compose window, select it all, then click Edit-->Rewrap.  And you get
> this:

Funny, I never thought to use that feature like that. I'd only ever
used it to re-wrap really long lines quoted from someone else's email.
I didn't know it could make my lines longer (without manually adjusting
the global line-wrap character count). Thanks for another useful tip :)

I'll repost numbers after I disable HT; there's no point right now.
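
(In case it helps anyone else: rather than waiting for a window to flip
the BIOS switch, I believe the HT siblings can be taken offline at
runtime, roughly like this - the cpu4 below is just an example, the
topology files say which logical CPUs actually share a core:

# see which logical CPUs are HT twins of each other
grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
# offline the second sibling of each pair, e.g. if cpu4 pairs with cpu0
echo 0 > /sys/devices/system/cpu/cpu4/online

The BIOS toggle is still the proper long-term fix, of course.)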

> We're looking for a pegged CPU, not idle ones.  Most will be idle, or
> should be idle, as this is a block IO server.  And yes, %wa means the
> CPU is waiting on an IO device.  With 5 very fast SSDs in RAID5, we
> shouldn't be seeing much %wa.  And during a sustained streaming write, I
> would expect to see one CPU core pegged at 99% for the duration of the
> FIO run, or close to it.  This will be the one running the mdraid5 write
> thread.  If we see something other than this, such as heavy %wa, that
> may mean there's something wrong elsewhere in the system, either
> kernel/parm, or hardware.

Yes, I'm quite sure that no CPU was close to 0% idle (or 100% sy) for
the duration of the test. In any case, I'll re-run the test and advise
in a few days.
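
This time I'll also capture the per-core numbers in batch mode so the
wrapping problem goes away entirely; something along these lines
(interval and count are just placeholders):

# run alongside fio; mpstat comes from the sysstat package
mpstat -P ALL 5 > mpstat.log &
top -b -d 5 -n 60 > top.log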

> FYI for future Linux server deployments, it's very rare that a server
> workload will run better with HT enabled.  In fact they most often
> perform quite a bit worse with HT enabled.  The ones that may perform
> better are those such as IMAP servers with hundreds or thousands of user
> processes, most sitting idle, or blocking on IO.  For a block IO server
> with very few active processes, and processes that need all possible CPU
> bandwidth for short intervals (mdraid5 write thread), HT reduces CPU
> bandwidth due to switching between two hardware threads on one core.
> 
> Note that Intel abandoned HT with the 'Core' series of CPUs, and
> reintroduced it with the Nehalem series.  AMD has never implemented HT
> (SMT) in its CPUs.  And if you recall, Opterons beat the stuffing out
> of Xeons for many, many years.

Yes, and I truly loved telling customers that AMD CPUs were both
cheaper AND better performing. Those were amazing days for AMD. To be
honest, I don't read enough about CPUs anymore, but my understanding is
that AMD is a little behind on the performance curve, though not far
enough behind that I wouldn't want to use them....

>> I don't think it is from my measurements...
> 
> It may not be but it's too early to tell.  After we have some readable
> output we'll be able to discern more.  It may simply be that you're
> re-writing the same small 15GB section of the SSDs, causing massive
> garbage collection, which in turn causes serious IO delays.  This is one
> of the big downsides to using SSDs as SAN storage and carving it up into
> small chunks.  The more you write large amounts to small sections, the
> more GC kicks in to do wear leveling.  With rust you can overwrite the
> same section of a platter all day long and the performance doesn't change.

True, I can allocate a larger LV for testing (I think I have around
500G free at the moment, so just let me know what size I should
allocate, etc...)
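
Carving out a bigger test LV is easy enough, something like the line
below (the VG/LV names are just placeholders for whatever this box
actually uses, and 100G matches the summary at the end):

lvcreate -L 100G -n fiotest vg_san

and then point fio at /dev/vg_san/fiotest.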

> Whatever the resulting data, it should help point us to the cause of the
> write performance problem, whether it's CPU starvation of the md write
> thread, or something else such as high IO latency due to something like
> I described above, or something else entirely, maybe the FIO testing
> itself.  We know from other peoples' published results that these Intel
> 520s SSDs are capable of seq write performance of 500MB/s with a queue
> depth greater than 2.  You're achieving full read bandwidth, but only
> 1/3rd the write bandwidth.  Work with me and we'll get it figured out.

Sounds good, thanks.
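
For the next run I'm thinking of a job file roughly like this (the
filename is the test LV sketched above, and the bs/iodepth/runtime
values are just a starting point, not gospel):

[seq-write]
filename=/dev/vg_san/fiotest
rw=write
bs=1M
ioengine=libaio
iodepth=4
direct=1
runtime=300

That should keep the queue depth above 2, per the numbers you quoted
for the 520s.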

>> Let me know if you think I
>> should run any other tests to track it down...
> 
> Can't think of any at this point.  Any further testing will depend on
> the results of good top output from the next FIO run.  Were you able to
> get all the SSD partitions starting at a sector evenly divisible by 512
> bytes yet?  That may be of more benefit than any other change.  Other
> than testing on something larger than a 15GB LV.

All drives now look like this (fdisk -ul):

Disk /dev/sdb: 480 GB, 480101368320 bytes
255 heads, 63 sectors/track, 58369 cylinders, total 937697985 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1              64   931770000   465893001   fd  Linux raid autodetect
Warning: Partition 1 does not end on cylinder boundary.

I think (from the list) that this should now be correct...
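
(Sanity-checking my own arithmetic: start sector 64 * 512 bytes = 32
KiB, which is a multiple of 4 KiB, so the partition start should be
aligned as far as the flash is concerned. Newer fdisk defaults to
sector 2048 for 1 MiB alignment, but as I understand it the 4 KiB
alignment is the part that matters; shout if that's wrong. The
cylinder-boundary warning should be harmless.)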

>> One thing I can see is a large number of interrupts and context switches
>> which look like they happened at the same time as a backup run. Perhaps I
>> am getting too many interrupts on the network cards or the SATA controller?
> 
> If cpu0 isn't peaked your interrupt load isn't too high.  Regardless,
> installing irqbalance is a good idea for a multicore iSCSI server with 2
> quad port NICs and a high IOPS SAS controller with SSDs attached.  This
> system is the poster boy for irqbalance.  As the name implies, the
> irqbalance daemon spreads the interrupt load across many cores.  Intel
> systems by default route all interrupts to core0.  The 0.56 version in
> Squeeze, I believe, does static IRQ routing: each device's (HBA)
> interrupts are routed to a specific core based on discovery.  So, say,
> LSI routes to core1, NIC1 to core2, NIC2 to core3.  So you won't get an
> even spread, but at least core0 is no longer handling the entire
> interrupt load.  Wheezy ships with 1.0.3 which does dynamic routing, so
> on heavily loaded systems (this one is actually not) the spread is much
> more even.

OK, currently all IRQs are on CPU0 (/proc/interrupts). I've installed
irqbalance, and it has already started to spread interrupts across the
CPUs. I'm pretty sure I started doing some IRQ balancing a few months
ago, but I was doing it manually: I set the onboard SATA to one CPU,
each pair of ethernet ports to another, and everything else to the
last, and tried to skip the HT CPUs. I think irqbalance is going to be
a better solution, especially once I disable HT.
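
(For reference, the manual approach was just writing hex CPU masks into
/proc, roughly like this - the IRQ numbers below are made up, the real
ones come from /proc/interrupts:

echo 2 > /proc/irq/45/smp_affinity   # SATA/HBA -> cpu1 (mask 0x2)
echo 4 > /proc/irq/46/smp_affinity   # NIC pair -> cpu2 (mask 0x4)

irqbalance should make that hand-tuning unnecessary.)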

> WRT context switches, you'll notice this drop substantially after
> disabling HT.  And if you think this value is high, compare it to one of
> the Terminal Services Xen boxen.  Busy hypervisors and terminal servers
> generate the most CS/s of any platform, by far, and you've got both on a
> single box.

Speaking of which, I've found a few more issues that aren't related to
the RAID write speed, but may affect the end-user experience.

Tonight I will increase each Xen physical box from having 1 CPU pinned
to having 2 CPUs pinned.
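
(With the xm toolstack that should just be something along these lines,
assuming it's done per domU - the domain name and CPU numbers are
examples only:

xm vcpu-set ts1 2
xm vcpu-pin ts1 0 2
xm vcpu-pin ts1 1 3

i.e. give the domU two vCPUs and pin them onto two physical cores.)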

The Domain Controller/file server (Windows 2000) is configured for 2
vCPUs, but is only using one since Windows itself is not set up for
multiple CPUs. I'll change the Windows driver, and in theory this
should allow dual-CPU support.

Generally speaking, complaints have settled down, and I think most
users are basically happy. I've still had a few users with "Outlook
crashing", and I've now seen that usually the PST file is corrupt. I'm
hopeful that running the scanpst tool will fix the corruption and stop
the Outlook crashes. In addition, I've found that the user with the
biggest complaints about performance has a 9GB PST file, so a little
pruning should help there, I suspect.

So, I think between the above couple of things and all the other work
already done, the customer is relatively comfortable (I won't say
happy, but maybe if we can survive a few weeks without any
disaster...). Personally, I'd still like to improve the RAID
performance, simply because it should be better than it is, but at
least I can relax a little and dedicate some time to other jobs, etc...

So, summary:
1) Disable HT
2) Increase test LV to 100G
3) Re-run fio test
4) Re-collect CPU stats

Sound good?

Thanks,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

