Re: Raid10 multi core scaling

Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> · Wed, 27 Nov 2013 00:52:36 -0600

On 11/26/2013 4:58 AM, Pedro Teixeira wrote:
>    I created a Raid10 array with 16 sata 1TB disks and the array
> performance seems to be limited by the md0_raid10 taking 99% of one
> core and not scaling to other cores.

The md RAID 5/6/10 drivers have a single write thread.  If you push
enough write IO you will peak one CPU core and hit a wall.  An effort is
currently underway to make use of multiple write threads, but this code
is not ready yet.

> I tried overclocking the cpu cores and this led to
> a small increase in performance ( but md0_raid10 keeps eating 99% of one
> core ).
> 
>    I'm using:
>     - a phenom X6 at 3600mhz
>     - 16 seagate SSHDs ( sata3 7200RPM with 8GB ssd cache )

So with this hardware you'll peak one CPU core until you've written
somewhere around 64GB, at which point you will have saturated the flash
cache on the drives.  After this point you should see a change from
being CPU bound to being disk bound, as you're writing at spindle speed.
4x Marvell 88SE9230 based HBAs w/PCIe 2.0 x2 interfaces limit you to
4GB/s read/write throughput to flash cache.  The drives' spindle
performance limits you to 2GB/s.  So somewhere between 2 and 4GB/s your
3.6GHz Phenom core is running out of juice.
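
For anyone who wants to check the arithmetic behind those figures, here
is a rough sketch in Python.  The 125MB/s per-drive streaming rate and
the two-copy mirror layout are assumptions on my part (they are simply
what makes 16 drives come out at 2GB/s and 64GB), not measured values.

```python
# Back-of-the-envelope limits for the 16-drive array discussed above.
DRIVES = 16
FLASH_PER_DRIVE_GB = 8      # 8GB flash cache per SSHD, from the drive list
HBAS = 4                    # 4x Marvell 88SE9230 based HBAs
LANES_PER_HBA = 2           # PCIe 2.0 x2
PCIE2_LANE_MB_S = 500       # PCIe 2.0: 5 GT/s per lane with 8b/10b encoding
SPINDLE_MB_S = 125          # ASSUMED per-drive streaming rate
MIRROR_COPIES = 2           # ASSUMED: every block is written to two drives


def hba_limit_mb_s():
    """Aggregate HBA bandwidth, the ceiling while writes land in flash."""
    return HBAS * LANES_PER_HBA * PCIE2_LANE_MB_S


def spindle_limit_mb_s():
    """Aggregate platter bandwidth, the ceiling once flash is full."""
    return DRIVES * SPINDLE_MB_S


def flash_absorbed_gb():
    """User data written before the flash caches fill (mirrored writes)."""
    return DRIVES * FLASH_PER_DRIVE_GB // MIRROR_COPIES


if __name__ == "__main__":
    print("HBA limit     :", hba_limit_mb_s(), "MB/s")
    print("Spindle limit :", spindle_limit_mb_s(), "MB/s")
    print("Flash absorbs :", flash_absorbed_gb(), "GB of writes")
```

Change the constants to match your own drives and controllers; the
point is only that the single md write thread gives out somewhere
between the two bandwidth ceilings.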

You should not be CPU/thread limited while reading, as reads are not
limited to a single thread.  With a pure streaming read you should be
able to get close to 4GB/s throughput, and you'll see multiple cores in
play, but the work is being done by other kernel IO threads, not the md
thread.

>    what I did to test performance was to force a check on the array,
> and this

This only tells you the behavior of resync, not a normal workload.

> leads to mdadm reporting a speed of about 990000K/sec. The hard disks
> report a 54% utilization. ( Overclocking the cpu by 200mhz increases the
> resync speed a bit and the hdd's utilization to about 58% )
> 
>    If I do the same with a raid5 array instead of raid10, then resync
> speed will be almost double of raid10, the harddisk utilization
> reported will be 98-100% and I can see at least two cores being used.

This is an apples to oranges comparison, so saying resync speed of RAID5
is double that of RAID10 doesn't mean anything.  Also, the RAID5 core
utilization you see is due to RAID5 using a second core for parity
calculations.

If you want RAID10 and you're hitting a wall at one core, your best
option currently is to build 8 RAID1 devices and build a RAID0 device of
these.  If resync is your preferred test method then you'd fire up 8
resyncs of the 8 RAID1 devices, in parallel, then sum the run times.
You can't resync a RAID0 device.  The total run time should be
significantly lower than using md/RAID10 or md/RAID5.  And you'll see
multiple cores in play, all of them actually, because you'll have 8
RAID1 devices and 6 cores.  But the utilization per core will be quite low.
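
As a rough illustration of that layout, the Python sketch below only
builds the mdadm command lines and prints them; it does not run
anything.  The disk names /dev/sdb through /dev/sdq and the md device
numbers are hypothetical, so substitute your own, and add chunk size or
metadata options as you see fit.

```python
def stripe_over_mirrors_commands(disks, stripe_dev="/dev/md0"):
    """Return mdadm commands for RAID1 pairs with a RAID0 stripe on top.

    Consecutive disks in the list are paired into mirrors /dev/md1,
    /dev/md2, ... (hypothetical names), then striped into stripe_dev.
    """
    if len(disks) < 2 or len(disks) % 2:
        raise ValueError("need an even number of disks, at least two")

    cmds, mirrors = [], []
    for i in range(0, len(disks), 2):
        md = f"/dev/md{i // 2 + 1}"
        mirrors.append(md)
        cmds.append(
            f"mdadm --create {md} --level=1 --raid-devices=2 "
            f"{disks[i]} {disks[i + 1]}"
        )

    # One RAID0 device across all of the mirrors.
    cmds.append(
        f"mdadm --create {stripe_dev} --level=0 "
        f"--raid-devices={len(mirrors)} " + " ".join(mirrors)
    )
    return cmds


# Hypothetical device names for the 16 drives.
EXAMPLE_DISKS = [f"/dev/sd{c}" for c in "bcdefghijklmnopq"]

if __name__ == "__main__":
    for cmd in stripe_over_mirrors_commands(EXAMPLE_DISKS):
        print(cmd)
```

With 16 disks this prints eight RAID1 creates followed by one RAID0
create.  Review the output before pasting any of it into a root shell.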

There are other options to get around the core saturation problem.  You
could create multiple md/RAID10 arrays and lay a stripe over them or
concatenate them, such as a 2x8 or 4x4.  But you really must know what
you're doing to get the nested striping right, or to properly lay out
XFS AGs on a concatenation.  If not done properly, performance could be
worse than what you have now.

Given the stripe over mirrors gets all cores in play, and doesn't have
such pitfalls, it's the better option by far.

-- 
Stan