On 12/1/2013 9:48 PM, lilofile wrote:
>> #1 will eventually be addressed with a multi-thread patch to the
>> various RAID drivers including RAID5
>
> What are the differences between the multi-thread patch and
> CONFIG_MULTICORE_RAID456?

I can't find the original description for that option, but I can tell
you that:

1. It was experimental
2. Neil Brown requested its complete removal from git in March 2013:
   http://permalink.gmane.org/gmane.linux.kernel.commits.head/372527

> My understanding is that with CONFIG_MULTICORE_RAID456
>
> enum {
>         STRIPE_OP_BIOFILL,
>         STRIPE_OP_COMPUTE_BLK,
>         STRIPE_OP_PREXOR,
>         STRIPE_OP_BIODRAIN,
>         STRIPE_OP_RECONSTRUCT,
>         STRIPE_OP_CHECK,
> };
>
> these operations on a stripe can be scheduled onto other CPUs to run,
> while the multi-thread patch mainly addresses lock contention between
> threads. Is this understanding correct?

Shaohua Li has been working on multi-threaded md drivers to fix the CPU
bottleneck with SSD storage for some time now. He's currently focusing
on raid5.c. See:

http://lwn.net/Articles/500200/
http://www.spinics.net/lists/raid/msg44699.html

AFAIK this work is not yet fully completed nor thoroughly tested, nor
included in a stable release.

Shaohua, could you give us a quick update on the status of your RAID5
multi-thread work? Demand for it seems to be steeply increasing
recently: this current thread, and another last week with slow RAID10
on the new hybrid SSD/rust drives.

> ------------------------------------------------------------------
> From: lilofile <lilofile@xxxxxxxxxx>
> Sent: Thursday, 28 November 2013, 19:54
> To: stan <stan@xxxxxxxxxxxxxxxxx>; Linux RAID <linux-raid@xxxxxxxxxxxxxxx>
> Subject: Re: Re: md raid5 performance 6x SSD RAID5
>
> I have changed stripe_cache_size from 4096 to 8192. The test results
> show the performance improves by less than 5%, so the effect is not
> very obvious.

IIRC, this was before you started testing with FIO.
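For context on what bumping stripe_cache_size costs: md's stripe cache is sized in pages per member device, so its RAM footprint can be estimated with quick arithmetic. This is only a sketch; the 6-device count comes from the array in this thread, and the 4 KiB page size assumes a typical x86 kernel.

```shell
# Sketch: estimate md stripe cache RAM use.
# memory = stripe_cache_size * PAGE_SIZE * nr_member_devices
stripe_cache_size=8192   # entries (pages per device), the value under test
page_size=4096           # bytes; typical x86 PAGE_SIZE
nr_disks=6               # the 6-drive array from this thread
bytes=$((stripe_cache_size * page_size * nr_disks))
echo "stripe cache footprint: $((bytes / 1024 / 1024)) MiB"
```

At 8192 entries the cache costs roughly 192 MiB, which is trivial on a 32 GB machine, so memory is unlikely to be the reason the bump helped so little.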
I'd really like to see your streaming read/write results from FIO with
the command line I gave you, for each of these 3 stripe_cache_size
values.

BTW, you don't need to set a timer. size=30G limits the test to 30 GB.
I chose this value because the test runs should only take ~15 s at this
size. Go any smaller and it makes capturing accurate data more
difficult.

The reason for running the streaming tests is that they eliminate the
RMW code path and any associated latencies you get with the random
write test. The command line I gave you should give us an idea of the
peak streaming read/write throughput of your SSD RAID5 array, with the
only limitation being single core performance.

To discover how much CPU is being burned, run the following
concurrently with each FIO test, once FIO initialization is complete
and the actual read/write tests begin. It will show us what your CPU
consumption looks like and whether you're hitting the single core
ceiling with the md write thread. This gives you 20 seconds of CPU
stats polled every 0.5 s:

~# top -b -n 40 -d 0.5 |grep Cpu|mawk '{print ($1,$3,$4) }'

This will generate a lot of output. Piping through mawk cleans it up,
making it easier to see which CPU is running the md write thread during
your write tests. The FIO threads will execute in user space, the md
write thread in system space. You won't see one core peaking during
read tests, as any/all CPUs may be used.

Which kernel version are you using? I don't recall you saying. With
later kernels, IIRC, the parity calculations are offloaded to another
thread, so you may see high load on two cores.
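The FIO command line referred to above is not quoted in this excerpt, so the following is only a hypothetical sketch of a streaming test of that shape; every fio parameter below is an assumption, not the original value. The arithmetic at the end shows why size=30G yields a ~15 s run.

```shell
# HYPOTHETICAL fio invocation -- the original command line is not quoted
# in this excerpt, so all parameters here are assumptions:
#
#   fio --name=stream-write --filename=/dev/md126 --rw=write --bs=1M \
#       --ioengine=libaio --iodepth=32 --direct=1 --size=30G
#
# Run the CPU sampler alongside it (40 polls * 0.5 s = 20 s of stats):
#
#   top -b -n 40 -d 0.5 | grep Cpu | mawk '{print ($1,$3,$4) }'

# Why size=30G finishes in roughly 15 s: at the ~2 GB/s peak streaming
# rate seen elsewhere in this thread, 30 GB / 2 GB/s = 15 s.
size_gb=30
assumed_rate_gbps=2
echo "expected runtime: ~$((size_gb / assumed_rate_gbps)) s"
```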
> ------------------------------------------------------------------
> From: Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>
> Sent: Thursday, 28 November 2013, 12:41
> To: lilofile <lilofile@xxxxxxxxxx>; Linux RAID <linux-raid@xxxxxxxxxxxxxxx>
> Subject: Re: Re: md raid5 performance 6x SSD RAID5
>
> On 11/27/2013 7:51 AM, lilofile wrote:
>> additional: CPU: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>> memory: 32GB
> ...
>> when I create a RAID5 which uses six SSDs (sTEC s840),
>> with stripe_cache_size set to 4096:
>>
>> root@host1:/sys/block/md126/md# cat /proc/mdstat
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> md126 : active raid5 sdg[6] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
>>       3906404480 blocks super 1.2 level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]
>>
>> the single SSD read/write performance:
>>
>> root@host1:~# dd if=/dev/sdb of=/dev/zero count=100000 bs=1M
>> ^C76120+0 records in
>> 76119+0 records out
>> 79816556544 bytes (80 GB) copied, 208.278 s, 383 MB/s
>>
>> root@host1:~# dd of=/dev/sdb if=/dev/zero count=100000 bs=1M
>> 100000+0 records in
>> 100000+0 records out
>> 104857600000 bytes (105 GB) copied, 232.943 s, 450 MB/s
>>
>> the RAID read and write performance is approx. 1.8 GB/s read and
>> 1.1 GB/s write:
>>
>> root@sc0:/sys/block/md126/md# dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
>> 100000+0 records in
>> 100000+0 records out
>> 104857600000 bytes (105 GB) copied, 94.2039 s, 1.1 GB/s
>>
>> root@sc0:/sys/block/md126/md# dd of=/dev/zero if=/dev/md126 count=100000 bs=1M
>> 100000+0 records in
>> 100000+0 records out
>> 104857600000 bytes (105 GB) copied, 59.5551 s, 1.8 GB/s
>>
>> why is the performance so bad? especially the write performance.
>
> There are 3 things that could be, or are, limiting performance here.
>
> 1. The RAID5 write thread peaks one CPU core, as it is single threaded
> 2. A stripe_cache_size of 4096 is too small for 6 SSDs; try 8192
> 3.
>    dd issues IOs serially and will thus never saturate the hardware
>
> #1 will eventually be addressed with a multi-thread patch to the
> various RAID drivers, including RAID5. There is no workaround at this
> time.
>
> To address #3, use FIO or a similar testing tool that can issue IOs in
> parallel. With SSD-based storage you will never reach maximum
> throughput with a serial data stream.
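To see how far the 1.1 GB/s write figure is from what the drives can deliver, a back-of-envelope ceiling can be computed from the per-drive numbers quoted in this thread. This is only a sketch: it ignores parity computation cost, controller limits, and bus bandwidth.

```shell
# RAID5 over n drives streams large writes at roughly (n-1) drives'
# worth of bandwidth, since one drive's worth per stripe goes to parity.
nr_disks=6
per_disk_write_mbs=450   # single-SSD dd write rate from this thread
ceiling=$(( (nr_disks - 1) * per_disk_write_mbs ))
echo "theoretical streaming write ceiling: ${ceiling} MB/s"
# Observed: ~1100 MB/s -- roughly half the ceiling, consistent with the
# single-threaded RAID5 write path saturating one core.
```

The gap between the ~2250 MB/s ceiling and the observed ~1100 MB/s is what the multi-threaded write-path work discussed above aims to close.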