Re: LIO performance bottleneck analysis

Hi Zhu,

On Thu, 2015-03-19 at 10:20 +0800, Zhu Lingshan wrote:
> Hi,
> 
> I have been working on LIO performance for several weeks, and now I can 
> share some results and issues. In this mail, I would like to talk about 
> issues with CPU usage and transaction speed. I really hope to get some 
> hints and suggestions from you!
> 
> Summary:
> (1) In the 512-byte, single-process read case, I found the transaction 
> speed is 2.818 MB/s on a 1 Gb network. The running CPU core on the 
> initiator side spent over 80% of its cycles in wait, while one core on 
> the LIO side spent 43.6% in sys, with no cycles in user and none in 
> wait. I assume the bottleneck of this small-packet, single-thread 
> transaction is the lock operations on the LIO target side.
> 
> (2) In the 512-byte, 32-process read case, I found the transaction 
> speed is 11.259 MB/s on a 1 Gb network. There is only one CPU core 
> running on the LIO target side, and its load is 100% in sys, while the 
> other cores are completely idle with no workload. I assume the 
> bottleneck of this small-packet, multi-thread transaction is the lack 
> of workload balancing on the target side.
> 
> ----------------------------------------------------------------------
> Here is the detailed information:
> 
> 
> My environment:
> Two blade servers with E5 CPUs and 32 GB RAM; one runs LIO and the 
> other is the initiator.
> iSCSI backstore: a RAM disk created with the command line "modprobe brd 
> rd_size=4200000 max_part=1 rd_nr=1" (/dev/ram0 on the target, which 
> appears as /dev/sdc on the initiator side).
> 1 Gb network.
> OS: SUSE Linux Enterprise Server on both sides, kernel version 3.12.28-4.
> Initiator: Open-iSCSI Initiator 2.0873-20.4
> LIO-utils: version 4.1-14.6
> My tools: perf, netperf, nmon, FIO
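
As a sanity check of the backend itself, it may be worth running fio
directly against /dev/ram0 on the target first, so the RAM-disk path can
be ruled out before looking at the iSCSI side.  A sketch, reusing the
parameters from the remote runs below (the job name is arbitrary):

   fio -filename=/dev/ram0 -direct=1 -rw=read -bs=512 -size=2G \
       -numjobs=1 -runtime=60 -group_reporting -name=ram0-baseline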
> 
> 
> ----------------------------------------------------------------------
> For case (1):
> 
> In the 512-byte, single-process read case, I found the transaction 
> speed is 2.897 MB/s on a 1 Gb network. The running CPU core on the 
> initiator side spent over 80% of its cycles in wait, while one core on 
> the LIO side spent 43.6% in sys, with no cycles in user and none in 
> wait.
> 
> I ran this test case with the command line:
> fio -filename=/dev/sdc -direct=1 -rw=read -bs=512 -size=2G -numjobs=1 
> -runtime=600 -group_reporting -name=test
> 
> part of the results:
> Jobs: 1 (f=1): [R(1)] [100.0% done] [2818KB/0KB/0KB /s] [5636/0/0 iops] 
> [eta 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=1258: Mon Mar 16 21:48:14 2015
>    read : io=262144KB, bw=2897.8KB/s, iops=5795, runt= 90464msec
> 
> I ran a netperf test with the buffer set to 512 bytes and 512 bytes per 
> packet and got a transaction speed of 6.5 MB/s, better than LIO did, so 
> I tried nmon and perf to find out why.

So 6.5 MB/sec bandwidth with netperf for small packets seems really low,
even for a 1 Gb/sec port with 1500 byte MTU.
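
A quick way to double-check raw small-packet TCP throughput outside of
iSCSI would be something along these lines (a sketch assuming netperf
2.x and a hypothetical target address of 192.168.1.100):

   # 512-byte sends with Nagle disabled, 60 second run
   netperf -H 192.168.1.100 -t TCP_STREAM -l 60 -- -m 512 -D

   # request/response rate for 512-byte requests and responses
   netperf -H 192.168.1.100 -t TCP_RR -l 60 -- -r 512,512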

> We can see that on the initiator side only one core is running, which 
> is fine, but this core spent 83.8% in wait, which seems strange, while 
> on the LIO target side the only running core spent 43.6% in sys, with 
> no cycles in user or wait. Why did the initiator wait while there were 
> still free resources (CPU cycles) on the target side? I then used perf 
> record to monitor the LIO target and found that locks, especially spin 
> locks, consumed nearly 40% of the CPU cycles. I assume this is why the 
> initiator side showed wait and low speed; lock operations are the 
> bottleneck of this case (small packet, single-thread transaction). Do 
> you have any comments on that?
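
To see which locks those cycles actually go to, a system-wide call-graph
profile on the target while the fio job runs would help; a sketch (the
exact hot symbols will depend on the kernel build):

   # on the LIO target, while the initiator-side fio job is running
   perf record -a -g -- sleep 30
   perf report --stdio --sort symbol

   # or a live view of the hottest kernel symbols with call chains
   perf top -g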
> 
> ----------------------------------------------------------------------
> 
> For case (2):
> In the 512-byte, 32-process read case, I found the transaction speed 
> is 11.259 MB/s on a 1 Gb network. There is only one CPU core running 
> on the LIO target side, and its load is 100% in sys, while the other 
> cores are completely idle with no workload.
> 
> I ran the case with this command line:
> fio -filename=/dev/sdc -direct=1 -rw=read -bs=512 -size=4GB 
> -numjobs=32 -runtime=600 -group_reporting -name=test
> 
> The speed is 11.259 MB/s. On the LIO target side, I found only one CPU 
> core running, with all other cores completely idle. It seems there is 
> no workload-balancing scheduler, and that appears to be the bottleneck 
> of this case (small packet, multi-thread transaction). Would it make 
> sense to add some code to balance the transaction traffic across all 
> cores? I hope to get some hints, suggestions, and explanations from you 
> experts!

As mentioned by Sagi, I don't think you're hitting any LIO bottlenecks
at ~10 MB/sec with a BRD backend.

I'd recommend troubleshooting the network first, to figure out why small
packet performance is so low regardless of application layer protocol.
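
One thing worth checking there is whether all NIC interrupt and softirq
processing lands on a single core on the target, which would match the
one-core-at-100%-sys picture in case (2).  A sketch, assuming the
interface is eth0 and sysstat is installed:

   # per-CPU utilization while the test runs
   mpstat -P ALL 1

   # how the NIC's interrupts are spread across CPUs
   grep eth0 /proc/interrupts

   # example: spread receive processing across CPUs 0-3 via RPS
   echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus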

Also, it would be useful to pinpoint at which packet size the network
begins to have performance problems.  E.g., at which FIO block_size are
you able to saturate the 1 Gb/sec link (~110 MB/sec)?
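
A simple block-size sweep from the initiator would show where the link
tops out; a sketch reusing the fio parameters from above:

   for bs in 512 4k 16k 64k 256k 1M; do
       fio -filename=/dev/sdc -direct=1 -rw=read -bs=$bs -size=2G \
           -numjobs=1 -runtime=60 -group_reporting -name=bs-$bs
   done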

--nab




