Re: LIO performance bottleneck analysis

Hi Sagi,

Thanks a lot for your guidance. I tried your ideas; here are the results:

(1) Did you use any block layer settings (e.g. I/O scheduler, rq_affinity,
nomerges) to mask out any staging effects?

No, in the previous tests I just used the default settings
(the backstore is a RAM disk created with "modprobe brd rd_size=4200000 max_part=1 rd_nr=1").

The defaults were:

- nomerges=0: this default seems to help I/O
- rq_affinity=0: I will try setting it to 1; it seems to help once a single CPU core reaches full load
- hw_sector_size=512, minimum_io_size=512, optimal_io_size=512, physical_block_size=512
- nr_requests=128
- rotational=1: this seems strange for a RAM disk, so I will try 0, though I think it should never be set to 0 for a real rotating disk

When I set rotational to 0 in the single-process 512-byte read case, it did not help transaction throughput, but on the LIO target side it cut the single core's SYS cycles by 5%. I also set rq_affinity to 1, but I did not drive a single core to full load this time; perhaps I hit the switch's limit instead. Still, I think setting rq_affinity to 1 may improve performance when a single core is 100% loaded.

So, should we do some performance tuning by changing these parameters from their default values?
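The tunables above live under /sys/block/<dev>/queue. Here is a minimal sketch of applying the values discussed; the queue directory is parameterized so the function can be exercised against any path, and /sys/block/ram0/queue is just the expected location for the brd device in this setup:

```shell
#!/bin/sh
# tune_queue: write the block-layer queue settings discussed above
# into a given queue directory and echo the resulting values back.
tune_queue() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "skip: $dir not present"
        return 0
    fi
    echo 0 > "$dir/rotational"     # RAM disk is not rotational
    echo 1 > "$dir/rq_affinity"    # complete requests on the submitting CPU
    echo 2 > "$dir/nomerges"       # skip merge lookups for tiny random I/O
    for f in rotational rq_affinity nomerges; do
        echo "$f=$(cat "$dir/$f")"
    done
}

# On the target this would be the brd device's queue directory:
tune_queue "${1:-/sys/block/ram0/queue}"
```

Note these writes need root on a real system, and the settings do not persist across reboot.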

(2) The iSCSI target attaches a session's threads to a specific core. This
is useful for latency, as it minimizes the number of context switches.

If you were to use more sessions (it is common for a storage array to
service more than a single initiator) you would see that core
utilization scales linearly with the number of active sessions. Having said
that, I'm sure there is room to improve the single-session case.

Yes! I tried more than one initiator, and I can see more than one core running on the LIO side. Thanks for the information!
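To reproduce the sessions-scale-linearly observation, one can script the logins from the initiator hosts. A sketch that only prints the open-iscsi commands (the target IQN and portal addresses here are hypothetical placeholders):

```shell
#!/bin/sh
# Print the open-iscsi login commands that would establish one session
# per listed portal. TARGET_IQN and the portals are placeholders.
TARGET_IQN="iqn.2003-01.org.linux-iscsi.target:ramdisk"

print_logins() {
    for portal in "$@"; do
        echo "iscsiadm -m node -T $TARGET_IQN -p $portal --login"
    done
}

print_logins 192.168.1.10:3260 192.168.1.11:3260
```

Running the printed commands (one per initiator host, after discovery) should light up one target core per active session, matching the behavior described above.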


(3) I wonder if you would get different results by running the rd_mcp backstore
to mask out the backend block layer effects (although these should be
minor).

In the 512-byte read case this helped, improving throughput from 2.8 MB/s to ~3.0 MB/s. I think LIO may be doing direct RAM operations instead of going through the block layer functions.

(4) As I mentioned above, core utilization would be linear in
the number of active sessions, not the number of IO threads
on the initiator side.

I believe this would improve once the iSCSI initiator supports
multi-queue.

I totally agree with you, and I saw the number of running cores scale linearly with the active sessions. I have read your mail on iSCSI multi-queue; maybe I can get more documents and try to help there.

Anyway, I am not only trying to find the bottleneck of LIO; more importantly, I am trying to improve it and contribute. Thanks for your help and suggestions. Could you give me some comments on the test environment or test cases for measuring and finding LIO's bottleneck? Or what is your view on where the bottleneck is and where we can improve it?

Thanks again!
Have a nice day!

Zhu Lingshan



On 03/19/2015 07:01 PM, Sagi Grimberg wrote:
On 3/19/2015 4:20 AM, Zhu Lingshan wrote:

Hi,

I have been working on LIO performance for weeks; now I can release
some results and issues. In this mail I would like to talk about issues
in CPU usage and transaction speed. I really hope I can get some hints
and suggestions from you!

Summary:
(1) In the 512-byte, single-process read case, I found the transaction
speed is 2.818 MB/s on a 1 Gb network. The running CPU core on the initiator
side spent over 80% of its cycles in wait, while one core on the LIO side spent
43.6% in sys, with no cycles in user and none in wait. I assume the
bottleneck of this small-packet, single-thread transaction is the lock
operations on the LIO target side.

Did you use any block layer settings (e.g. I/O scheduler, rq_affinity,
nomerges) to mask out any staging effects?


(2) In the 512-byte, 32-process read case, I found the transaction speed
is 11.259 MB/s on a 1 Gb network, and only one CPU core on the LIO target
side was running, at 100% load in sys, while all other cores were
completely idle. I assume the bottleneck of this small-packet,
multi-thread transaction is the lack of workload balancing on the
target side.

The iSCSI target attaches a session's threads to a specific core. This
is useful for latency, as it minimizes the number of context switches.

If you were to use more sessions (it is common for a storage array to
service more than a single initiator) you would see that core
utilization scales linearly with the number of active sessions. Having said
that, I'm sure there is room to improve the single-session case.


---------------------------------------------------------------------------

Here are all detailed information:


My environment:
Two blade servers with E5 CPUs and 32 GB RAM; one runs LIO and the other is
the initiator.
iSCSI backstore: a RAM disk created with the command line "modprobe brd
rd_size=4200000 max_part=1 rd_nr=1" (/dev/ram0 on the target, /dev/sdc on
the initiator side).

I wonder if you would get different results by running the rd_mcp backstore
to mask out the backend block layer effects (although these should be
minor).

1 Gb network.
OS: SUSE Linux Enterprise Server on both sides, kernel version 3.12.28-4.
Initiator: Open-iSCSI Initiator 2.0873-20.4
LIO-utils version: 4.1-14.6
My tools: perf, netperf, nmon, fio


---------------------------------------------------------------------------

For case (1):

In the 512-byte, single-process read case, the transaction speed is
2.897 MB/s on a 1 Gb network; the running CPU core on the initiator side
spent over 80% of its cycles in wait, while one core on the LIO side spent
43.6% in sys, with no cycles in user or wait.

I ran this test case with the command line:
fio -filename=/dev/sdc  -direct=1 -rw=read  -bs=512 -size=2G -numjobs=1
-runtime=600 -group_reporting -name=test.

part of the results:
Jobs: 1 (f=1): [R(1)] [100.0% done] [2818KB/0KB/0KB /s] [5636/0/0 iops]
[eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1258: Mon Mar 16 21:48:14 2015
   read : io=262144KB, bw=2897.8KB/s, iops=5795, runt= 90464msec

I ran a netperf test with the buffer and message size set to 512 bytes
and got a transaction speed of 6.5 MB/s, better than LIO did, so
I used nmon and perf to find out why.
This is what nmon showed about CPU on the initiator side:


nmon 14i, Hostname=INIT, Refresh=10 secs, 21:30.42

CPU Utilisation:

CPU   User%   Sys%  Wait%   Idle%
  1     0.0    0.0    0.2    99.8
  2     0.1    0.1    0.0    99.8
  3     0.0    0.2    0.0    99.8
  4     0.0    0.0    0.0   100.0
  5     0.0    0.0    0.0   100.0
  6     0.0    3.1    0.0    96.9
  7     2.8   12.2   83.8     1.2
  8     0.0    0.0    0.0   100.0
  9     0.0    0.0    0.0   100.0
 10     0.0    0.0    0.0   100.0
 11     0.0    0.0    0.0   100.0
 12     0.0    0.0    0.0   100.0
Avg     0.2    1.1    5.8    92.8


We can see that on the initiator side only one core is running, which is
fine, but this core spent 83.8% in wait, which seems strange, while on
the LIO target side the only running core spent 43.6% in sys, with no
cycles in user or wait. Why does the initiator wait while there are still
free resources (CPU cycles) on the target side? I then used perf
record to monitor the LIO target and found that locks, especially spinlocks,
consumed nearly 40% of CPU cycles. I assume this is why the
initiator side showed wait and low speed: lock operations are the
bottleneck of this case (small packets, single-thread transaction). Do you
have any comments on that?

IMO, the fact that your single core is not at 100% means that the
bottleneck does not originate in spinlock contention.


---------------------------------------------------------------------------


For case (2):
In the 512-byte, 32-process read case, the transaction speed is
11.259 MB/s on a 1 Gb network, and only one CPU core on the LIO target
side was running, at 100% load in sys, while all other cores were
completely idle.

I ran the case with this command line:
fio -filename=/dev/sdc  -direct=1 -rw=read  -bs=512 -size=4GB
-numjobs=32 -runtime=600 -group_reporting -name=test.

The speed is 11.259 MB/s. On the LIO target side I found only one CPU
core running, with all other cores completely idle. It seems there is no
workload-balancing scheduler, and that this is the bottleneck of this
case (small packets, multi-thread transaction). Would it make sense to
add some code to balance the transaction traffic across all cores? I hope
to get some hints, suggestions, and explanations from you experts!

As I mentioned above, core utilization would be linear in
the number of active sessions, not the number of IO threads
on the initiator side.

I believe this would improve once the iSCSI initiator supports
multi-queue.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html





