Re: lio target iscsi multiple core performance

On Wed, 2013-10-02 at 22:59 -0500, Xianghua Xiao wrote:
> First, I use lio-utils instead of targetcli, as this is an embedded box
> with very limited Python packages built in.
> 
> On Wed, Oct 2, 2013 at 5:26 PM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
> > On Wed, 2013-10-02 at 14:07 -0500, Xianghua Xiao wrote:
> >> after I changed default_cmdsn_depth to 64, I use iometer to do READs:
> >> only core0 is busy; for WRITEs, all cores (12 of them) are equally busy.
> >>
> >
> > Have you been able to isolate the issue down to per session
> > performance..?  What happens when the same MD RAID backend is accessed
> > across multiple sessions via a different TargetName+TargetPortalGroupTag
> > endpoint..?  Does the performance stay the same..?
> >
> > Also, it would be useful to confirm with a rd_mcp backend to determine
> > if it's something related to the fabric (eg: iscsi) or something related
> > to the backend itself.
> >
> I have 12 RAID5 arrays built from 4 SSDs (each SSD has 8 partitions). Only
> the first two of the key steps are shown here:
> tcm_node --block iblock_0/my_iblock0 /dev/md0
> tcm_node --block iblock_1/my_iblock1 /dev/md1
> ...
> lio_node --addlun iscsi-test0 1 0 lun_my_block iblock_0/my_iblock0
> lio_node --addlun iscsi-test1 1 0 lun_my_block iblock_1/my_iblock1
> ...
> lio_node --addnp iscsi-test0 1 172.16.0.1:3260
> lio_node --addnp iscsi-test1 1 172.16.0.1:3260
> ...
> lio_node --enabletpg iscsi-test0 1
> lio_node --enabletpg iscsi-test1 1
> ...

Ok, so you've got 12 individual TargetName+TargetPortalGroupTag
endpoints with a 1:1 LUN mapping, which translates to 12 individual
sessions.  More than enough TCP connections to go 10 Gb/sec line rate
for large block I/O.
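
If you want to double-check the layout on the target side, the endpoints
and their network portals show up directly in configfs (a quick sketch,
assuming the default configfs mount point and the target names from your
commands above):

   # one directory per TargetName, with tpgt_$TAG/np/ holding the portals
   ls /sys/kernel/config/target/iscsi/
   ls /sys/kernel/config/target/iscsi/iscsi-test0/tpgt_1/np/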

One extra thing you'll want to enable on each of the
iblock_$ID/my_iblock$ID backends is the emulate_write_cache=1 device
attribute.  Windows may behave differently when WriteCacheEnabled=True is
not being reported; see the sketch below.
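
A minimal sketch of setting that attribute directly via configfs, assuming
the default mount point and the iblock HBA/device names from your setup
above (adjust the loop bounds and names to match):

   # enable emulate_write_cache on each of the 12 iblock backends
   for i in $(seq 0 11); do
       echo 1 > /sys/kernel/config/target/core/iblock_${i}/my_iblock${i}/attrib/emulate_write_cache
   done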

> 
> After this, on the Windows machine I get 12 new disk drives, and
> format them as NTFS.
> 
> >> I created 12 targets (each with one LUN) for the 12 cores in this case;
> >> still, the performance for both READ and WRITE is about 1/3 of what I
> >> got with SCST in the past.
> >>
> >
> > Can you send along your rtsadmin/targetcli configuration output in order
> > to get an idea of the setup..?  Also, any other information about the
> > backend configuration + hardware would be useful as well.
> >
> > Also, can you give some specifics on the workload in question..?
> >
> the workload is generated by iometer. I created 64KB 100% sequential
> WRITE and 128KB 100% sequential READ workloads against all 12 iSCSI
> disks per worker, and then duplicated the workers to 4, 8, 12, for
> example. No matter what I try, the performance is roughly 1/3 of what
> I get with SCST using similar settings (12 RAID5 iSCSI + iometer).
> 
> For example, with SCST I can easily get wire speed (10Gbps) for READ;
> with LIO I can get at most 3.8Gbps.
> 
> For READ, core0 is 0% idle during the test, while the other 11 cores
> are about 80% idle each.
> For WRITE, all 12 cores are 10% idle.

It sounds like all hardware interrupts for the entire system are being
delivered on CPU0, likely because your system is not running irqbalance to
distribute the interrupt load across multiple CPUs.

What does your /proc/interrupts output look like..?
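
As a quick check (assuming your 10G NIC shows up under an interface name
like eth2 -- substitute whatever your system reports), something like this
shows which CPU each vector is firing on:

   # leftmost column is the IRQ number; the per-CPU columns count how many
   # interrupts have been delivered to each CPU
   grep eth2 /proc/interrupts
   # or watch the whole table update while a test is running
   watch -n1 cat /proc/interrupts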

So target_core_mod uses a bounded workqueue for its I/O completion, which
means that process context is provided on the same CPU on which the
hardware interrupt was taken, in order to benefit from cache locality
effects.  If all of the hardware interrupts for the entire system are
firing only on CPU0, then only the kworker threads on CPU0 are used to
provide process context for queuing responses to the fabric drivers.
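
A rough way to confirm that while a READ test is running (assuming the
sysstat package provides mpstat on your box) is to watch per-CPU
utilization alongside the kworker threads:

   # per-CPU utilization, refreshed every second
   mpstat -P ALL 1
   # PSR column shows which CPU each kworker thread last ran on
   ps -eo pid,psr,comm | grep kworker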

If irqbalance is not running / available on the system, you'll need to
manually set the IRQ affinity using:

   echo $CPU_ID > /proc/irq/$IRQ/smp_affinity_list

So what's important here is that the IRQ vectors for your network card +
storage HBAs are evenly distributed across the available CPUs in the
system.

Also note that these settings are not persistent across a reboot, so
you'll need to make sure they are explicitly set on each boot; see the
sketch below.
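
As a rough boot-time sketch (assuming the hypothetical eth2 interface name
again, and that its MSI-X vectors are listed by that name in
/proc/interrupts -- the exact naming varies by driver), something like
this could go into rc.local to spread the vectors round-robin across the
12 CPUs:

   # distribute all IRQ vectors belonging to eth2 across CPUs 0-11
   cpu=0
   for irq in $(grep eth2 /proc/interrupts | awk -F: '{print $1}'); do
       echo $cpu > /proc/irq/$irq/smp_affinity_list
       cpu=$(( (cpu + 1) % 12 ))
   done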

More details are available here:

https://www.kernel.org/doc/Documentation/IRQ-affinity.txt

> 
> Again, comparing to SCST: there the load is nearly evenly distributed
> across all cores for both READ and WRITE via iometer.
> 
> >> is LIO-iSCSI on 3.8.x 'best' for 10/100/1G networks only? Other than
> >> the DEFAULT_CMDSN_DEPTH definition, what else could I tune for 10G/40G
> >> iSCSI? Again, I am using the same scheduler, fifo_batch,
> >> stripe_cache_size, read_ahead_kb, etc. parameters as I used with SCST;
> >> the only major difference is LIO vs SCST itself.
> >
> > If you're on IB/RoCE/iWARP verbs-capable hardware, I'd very much recommend
> > checking out the iser-target that is included in >= v3.10 kernels.
> 
> I have to use 3.8.x for now, and am testing iSCSI/LIO at the moment
> before moving to FCoE soon.

Sure, but note you'll need to use targetcli/rtslib for configuring FCoE
target ports.
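
For what it's worth, a minimal targetcli sketch for bringing up an FCoE
target port might look like the following; the tcm_fc fabric path and the
example WWN are assumptions here and need to match the FCoE interface on
your box (e.g. as reported by 'fcoeadm -i'):

   # inside the targetcli shell (WWN below is a hypothetical example)
   /> cd tcm_fc
   /tcm_fc> create 20:00:00:11:22:33:44:55

LUNs and initiator ACLs are then created underneath that node, analogous
to the iscsi tree.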

--nab
