The IRQs are now balanced across all cores (verified via cat
/proc/interrupts), and the option is enabled via menuconfig. Still the
performance is the same. Setting emulate_write_cache=1 did not help
performance either. How does LIO iSCSI distribute work across threads on a
multi-core system?

Thanks,

On Thu, Oct 3, 2013 at 3:47 AM, Nicholas A. Bellinger
<nab@xxxxxxxxxxxxxxx> wrote:
> On Wed, 2013-10-02 at 22:59 -0500, Xianghua Xiao wrote:
>> First, I use lio-utils instead of targetcli, as this is an embedded box
>> that has very limited Python packages built in.
>>
>> On Wed, Oct 2, 2013 at 5:26 PM, Nicholas A. Bellinger
>> <nab@xxxxxxxxxxxxxxx> wrote:
>> > On Wed, 2013-10-02 at 14:07 -0500, Xianghua Xiao wrote:
>> >> After I changed default_cmdsn_depth to 64 and use iometer to do READ,
>> >> only core0 is busy; for WRITE, all cores (12 of them) are equally busy.
>> >>
>> >
>> > Have you been able to isolate the issue down to per-session
>> > performance..? What happens when the same MD RAID backend is accessed
>> > across multiple sessions via a different TargetName+TargetPortalGroupTag
>> > endpoint..? Does the performance stay the same..?
>> >
>> > Also, it would be useful to confirm with a rd_mcp backend to determine
>> > whether it's something related to the fabric (e.g. iscsi) or something
>> > related to the backend itself.
>> >
>> I have 12 RAID5 arrays built from 4 SSDs (each SSD has 8 partitions).
>> Only the first two of the key steps are shown here:
>> tcm_node --block iblock_0/my_iblock0 /dev/md0
>> tcm_node --block iblock_1/my_iblock1 /dev/md1
>> ...
>> lio_node --addlun iscsi-test0 1 0 lun_my_block iblock_0/my_iblock0
>> lio_node --addlun iscsi-test1 1 0 lun_my_block iblock_1/my_iblock1
>> ...
>> lio_node --addnp iscsi-test0 1 172.16.0.1:3260
>> lio_node --addnp iscsi-test1 1 172.16.0.1:3260
>> ...
>> lio_node --enabletpg iscsi-test0 1
>> lio_node --enabletpg iscsi-test1 1
>> ...
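[The two-per-step excerpt above repeats for all 12 RAID5 backends, so it can
be scripted. A minimal sketch using only the lio-utils commands already shown
in the thread; the md device, iblock, TargetName, and portal names are the
thread's own examples. The function just prints the commands (a dry run);
pipe the output to sh on the target to actually apply it.]

```shell
#!/bin/sh
# Print the lio-utils commands for one backend/target pair, numbered $1.
# Names follow the examples above: /dev/md$i -> iblock_$i/my_iblock$i,
# exported as TargetName iscsi-test$i, TPG 1, LUN 0, portal 172.16.0.1:3260.
setup_target() {
    i=$1
    echo "tcm_node --block iblock_$i/my_iblock$i /dev/md$i"
    echo "lio_node --addlun iscsi-test$i 1 0 lun_my_block iblock_$i/my_iblock$i"
    echo "lio_node --addnp iscsi-test$i 1 172.16.0.1:3260"
    echo "lio_node --enabletpg iscsi-test$i 1"
}

# All 12 targets (uncomment to apply on the target box):
# for i in $(seq 0 11); do setup_target $i; done | sh
```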
>
> Ok, so you've got 12 individual TargetName+TargetPortalGroupTag
> endpoints with a 1:1 LUN mapping, which translates to 12 individual
> sessions. That is more than enough TCP connections to reach 10 Gb/sec
> line rate for large-block I/O.
>
> One extra thing you'll want to enable on each of the
> iblock_$ID/my_iblock$ID backends is the emulate_write_cache=1 device
> attribute. Windows may act differently when WriteCacheEnabled=True is
> not being set.
>
>>
>> After this, on the Windows machine I get 12 new disk drives, and
>> format them as NTFS.
>>
>> >> I created 12 targets (each with one LUN) for the 12 cores in this
>> >> case; still, the performance for both READ and WRITE is about 1/3 of
>> >> what I got with SCST in the past.
>> >>
>> >
>> > Can you send along your rtsadmin/targetcli configuration output in
>> > order to get an idea of the setup..? Also, any other information about
>> > the backend configuration + hardware would be useful as well.
>> >
>> > Also, can you give some specifics on the workload in question..?
>> >
>> The workload is generated by iometer. I created 64KB 100% sequential
>> WRITE and 128KB 100% sequential READ workloads against all 12 iSCSI
>> disks per worker, then duplicated the workers to 4, 8, and 12, for
>> example. No matter what I try, the performance is roughly 1/3 of SCST
>> with similar settings (12 RAID5 iSCSI + iometer).
>>
>> For example, with SCST I can easily get wire speed (10Gbps) for READ;
>> with LIO I get at most 3.8Gbps.
>>
>> For READ, core0 is 0% idle during the test, while the other 11 cores
>> are about 80% idle each.
>> For WRITE, all 12 cores are 10% idle.
>
> It sounds like all hardware interrupts for the entire system are being
> delivered on CPU0, because your system is not running irqbalance to
> distribute the interrupt load across multiple CPUs.
>
> What does your /proc/interrupts output look like..?
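[Since lio-utils rather than targetcli is in use here, the
emulate_write_cache attribute can also be flipped directly through configfs.
A sketch assuming the standard target_core_mod configfs layout under
/sys/kernel/config; the helper only builds the attribute path, and the
apply-loop is commented out so nothing is written unless you run it on the
target.]

```shell
#!/bin/sh
# Build the configfs path of the emulate_write_cache attribute for one
# backend device (assumed standard target_core_mod layout).
wcache_attr() {
    # $1 = HBA name (e.g. iblock_0), $2 = device name (e.g. my_iblock0)
    echo "/sys/kernel/config/target/core/$1/$2/attrib/emulate_write_cache"
}

# Enable write-cache emulation on all 12 backends (uncomment to apply):
# for i in $(seq 0 11); do
#     echo 1 > "$(wcache_attr iblock_$i my_iblock$i)"
# done
```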
>
> So target_core_mod uses a bound workqueue for its I/O completion,
> which means that process context is provided on the same CPU on which
> the hardware interrupt was generated, in order to benefit from cache
> locality effects. If all of the hardware interrupts for the entire
> system are firing only on CPU0, then only kworker/0 is used to provide
> process context for queuing the response to the fabric drivers.
>
> If irqbalance is not running / available on the system, you'll need to
> manually set the IRQ affinity using:
>
>   echo $CPU_ID > /proc/irq/$IRQ/smp_affinity_list
>
> What's important here is that the IRQ vectors for your network card +
> storage HBAs are evenly distributed across the available CPUs in the
> system.
>
> Also note that these settings are not persistent across restart, so
> you'll need to make sure they are explicitly set on each boot.
>
> More details are available here:
>
> https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
>
>> Again, comparing to SCST, all cores are almost always evenly loaded
>> during both READ and WRITE via iometer.
>>
>> >> Is LIO iSCSI on 3.8.x 'best' for 10/100/1G networks only? Other
>> >> than the DEFAULT_CMDSN_DEPTH definition, what else can I tune for
>> >> 10G/40G iSCSI? Again, I am using the same scheduler, fifo_batch,
>> >> stripe_cache_size, read_ahead_kb, etc. parameters as I used with
>> >> SCST; the only major difference is LIO vs SCST itself.
>> >
>> > If you're on IB/RoCE/iWARP verbs-capable hardware, I'd very much
>> > recommend checking out the iser-target that is included in >= v3.10
>> > kernels.
>>
>> I have to use 3.8.x for now, and am testing iSCSI/LIO at the moment,
>> before moving to FCoE soon.
>
> Sure, but note you'll need to use targetcli/rtslib for configuring FCoE
> target ports.
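[The manual smp_affinity_list advice above can be scripted for the per-boot
case. A minimal sketch that spreads a list of IRQ numbers round-robin across
N CPUs; the IRQ numbers here (64-67) are placeholders you would read from
/proc/interrupts for your NIC and HBA. The function only prints "IRQ CPU"
pairs; the loop that actually writes smp_affinity_list is commented out.]

```shell
#!/bin/sh
# Print "IRQ CPU" pairs, assigning CPUs 0..ncpus-1 round-robin.
pin_irqs() {
    ncpus=$1
    shift
    i=0
    for irq in "$@"; do
        echo "$irq $((i % ncpus))"
        i=$((i + 1))
    done
}

# Apply on each boot (settings are not persistent), e.g. for IRQ vectors
# 64-67 on a 12-core box:
# pin_irqs 12 64 65 66 67 | while read irq cpu; do
#     echo "$cpu" > "/proc/irq/$irq/smp_affinity_list"
# done
```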
>
> --nab