The IRQs are now balanced across all cores (verified via cat
/proc/interrupts), and the option is enabled via menuconfig. Still the
performance is the same. Setting emulate_write_cache=1 did not help
performance either. How does LIO iSCSI distribute work across threads on a
multi-core system?

Thanks,

On Thu, Oct 3, 2013 at 3:47 AM, Nicholas A. Bellinger
<nab@xxxxxxxxxxxxxxx> wrote:
> On Wed, 2013-10-02 at 22:59 -0500, Xianghua Xiao wrote:
>> First, I use lio-utils instead of targetcli, as this is an embedded box
>> that has very limited Python packages built in.
>>
>> On Wed, Oct 2, 2013 at 5:26 PM, Nicholas A. Bellinger
>> <nab@xxxxxxxxxxxxxxx> wrote:
>> > On Wed, 2013-10-02 at 14:07 -0500, Xianghua Xiao wrote:
>> >> After I changed default_cmdsn_depth to 64 and use iometer to do READ,
>> >> only core0 is busy; for WRITE, all cores (12 of them) are equally busy.
>> >>
>> >
>> > Have you been able to isolate the issue down to per-session
>> > performance..? What happens when the same MD RAID backend is accessed
>> > across multiple sessions via a different TargetName+TargetPortalGroupTag
>> > endpoint..? Does the performance stay the same..?
>> >
>> > Also, it would be useful to confirm with a rd_mcp backend to determine
>> > whether it's something related to the fabric (e.g. iscsi) or something
>> > related to the backend itself.
>> >
>> I have 12 RAID5 arrays built from 4 SSDs (each SSD has 8 partitions).
>> Only the first two of the key steps are shown here:
>> tcm_node --block iblock_0/my_iblock0 /dev/md0
>> tcm_node --block iblock_1/my_iblock1 /dev/md1
>> ...
>> lio_node --addlun iscsi-test0 1 0 lun_my_block iblock_0/my_iblock0
>> lio_node --addlun iscsi-test1 1 0 lun_my_block iblock_1/my_iblock1
>> ...
>> lio_node --addnp iscsi-test0 1 172.16.0.1:3260
>> lio_node --addnp iscsi-test1 1 172.16.0.1:3260
>> ...
>> lio_node --enabletpg iscsi-test0 1
>> lio_node --enabletpg iscsi-test1 1
>> ...
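[The two-per-step excerpt above repeats for all 12 RAID5 backends, so it can
be scripted. A minimal sketch using only the lio-utils commands already shown
in the thread; the md device, iblock, TargetName, and portal names are the
thread's own examples. The function just prints the commands (a dry run);
pipe the output to sh on the target to actually apply it.]

```shell
#!/bin/sh
# Print the lio-utils commands for one backend/target pair, numbered $1.
# Names follow the examples above: /dev/md$i -> iblock_$i/my_iblock$i,
# exported as TargetName iscsi-test$i, TPG 1, LUN 0, portal 172.16.0.1:3260.
setup_target() {
    i=$1
    echo "tcm_node --block iblock_$i/my_iblock$i /dev/md$i"
    echo "lio_node --addlun iscsi-test$i 1 0 lun_my_block iblock_$i/my_iblock$i"
    echo "lio_node --addnp iscsi-test$i 1 172.16.0.1:3260"
    echo "lio_node --enabletpg iscsi-test$i 1"
}

# All 12 targets (uncomment to apply on the target box):
# for i in $(seq 0 11); do setup_target $i; done | sh
```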
>
> Ok, so you've got 12 individual TargetName+TargetPortalGroupTag
> endpoints with a 1:1 LUN mapping, which translates to 12 individual
> sessions. That is more than enough TCP connections to reach 10 Gb/sec
> line rate for large-block I/O.
>
> One extra thing you'll want to enable on each of the
> iblock_$ID/my_iblock$ID backends is the emulate_write_cache=1 device
> attribute. Windows may act differently when WriteCacheEnabled=True is
> not being set.
>
>>
>> After this, on the Windows machine I get 12 new disk drives, and
>> format them as NTFS.
>>
>> >> I created 12 targets (each with one LUN) for the 12 cores in this
>> >> case; still, the performance for both READ and WRITE is about 1/3 of
>> >> what I got with SCST in the past.
>> >>
>> >
>> > Can you send along your rtsadmin/targetcli configuration output in
>> > order to get an idea of the setup..? Also, any other information about
>> > the backend configuration + hardware would be useful as well.
>> >
>> > Also, can you give some specifics on the workload in question..?
>> >
>> The workload is generated by iometer. I created 64KB 100% sequential
>> WRITE and 128KB 100% sequential READ workloads against all 12 iSCSI
>> disks per worker, then duplicated the workers to 4, 8, and 12, for
>> example. No matter what I try, the performance is roughly 1/3 of SCST
>> with similar settings (12 RAID5 iSCSI + iometer).
>>
>> For example, with SCST I can easily get wire speed (10Gbps) for READ;
>> with LIO I get at most 3.8Gbps.
>>
>> For READ, core0 is 0% idle during the test, while the other 11 cores
>> are about 80% idle each.
>> For WRITE, all 12 cores are 10% idle.
>
> It sounds like all hardware interrupts for the entire system are being
> delivered on CPU0, because your system is not running irqbalance to
> distribute the interrupt load across multiple CPUs.
>
> What does your /proc/interrupts output look like..?
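[Since lio-utils rather than targetcli is in use here, the
emulate_write_cache attribute can also be flipped directly through configfs.
A sketch assuming the standard target_core_mod configfs layout under
/sys/kernel/config; the helper only builds the attribute path, and the
apply-loop is commented out so nothing is written unless you run it on the
target.]

```shell
#!/bin/sh
# Build the configfs path of the emulate_write_cache attribute for one
# backend device (assumed standard target_core_mod layout).
wcache_attr() {
    # $1 = HBA name (e.g. iblock_0), $2 = device name (e.g. my_iblock0)
    echo "/sys/kernel/config/target/core/$1/$2/attrib/emulate_write_cache"
}

# Enable write-cache emulation on all 12 backends (uncomment to apply):
# for i in $(seq 0 11); do
#     echo 1 > "$(wcache_attr iblock_$i my_iblock$i)"
# done
```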
>
> So target_core_mod uses a bound workqueue for its I/O completion,
> which means that process context is provided on the same CPU on which
> the hardware interrupt was generated, in order to benefit from cache
> locality effects. If all of the hardware interrupts for the entire
> system are firing only on CPU0, then only kworker/0 is used to provide
> process context for queuing the response to the fabric drivers.
>
> If irqbalance is not running / available on the system, you'll need to
> manually set the IRQ affinity using:
>
>   echo $CPU_ID > /proc/irq/$IRQ/smp_affinity_list
>
> What's important here is that the IRQ vectors for your network card +
> storage HBAs are evenly distributed across the available CPUs in the
> system.
>
> Also note that these settings are not persistent across restart, so
> you'll need to make sure they are explicitly set on each boot.
>
> More details are available here:
>
> https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
>
>> Again, comparing to SCST, all cores are almost always evenly loaded
>> during both READ and WRITE via iometer.
>>
>> >> Is LIO iSCSI on 3.8.x 'best' for 10/100/1G networks only? Other
>> >> than the DEFAULT_CMDSN_DEPTH definition, what else can I tune for
>> >> 10G/40G iSCSI? Again, I am using the same scheduler, fifo_batch,
>> >> stripe_cache_size, read_ahead_kb, etc. parameters as I used with
>> >> SCST; the only major difference is LIO vs SCST itself.
>> >
>> > If you're on IB/RoCE/iWARP verbs-capable hardware, I'd very much
>> > recommend checking out the iser-target that is included in >= v3.10
>> > kernels.
>>
>> I have to use 3.8.x for now, and am testing iSCSI/LIO at the moment,
>> before moving to FCoE soon.
>
> Sure, but note you'll need to use targetcli/rtslib for configuring FCoE
> target ports.
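[The manual smp_affinity_list advice above can be scripted for the per-boot
case. A minimal sketch that spreads a list of IRQ numbers round-robin across
N CPUs; the IRQ numbers here (64-67) are placeholders you would read from
/proc/interrupts for your NIC and HBA. The function only prints "IRQ CPU"
pairs; the loop that actually writes smp_affinity_list is commented out.]

```shell
#!/bin/sh
# Print "IRQ CPU" pairs, assigning CPUs 0..ncpus-1 round-robin.
pin_irqs() {
    ncpus=$1
    shift
    i=0
    for irq in "$@"; do
        echo "$irq $((i % ncpus))"
        i=$((i + 1))
    done
}

# Apply on each boot (settings are not persistent), e.g. for IRQ vectors
# 64-67 on a 12-core box:
# pin_irqs 12 64 65 66 67 | while read irq cpu; do
#     echo "$cpu" > "/proc/irq/$irq/smp_affinity_list"
# done
```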
>
> --nab