Re: lio target iscsi multiple core performance


 



On Thu, Oct 3, 2013 at 6:08 PM, Nicholas A. Bellinger
<nab@xxxxxxxxxxxxxxx> wrote:
> On Thu, 2013-10-03 at 16:14 -0500, Xianghua Xiao wrote:
>> On Thu, Oct 3, 2013 at 2:18 PM, Nicholas A. Bellinger
>> <nab@xxxxxxxxxxxxxxx> wrote:
>> > On Thu, 2013-10-03 at 09:16 -0500, Xianghua Xiao wrote:
>> >> The IRQ is balanced to all cores(cat /proc/interrupts), the option is
>> >> turned on via menuconfig. Still the performance is the same.
>> >>
>> >
>> > Please don't top-post.  It makes it annoying to respond to what's
>> > already been said in the thread.
>> >
>> > FYI, there is no kernel option to balance IRQs automatically across
>> > CPUs, it's done via userspace using irqbalanced, or via explicit
>> > settings in /proc/irq/$IRQ/smp_affinity_list.
>> I checked /proc/interrupts and verified all cores are getting interrupts.
>> also CONFIG_IRQ_ALL_CPUS=y
>> >
>
> OK, so CONFIG_IRQ_ALL_CPUS=y means you're running on PPC then.
>
> FYI, I see the following bugfix for this logic that does not appear to
> be included in v3.8.x code:
>
> powerpc/mpic: Fix irq distribution problem when MPIC_SINGLE_DEST_CPU
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e242114afff0a41550e174cd787cdbafd34625de
>
> Not sure if this applies to your setup or not..
>
> Looking at the code in arch/powerpc/sysdev/mpic.c:mpic_setup_this_cpu():
>
>         /* let the mpic know we want intrs. default affinity is 0xffffffff
>          * until changed via /proc. That's how it's done on x86. If we want
>          * it differently, then we should make sure we also change the default
>          * values of irq_desc[].affinity in irq.c.
>          */
>         if (distribute_irqs && !(mpic->flags & MPIC_SINGLE_DEST_CPU)) {
>                 for (i = 0; i < mpic->num_sources ; i++)
>                         mpic_irq_write(i, MPIC_INFO(IRQ_DESTINATION),
>                                 mpic_irq_read(i, MPIC_INFO(IRQ_DESTINATION)) | msk);
>         }
>
> seems to indicate the default affinity gets set to 0xffffffff, which is
> also not what you want for best results.
>
> You should consider explicitly setting the IRQ affinity of your NIC +
> storage HBAs to individual CPUs, instead of hardware interrupts randomly
> bouncing across all CPUs on the system.
>
> If you give me the /proc/interrupts output, I'll happily give an example
> of how this should look.
>
>> > So, I'd still like to see your /proc/interrupts output in order to
>> > determine the distribution.
>> >
>> > Some top and perf top output would be useful as well to see what
>> > processes and functions are running.
>> >
>
> Again, please provide both the /proc/interrupts + top output, and
> preferably perf top output as well.
>
> It's very helpful to get this output in order to get an idea of what's
> actually going on.
>
>> >> The emulate_write_cache=1 did not help performance either.
>> >>
>> >> How does LIO/iscsi handle multi-thread on multi-core system?
>> >>
>> >
>> > As explained below:
>> >
>> > So target_core_mod uses a bounded workqueue for its I/O completion,
>> > which means that process context is provided on the same CPU for which
>> > the hardware interrupt was generated in order to benefit from cache
>> > locality effects.  If all of the hardware interrupts for the entire
>> > system are firing only on CPU0, then only kworkerd/0 is used to provide
>> > process context for queuing the response to the fabric drivers.
>> >
>> > Also, I can confirm with v3.11 code that iscsi-target is running dual
>> > port ixgbe line rate (~20 Gb/sec) with large block reads/writes to PCIe
>> > flash, and to ramdisk_mcp backends.
>> >
>> > So that said, I'll need more information about your setup to determine
>> > what's going on.
>>
>> For iSCSI, all cpus are equally busy(verified by 'top') and all cores
>> are getting the same number of interrupts.
>
> I'm confused now.  You said earlier that on large block READs, that CPU0
> was at 100% CPU usage, right..?  What has changed..?
>
> Please send along the information requested in order to see what's going
> on, instead of making me guess over and over again.
>
>>
>> Sigh, I have to stick with 3.8.x kernel for now, this is a non-x86 box
>> so it's hard to upgrade the kernel due to various dependencies.
>>
>
> There is nothing I'm aware of between the v3.8.x and v3.11.x code that
> would affect large block performance.
>
> --nab
>

Sorry it took me a while to get back to this.

The test is based on the default LIO iSCSI target from the 3.8.13
kernel, and yes, it's a 64-bit PPC box with 12 cores (24 when
hyper-threading is on).

I used 12 LIO endpoints, built from 4 SSDs: each SSD has 12
partitions, and I made 12 RAID5 arrays, each using 4 partitions (one
from each SSD). I will try 24 LIO endpoints once I get another HBA
card that can hook up another 4 SSDs.
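
In case it helps picture the layout, each array was created roughly
like the sketch below; the device names and md numbers here are just
placeholders, not my exact ones:

# hypothetical example: one RAID5 array built from one partition on
# each of the 4 SSDs; repeated for md1..md11 with the next partitions
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1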

I set the SMP affinity as below:
echo 0 > /proc/irq/410/smp_affinity_list
echo 1 > /proc/irq/408/smp_affinity_list
echo 2 > /proc/irq/406/smp_affinity_list
echo 3 > /proc/irq/404/smp_affinity_list
echo 4 > /proc/irq/402/smp_affinity_list
echo 5 > /proc/irq/400/smp_affinity_list
echo 6 > /proc/irq/398/smp_affinity_list
echo 7 > /proc/irq/396/smp_affinity_list
echo 8 > /proc/irq/394/smp_affinity_list
echo 9 > /proc/irq/392/smp_affinity_list
echo 10 > /proc/irq/390/smp_affinity_list
echo 11 > /proc/irq/388/smp_affinity_list
echo 12 > /proc/irq/386/smp_affinity_list
echo 13 > /proc/irq/384/smp_affinity_list
echo 14 > /proc/irq/174/smp_affinity_list
echo 15 > /proc/irq/172/smp_affinity_list
echo 16 > /proc/irq/170/smp_affinity_list
echo 17 > /proc/irq/168/smp_affinity_list
echo 18 > /proc/irq/166/smp_affinity_list
echo 19 > /proc/irq/164/smp_affinity_list
echo 20 > /proc/irq/162/smp_affinity_list
echo 21 > /proc/irq/160/smp_affinity_list
echo 22 > /proc/irq/158/smp_affinity_list
echo 23 > /proc/irq/156/smp_affinity_list
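
For what it's worth, the pins can be read back afterwards to confirm
they stuck; a quick sketch over the same IRQ numbers as above:

# read back the affinity of each pinned IRQ to confirm the setting
for irq in 410 408 406 404 402 400 398 396 394 392 390 388 386 384 \
           174 172 170 168 166 164 162 160 158 156; do
    echo -n "irq $irq -> cpu "
    cat /proc/irq/$irq/smp_affinity_list
done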

I ran the iometer READ tests first and the iometer WRITE tests right
after. During READ, cpu0 sits at 0% idle while the other cores are
mostly idle; cpu0 also receives the fewest interrupts, probably
because it is too busy to process them, hence the very bad
performance (the other cores are just watching).

I then ran iometer WRITE immediately afterwards; all cores are
equally busy now, though still not heavily loaded, and all cores show
very similar interrupt increments from the WRITE.
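
The per-CPU interrupt increments can be captured roughly along these
lines (a sketch; the 60-second window is arbitrary and only
approximates the iometer run):

# snapshot the interrupt counters before and after a run, then diff
cat /proc/interrupts > /tmp/irq.before
sleep 60
cat /proc/interrupts > /tmp/irq.after
diff /tmp/irq.before /tmp/irq.after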

Somehow it looks like WRITE is doing the "right" thing, but READ is CPU0-bound.

For both READ and WRITE, the throughput (500 MB/s for READ, 300 MB/s
for WRITE) is about 1/2 to 1/3 of what I got with SCST, and with SCST
I did not need the smp_affinity_list settings at all.

Four log files are attached.

Thanks a lot for the help,

xiao

Attachment: iometer-read-interrupts.log
Description: Binary data

Attachment: iometer-read-mpstat.log
Description: Binary data

Attachment: iometer-write-interrupts.log
Description: Binary data

Attachment: iometer-write-mpstat.log
Description: Binary data

