Re: lio target iscsi multiple core performance

On Tue, Oct 8, 2013 at 7:47 PM, Xianghua Xiao <xiaoxianghua@xxxxxxxxx> wrote:
> On Tue, Oct 8, 2013 at 2:45 PM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
>> On Mon, 2013-10-07 at 17:32 -0500, Xianghua Xiao wrote:
>>>
>>
>> <SNIP>
>>
>>>
>>> Sorry it took me a while to get back to this.
>>>
>>> The test is based on the default LIO-iSCSI from the 3.8.13 kernel,
>>> and yes, it's a 64-bit PPC box with 12 cores (24 when
>>> hyper-threading is on).
>>>
>>
>> Great, thanks for the requested information.
>>
>>> I used 12 LIO endpoints (built from 4 SSDs; each SSD has 12
>>> partitions, and I made 12 RAID5 arrays, each array using 4
>>> partitions, one from each SSD). I will try 24 LIO endpoints once I
>>> get another HBA card that can hook up another 4 SSDs.
>>
>> So to confirm, you've got 4 SSDs total, connected to an mpt2sas HBA,
>> right..?
>>
>> Can you please give me the make/models of the 4 SSDs..?
>>
>
> The 4 SSDs are OCZ Vertex SATA III 256GB drives, with the HBA running
> at PCIe 2.5 GT/s x4. We're using an mpt2sas HBA, and the same
> configuration was used for the SCST testing, which gave me 2x-3x the
> performance (READ at wire speed, i.e. 10G; WRITE at least 700MB/s),
> so I don't think the HBA/SSDs are the bottleneck.
>
>> Are they SAS or SATA SSDs..?  What link speed are they running at..?
>>
> SATA III SSDs.
>
>>> I did the smp affinity as below:
>>> echo 0 > /proc/irq/410/smp_affinity_list
>>> echo 1 > /proc/irq/408/smp_affinity_list
>>> echo 2 > /proc/irq/406/smp_affinity_list
>>> echo 3 > /proc/irq/404/smp_affinity_list
>>> echo 4 > /proc/irq/402/smp_affinity_list
>>> echo 5 > /proc/irq/400/smp_affinity_list
>>> echo 6 > /proc/irq/398/smp_affinity_list
>>> echo 7 > /proc/irq/396/smp_affinity_list
>>> echo 8 > /proc/irq/394/smp_affinity_list
>>> echo 9 > /proc/irq/392/smp_affinity_list
>>> echo 10 > /proc/irq/390/smp_affinity_list
>>> echo 11 > /proc/irq/388/smp_affinity_list
>>> echo 12 > /proc/irq/386/smp_affinity_list
>>> echo 13 > /proc/irq/384/smp_affinity_list
>>> echo 14 > /proc/irq/174/smp_affinity_list
>>> echo 15 > /proc/irq/172/smp_affinity_list
>>> echo 16 > /proc/irq/170/smp_affinity_list
>>> echo 17 > /proc/irq/168/smp_affinity_list
>>> echo 18 > /proc/irq/166/smp_affinity_list
>>> echo 19 > /proc/irq/164/smp_affinity_list
>>> echo 20 > /proc/irq/162/smp_affinity_list
>>> echo 21 > /proc/irq/160/smp_affinity_list
>>> echo 22 > /proc/irq/158/smp_affinity_list
>>> echo 23 > /proc/irq/156/smp_affinity_list
>>>
>>> I run the iometer READ tests first, then the iometer WRITE tests
>>> after that. During the READ, cpu0 is at 0% idle while the remaining
>>> cores are relatively idle; cpu0 also got the fewest interrupts. It
>>> may be too busy to process interrupts, hence the very bad
>>> performance (the remaining cores are just watching).
>>>
>>
>> So based upon your /proc/interrupts, there is a single MSI-X interrupt
>> vector for the mpt2sas driver running on CPU0.
>>
>> Getting a 'perf top -t $PID' of the kworker/0 thread would be helpful to
>> see exactly what's happening on CPU0.
>>
>
> I will try to get that. It's interesting that WRITE does not show
> similar behaviour: with WRITE, all cores are loaded.
>
>>> I then ran the iometer WRITE test immediately afterwards; all cores
>>> are equally busy now, though not heavily loaded, and all cores show
>>> very similar interrupt increments caused by the WRITE.
>>>
>>> Somehow it looks like WRITE is doing the "right" thing, while READ
>>> is CPU0-bound.
>>>
>>> For either READ or WRITE, the performance (500MB/s for READ,
>>> 300MB/s for WRITE) is about 1/2 to 1/3 of what I got with SCST, and
>>> for SCST I did not need the smp_affinity_list settings.
>>>
>>
>> Are you sure that you weren't running in writeback mode with your other
>> tests..?
>>
>> The reason I mention this is that having 12 different partitions on
>> each of the same four SSDs, and running RAID5 across them all, means
>> you're effectively doing lots of large block random I/O.
>>
>> Given that sequential READ performance for a single SATA 3 Gb/sec SSD is
>> on the order of ~250 MB/sec, best case locally you would be seeing 1000
>> MB/sec (250MB * 4) with purely sequential access.  Cutting this up 12
>> ways per SSD would likely have some effect on performance when operating
>> in write-through mode.
>>
>> So all that said, I'd really like to get some fio large block sequential
>> numbers for the 12 MD RAIDs locally in order to set a baseline for
>> performance.
>>
>
> I'm certain I was not using writeback mode with SCST; in fact, I
> don't know how I would set up writeback when I'm formatting all the
> iSCSI disks as NTFS under iometer/Windows (I've only used iozone with
> ext4 mounted as writeback). Plus, under iometer I always drive 2*RAM
> of disk space (i.e. 2*12GB = 24GB) to make sure we're writing to the
> disks instead of DDR.
>
>>> four log files are attached.
>>>
>>> Thanks a lot for the help,
>>>
>>
>> So here's one small patch that changes the completion workqueue from
>> bound to unbound, so that work items queued from CPU0 in the mpt2sas
>> interrupt handler can run on other CPUs.  It essentially sacrifices
>> cache locality (and hence efficiency) in order to run more
>> completions in parallel on different CPUs.
>>
>> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
>> index 81e945e..b950b92 100644
>> --- a/drivers/target/target_core_transport.c
>> +++ b/drivers/target/target_core_transport.c
>> @@ -131,7 +131,7 @@ int init_se_kmem_caches(void)
>>         }
>>
>>         target_completion_wq = alloc_workqueue("target_completion",
>> -                                              WQ_MEM_RECLAIM, 0);
>> +                                              WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>>         if (!target_completion_wq)
>>                 goto out_free_tg_pt_gp_mem_cache;
>>
>>
>> In my testing, it ends up using more CPU for roughly the same (or less)
>> small block random I/O performance.  For large block I/O, it makes no
>> noticeable difference.
>>
>> So all that said, what I'd still like to see from you before
>> applying this patch is:
>>
>>    - confirmation of writeback mode setting with other tests
>>    - perf top output for the kworker/0 thread
>>    - local fio performance for the 4x SSDs with 12 partitions + 12 MD
>>      RAIDs, for local performance baseline.
>>
>> Please send along this information at your earliest convenience.
>
> Will work on this.
>
> Thanks!
>
>>
>> Thanks,
>>
>> --nab
>>
I don't have everything yet, but I did apply your patch above, and I
also installed another HBA with 4 more SSDs, so I am now running 24
endpoints, in the hope that I can bind one IRQ per endpoint to each of
the 24 virtual cores (12 cores * 2 via hyperthreading).
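
Something like the following loop should do the same pinning as the
echo list in my earlier mail (the IRQ numbers are from this box and
are only an example):

cpu=0
for irq in 410 408 406 404 402 400 398 396 394 392 390 388 \
           386 384 174 172 170 168 166 164 162 160 158 156; do
    # pin each mpt2sas/NIC vector to its own core, in order
    echo $cpu > /proc/irq/$irq/smp_affinity_list
    cpu=$((cpu + 1))
done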

perf top -t 4 (4 is the process ID of kworker/0) gave me blank output
for some reason. I turned on:
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_EVENT_TRACING=y
CONFIG_TRACING=y
CONFIG_BLK_DEV_IO_TRACE=y
I'm not sure whether I need to turn on more profiling options in
menuconfig; I have not used perf on PPC before.
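
If I can get perf working, I plan to try something along these lines
to look at CPU0 (the thread ID is a placeholder; I also still need to
confirm CONFIG_PERF_EVENTS is enabled in my config):

# find the tid of the kworker/0 thread, then profile that thread
ps -eLo tid,comm | grep 'kworker/0'
perf top -t <tid>
# or profile everything running on CPU0 instead
perf top -C 0
perf record -C 0 -g -- sleep 10
perf report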

With 2 HBAs / 8 SSDs / 24 endpoints via iscsi/iblock, and the 24
interrupts pinned to 24 cores, READ performance is unchanged
(500MB/s), while WRITE has increased from 300MB/s to 415MB/s. Again,
your patch is applied.

I noticed that for READ, CPU0 is still at 0% idle while the other 23
cores are nearly 100% idle. For WRITE, all cores are busy, at
something like 5% idle. I'm running the deadline IOSCHED for these
tests, as that's the one I used with SCST.
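
For reference, deadline gets set on the underlying SCSI disks, since
the MD devices themselves have no elevator; roughly this, with the
device names assumed:

# set deadline on every member disk, then verify on one of them
for sched in /sys/block/sd*/queue/scheduler; do
    echo deadline > $sched
done
cat /sys/block/sda/queue/scheduler    # should show: noop [deadline] cfq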

Will run fio tests later on.
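
For the local baseline, I'm thinking of a large block sequential run
against each MD device, something like the following (the parameters
are a first guess):

fio --name=seqread --filename=/dev/md0 --direct=1 --rw=read --bs=1M \
    --ioengine=libaio --iodepth=32 --runtime=60 --group_reporting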

Will FILEIO provide better performance compared to IBLOCK, assuming
FILEIO can leverage filesystem caching? With SCST, FILEIO gives
better performance over iSCSI; I will test that later.
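
If I do try FILEIO, my understanding is that buffered (write-back)
behaviour is chosen when the backstore is created, e.g. with targetcli
(treat this as a sketch; the exact option name may differ between
targetcli versions):

targetcli /backstores/fileio create disk01 /mnt/ssd/disk01.img 20G write_back=true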

Thanks for the help

Xiao



