On Tue, Oct 8, 2013 at 2:45 PM, Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx> wrote:
> On Mon, 2013-10-07 at 17:32 -0500, Xianghua Xiao wrote:
>>
>
>
> <SNIP>
>
>>
>> sorry it took me a while to get back to this.
>>
>> The test is based on the default LIO-iSCSI from the 3.8.13 kernel, and
>> yes, it's a 64-bit PPC box which has 12 cores (24 when hyper-threading
>> is on).
>>
>
> Great, thanks for the requested information.
>
>> I used 12 LIO endpoints (built from 4 SSDs: each SSD has 12
>> partitions, and I made 12 RAID5 arrays, each array using 4 partitions,
>> one from each SSD). I will try 24 LIO endpoints once I get another
>> HBA card that can hook up another 4 SSDs.
>
> So to confirm, you've got 4 SSDs total, connected to an mpt2sas HBA,
> right..?
>
> Can you please give me the make/models of the 4 SSDs..?
>

The 4 SSDs are OCZ Vertor SATA III 256GB, running at PCIe 2.5 GT/s x4
lanes. We're using the MPT2SAS HBA, and the same configuration was used
for the SCST testing, which gave me 2x~3x the performance (READ is at
wire speed, i.e. 10G; WRITE is at least 700MB/s), so I tend to think the
HBA/SSDs are not the bottleneck.

> Are they SAS or SATA SSDs..? What link speed are they running at..?
>

SATA III SSDs.

>> I did the smp affinity as below:
>> echo 0 > /proc/irq/410/smp_affinity_list
>> echo 1 > /proc/irq/408/smp_affinity_list
>> echo 2 > /proc/irq/406/smp_affinity_list
>> echo 3 > /proc/irq/404/smp_affinity_list
>> echo 4 > /proc/irq/402/smp_affinity_list
>> echo 5 > /proc/irq/400/smp_affinity_list
>> echo 6 > /proc/irq/398/smp_affinity_list
>> echo 7 > /proc/irq/396/smp_affinity_list
>> echo 8 > /proc/irq/394/smp_affinity_list
>> echo 9 > /proc/irq/392/smp_affinity_list
>> echo 10 > /proc/irq/390/smp_affinity_list
>> echo 11 > /proc/irq/388/smp_affinity_list
>> echo 12 > /proc/irq/386/smp_affinity_list
>> echo 13 > /proc/irq/384/smp_affinity_list
>> echo 14 > /proc/irq/174/smp_affinity_list
>> echo 15 > /proc/irq/172/smp_affinity_list
>> echo 16 > /proc/irq/170/smp_affinity_list
>> echo 17 > /proc/irq/168/smp_affinity_list
>> echo 18 > /proc/irq/166/smp_affinity_list
>> echo 19 > /proc/irq/164/smp_affinity_list
>> echo 20 > /proc/irq/162/smp_affinity_list
>> echo 21 > /proc/irq/160/smp_affinity_list
>> echo 22 > /proc/irq/158/smp_affinity_list
>> echo 23 > /proc/irq/156/smp_affinity_list
>>
>> I run the iometer READ tests first, then run the iometer WRITE tests
>> after that. For the READ, cpu0 has 0% idle while the rest of the cores
>> are relatively idle; also, cpu0 got the fewest interrupts. It might be
>> that it's too busy to process interrupts, hence the very bad
>> performance (the rest of the cores are just watching).
>>
>
> So based upon your /proc/interrupts, there is a single MSI-X interrupt
> vector for the mpt2sas driver running on CPU0.
>
> Getting a 'perf top -t $PID' of the kworker/0 thread would be helpful to
> see exactly what's happening on CPU0.
>

I will try to get that (a rough sketch of the commands I plan to use is
at the bottom of this mail). It's interesting that WRITE does not show
similar behaviour, in that all cores are loaded.

>> I then run iometer WRITE immediately; all cores are equally busy now,
>> but still not heavily loaded, and all cores are getting very similar
>> interrupt increments caused by the WRITE.
>>
>> somehow it looks like WRITE is doing the "right" thing, but READ is
>> CPU0-bound.
>>
>> For either READ or WRITE, the performance (500MB/s for READ, 300MB/s
>> for WRITE) is about 1/2~1/3 of what I got with SCST, and for SCST I
>> did not need to do the smp_affinity_list settings.
>>
>
> Are you sure that you weren't running in writeback mode with your other
> tests..?
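
On the LIO side I will also double-check the write cache setting. As far
as I understand it is exposed per backstore device in configfs, roughly
like the following (just a sketch; "iblock_0/md_raid5_0" is a made-up
example path, the real HBA/device names come from my config):

  cat /sys/kernel/config/target/core/iblock_0/md_raid5_0/attrib/emulate_write_cache

where 0 should mean write-through and 1 write-back, if I read it right.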
>
> The reason I mention this is because having 12 different partitions on
> each of the same four SSDs, and running RAID5 across them all, means
> you're effectively doing lots of large block random I/O.
>
> Given that sequential READ performance for a single SATA 3 Gb/sec SSD is
> on the order of ~250 MB/sec, best case locally you would be seeing 1000
> MB/sec (250MB * 4) with purely sequential access. Cutting this up 12
> ways per SSD would likely have some effect on performance when operating
> in write-through mode.
>
> So all that said, I'd really like to get some fio large block sequential
> numbers for the 12 MD RAIDs locally in order to set a baseline for
> performance.
>

I'm certain I'm not using writeback mode with SCST; in fact, I don't know
how to set it up as writeback when I'm formatting all iSCSI disks as NTFS
under iometer/Windows (I only used iozone with ext4 mounted as writeback).
Plus, under iometer I'm always driving 2*RAM of disk space (i.e.
2*12GB = 24GB) to make sure we're writing to the disks instead of DDR.

>> four log files are attached.
>>
>> Thanks a lot for the help,
>>
>
> So here's one small patch that changes the completion workqueue from
> bounded to unbounded, so that work items queued from CPU0 in the
> mpt2sas interrupt handler can run on other CPUs. It essentially
> sacrifices cache locality (and hence efficiency) in order to run more
> completions in parallel on different CPUs.
>
> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index 81e945e..b950b92 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -131,7 +131,7 @@ int init_se_kmem_caches(void)
>  	}
>
>  	target_completion_wq = alloc_workqueue("target_completion",
> -					WQ_MEM_RECLAIM, 0);
> +					WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>  	if (!target_completion_wq)
>  		goto out_free_tg_pt_gp_mem_cache;
>
> In my testing, it ends up using more CPU for roughly the same (or less)
> small block random I/O performance. For large block I/O, it makes no
> noticeable difference.
>
> So all that said, what I'd still like to see from you before applying
> this patch is:
>
> - confirmation of writeback mode setting with other tests
> - perf top output for the kworker/0 thread
> - local fio performance for the 4x SSDs with 12 partitions + 12 MD
>   RAIDs, for a local performance baseline.
>
> Please send along this information at your earliest convenience.

Will work on this, thanks! (The fio and perf command lines I plan to use
are sketched below.)

>
> Thanks,
>
> --nab
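
P.S. For the local fio baseline and the kworker/0 profile, this is roughly
what I plan to run. It is only a sketch: /dev/md0 stands in for each of
the 12 MD devices, and the runtime/queue depth are guesses I can adjust.

  fio --name=seq-read --filename=/dev/md0 --direct=1 --rw=read --bs=1M \
      --ioengine=libaio --iodepth=32 --runtime=60 --time_based \
      --group_reporting

and the same with --rw=write for the sequential WRITE number (on a scratch
array only, since it overwrites the device). For the kworker/0 thread:

  ps -eLo tid,comm | grep 'kworker/0'
  perf top -t <tid from above>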