Re: lio target iscsi multiple core performance


 



On Mon, 2013-10-07 at 17:32 -0500, Xianghua Xiao wrote:
> 

<SNIP>

> 
> sorry it took me a while to get back to this.
> 
> The test is based on the default LIO-iSCSI from the 3.8.13 kernel, and
> yes it's a 64-bit PPC box with 12 cores (24 when hyper-threading is
> on).
> 

Great, thanks for the requested information.

> I used 12 LIO endpoints (built from 4 SSDs; each SSD has 12
> partitions, and I made 12 RAID5 arrays, each array using 4
> partitions, one from each SSD). I will try 24 LIO endpoints once I
> get another HBA card that can hook up another 4 SSDs.

So to confirm, you've got 4 SSDs total, connected to an mpt2sas HBA,
right..?

Can you please give me the make/models of the 4 SSDs..?

Are they SAS or SATA SSDs..?  What link speed are they running at..?
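If it's easier to just grab command output, something like the following
usually has what I'm after (assuming smartctl and lsscsi are installed;
/dev/sdb is only an example, substitute your SSD device nodes):

   # model / firmware / SATA vs SAS for each SSD
   smartctl -i /dev/sdb
   # list SCSI devices with transport type
   lsscsi -t
   # negotiated link rate per phy on the SAS HBA (if SAS attached)
   grep . /sys/class/sas_phy/phy-*/negotiated_linkrate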

> I set the SMP affinity as below:
> echo 0 > /proc/irq/410/smp_affinity_list
> echo 1 > /proc/irq/408/smp_affinity_list
> echo 2 > /proc/irq/406/smp_affinity_list
> echo 3 > /proc/irq/404/smp_affinity_list
> echo 4 > /proc/irq/402/smp_affinity_list
> echo 5 > /proc/irq/400/smp_affinity_list
> echo 6 > /proc/irq/398/smp_affinity_list
> echo 7 > /proc/irq/396/smp_affinity_list
> echo 8 > /proc/irq/394/smp_affinity_list
> echo 9 > /proc/irq/392/smp_affinity_list
> echo 10 > /proc/irq/390/smp_affinity_list
> echo 11 > /proc/irq/388/smp_affinity_list
> echo 12 > /proc/irq/386/smp_affinity_list
> echo 13 > /proc/irq/384/smp_affinity_list
> echo 14 > /proc/irq/174/smp_affinity_list
> echo 15 > /proc/irq/172/smp_affinity_list
> echo 16 > /proc/irq/170/smp_affinity_list
> echo 17 > /proc/irq/168/smp_affinity_list
> echo 18 > /proc/irq/166/smp_affinity_list
> echo 19 > /proc/irq/164/smp_affinity_list
> echo 20 > /proc/irq/162/smp_affinity_list
> echo 21 > /proc/irq/160/smp_affinity_list
> echo 22 > /proc/irq/158/smp_affinity_list
> echo 23 > /proc/irq/156/smp_affinity_list
> 
> I ran the iometer READ tests first, then the iometer WRITE tests
> after that. For the READ, cpu0 has 0% idle while the rest of the
> cores are relatively idle; cpu0 also got the fewest interrupts,
> possibly because it's too busy to process them, hence very bad
> performance (the rest of the cores are just watching).
> 

So based upon your /proc/interrupts, there is a single MSI-X interrupt
vector for the mpt2sas driver running on CPU0.

Getting a 'perf top -t $PID' of the kworker/0 thread would be helpful to
see exactly what's happening on CPU0.
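Off the top of my head, roughly the following (the exact kworker/0:*
thread and TID will differ on your box; pick the one eating CPU0):

   # locate the busy kworker bound to CPU0
   ps -eLo tid,pcpu,comm | grep 'kworker/0'
   # then profile that thread while the READ workload is running
   perf top -t <TID>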

> I then ran the iometer WRITE immediately; all cores are now equally
> busy but still not heavily loaded, and all cores are getting very
> similar interrupt increments from the WRITE.
> 
> Somehow it looks like WRITE is doing the "right" thing, but READ is
> CPU0-bound.
> 
> For either READ or WRITE, the performance (500 MB/s for READ,
> 300 MB/s for WRITE) is about 1/2 to 1/3 of what I got with SCST, and
> with SCST I did not need to do the smp_affinity_list settings.
> 

Are you sure that you weren't running in writeback mode with your other
tests..?
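Just for reference on the LIO side, the WCE bit each backstore reports
can be checked via configfs, e.g. (the iblock_0/my_md0 path below is
only an example, substitute your own backstore name):

   cat /sys/kernel/config/target/core/iblock_0/my_md0/attrib/emulate_write_cache

0 means the initiator sees write-through, 1 means writeback (WCE=1).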

The reason I mention this is that having 12 different partitions on
each of the same four SSDs, and running RAID5 across them all, means
you're effectively doing lots of large block random I/O.

Given that sequential READ performance for a single SATA 3 Gb/sec SSD is
on the order of ~250 MB/sec, best case you would be seeing 1000 MB/sec
(250 MB/sec * 4) locally with purely sequential access.  Cutting this up
12 ways per SSD would likely have some effect on performance when
operating in write-through mode.

So all that said, I'd really like to get some fio large block sequential
numbers for the 12 MD RAIDs locally in order to set a baseline for
performance.
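Something along these lines per MD device would do (device name and
runtime are only examples, adjust as needed):

   fio --name=seq-read --filename=/dev/md0 --rw=read --bs=1M \
       --ioengine=libaio --iodepth=32 --direct=1 \
       --runtime=60 --time_based --group_reporting

plus the same run with --rw=write for the sequential WRITE side.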

> Four log files are attached.
> 
> Thanks a lot for the help,
> 

So here's one small patch that changes the completion workqueue from
bounded to unbounded, so that work items queued from CPU0 in the
mpt2sas interrupt handler can run on other CPUs.  It essentially
sacrifices cache locality (and hence efficiency) in order to run more
completions in parallel on different CPUs.

diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 81e945e..b950b92 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -131,7 +131,7 @@ int init_se_kmem_caches(void)
        }
 
        target_completion_wq = alloc_workqueue("target_completion",
-                                              WQ_MEM_RECLAIM, 0);
+                                              WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
        if (!target_completion_wq)
                goto out_free_tg_pt_gp_mem_cache;
 

In my testing, it ends up using more CPU for roughly the same (or less)
small block random I/O performance.  For large block I/O, it makes no
noticeable difference.

So all that said, what I'd still like to see from you before applying
this patch is:

   - confirmation of writeback mode setting with other tests
   - perf top output for the kworker/0 thread
   - local fio performance for the 4x SSDs with 12 partitions + 12 MD 
     RAIDs, for local performance baseline.

Please send along this information at your earliest convenience.

Thanks,

--nab




