On Tue, Oct 8, 2013 at 2:45 PM, Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx> wrote:
> On Mon, 2013-10-07 at 17:32 -0500, Xianghua Xiao wrote:
>>
>
>
> <SNIP>
>
>>
>> sorry it took me a while to get back to this.
>>
>> The test is based on the default LIO-iSCSI from the 3.8.13 kernel, and
>> yes, it's a 64-bit PPC box which has 12 cores (24 when hyper-threading
>> is on).
>>
>
> Great, thanks for the requested information.
>
>> I used 12 LIO endpoints (built from 4 SSDs: each SSD has 12
>> partitions, and I made 12 RAID5 arrays, each array using 4 partitions,
>> one from each SSD). I will try 24 LIO endpoints once I get another
>> HBA card that can hook up another 4 SSDs.
>
> So to confirm, you've got 4 SSDs total, connected to an mpt2sas HBA,
> right..?
>
> Can you please give me the make/models of the 4 SSDs..?
>

The 4 SSDs are OCZ Vertor SATA III 256GB, running at PCIe 2.5 GT/s x4
lanes. We're using the MPT2SAS HBA, and the same configuration was used
for the SCST testing, which gave me 2x~3x the performance (READ is at
wire speed, i.e. 10G; WRITE is at least 700MB/s), so I tend to think the
HBA/SSDs are not the bottleneck.

> Are they SAS or SATA SSDs..? What link speed are they running at..?
>

SATA III SSDs.

>> I did the smp affinity as below:
>> echo 0 > /proc/irq/410/smp_affinity_list
>> echo 1 > /proc/irq/408/smp_affinity_list
>> echo 2 > /proc/irq/406/smp_affinity_list
>> echo 3 > /proc/irq/404/smp_affinity_list
>> echo 4 > /proc/irq/402/smp_affinity_list
>> echo 5 > /proc/irq/400/smp_affinity_list
>> echo 6 > /proc/irq/398/smp_affinity_list
>> echo 7 > /proc/irq/396/smp_affinity_list
>> echo 8 > /proc/irq/394/smp_affinity_list
>> echo 9 > /proc/irq/392/smp_affinity_list
>> echo 10 > /proc/irq/390/smp_affinity_list
>> echo 11 > /proc/irq/388/smp_affinity_list
>> echo 12 > /proc/irq/386/smp_affinity_list
>> echo 13 > /proc/irq/384/smp_affinity_list
>> echo 14 > /proc/irq/174/smp_affinity_list
>> echo 15 > /proc/irq/172/smp_affinity_list
>> echo 16 > /proc/irq/170/smp_affinity_list
>> echo 17 > /proc/irq/168/smp_affinity_list
>> echo 18 > /proc/irq/166/smp_affinity_list
>> echo 19 > /proc/irq/164/smp_affinity_list
>> echo 20 > /proc/irq/162/smp_affinity_list
>> echo 21 > /proc/irq/160/smp_affinity_list
>> echo 22 > /proc/irq/158/smp_affinity_list
>> echo 23 > /proc/irq/156/smp_affinity_list
>>
>> I run the iometer READ tests first, then run the iometer WRITE tests
>> after that. For the READ, cpu0 has 0% idle while the rest of the cores
>> are relatively idle; also, cpu0 got the fewest interrupts. It might be
>> that it's too busy to process interrupts, hence the very bad
>> performance (the rest of the cores are just watching).
>>
>
> So based upon your /proc/interrupts, there is a single MSI-X interrupt
> vector for the mpt2sas driver running on CPU0.
>
> Getting a 'perf top -t $PID' of the kworker/0 thread would be helpful to
> see exactly what's happening on CPU0.
>

I will try to get that (a rough sketch of the commands I plan to use is
at the bottom of this mail). It's interesting that WRITE does not show
similar behaviour, in that all cores are loaded.

>> I then run iometer WRITE immediately; all cores are equally busy now,
>> but still not heavily loaded, and all cores are getting very similar
>> interrupt increments caused by the WRITE.
>>
>> somehow it looks like WRITE is doing the "right" thing, but READ is
>> CPU0-bound.
>>
>> For either READ or WRITE, the performance (500MB/s for READ, 300MB/s
>> for WRITE) is about 1/2~1/3 of what I got with SCST, and for SCST I
>> did not need to do the smp_affinity_list settings.
>>
>
> Are you sure that you weren't running in writeback mode with your other
> tests..?
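
On the LIO side I will also double-check the write cache setting. As far
as I understand it is exposed per backstore device in configfs, roughly
like the following (just a sketch; "iblock_0/md_raid5_0" is a made-up
example path, the real HBA/device names come from my config):

  cat /sys/kernel/config/target/core/iblock_0/md_raid5_0/attrib/emulate_write_cache

where 0 should mean write-through and 1 write-back, if I read it right.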
>
> The reason I mention this is because having 12 different partitions on
> each of the same four SSDs, and running RAID5 across them all, means
> you're effectively doing lots of large block random I/O.
>
> Given that sequential READ performance for a single SATA 3 Gb/sec SSD is
> on the order of ~250 MB/sec, best case locally you would be seeing 1000
> MB/sec (250MB * 4) with purely sequential access. Cutting this up 12
> ways per SSD would likely have some effect on performance when operating
> in write-through mode.
>
> So all that said, I'd really like to get some fio large block sequential
> numbers for the 12 MD RAIDs locally in order to set a baseline for
> performance.
>

I'm certain I'm not using writeback mode with SCST; in fact, I don't know
how to set it up as writeback when I'm formatting all iSCSI disks as NTFS
under iometer/Windows (I only used iozone with ext4 mounted as writeback).
Plus, under iometer I'm always driving 2*RAM of disk space (i.e.
2*12GB = 24GB) to make sure we're writing to the disks instead of DDR.

>> four log files are attached.
>>
>> Thanks a lot for the help,
>>
>
> So here's one small patch that changes the completion workqueue from
> bounded to unbounded, so that work items queued from CPU0 in the
> mpt2sas interrupt handler can run on other CPUs. It essentially
> sacrifices cache locality (and hence efficiency) in order to run more
> completions in parallel on different CPUs.
>
> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index 81e945e..b950b92 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -131,7 +131,7 @@ int init_se_kmem_caches(void)
>  	}
>
>  	target_completion_wq = alloc_workqueue("target_completion",
> -					WQ_MEM_RECLAIM, 0);
> +					WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>  	if (!target_completion_wq)
>  		goto out_free_tg_pt_gp_mem_cache;
>
> In my testing, it ends up using more CPU for roughly the same (or less)
> small block random I/O performance. For large block I/O, it makes no
> noticeable difference.
>
> So all that said, what I'd still like to see from you before applying
> this patch is:
>
> - confirmation of writeback mode setting with other tests
> - perf top output for the kworker/0 thread
> - local fio performance for the 4x SSDs with 12 partitions + 12 MD
>   RAIDs, for a local performance baseline.
>
> Please send along this information at your earliest convenience.

Will work on this, thanks! (The fio and perf command lines I plan to use
are sketched below.)

>
> Thanks,
>
> --nab
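
P.S. For the local fio baseline and the kworker/0 profile, this is roughly
what I plan to run. It is only a sketch: /dev/md0 stands in for each of
the 12 MD devices, and the runtime/queue depth are guesses I can adjust.

  fio --name=seq-read --filename=/dev/md0 --direct=1 --rw=read --bs=1M \
      --ioengine=libaio --iodepth=32 --runtime=60 --time_based \
      --group_reporting

and the same with --rw=write for the sequential WRITE number (on a scratch
array only, since it overwrites the device). For the kworker/0 thread:

  ps -eLo tid,comm | grep 'kworker/0'
  perf top -t <tid from above>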