Re: lio target iscsi multiple core performance


 



On Tue, 2013-10-08 at 19:47 -0500, Xianghua Xiao wrote:
> On Tue, Oct 8, 2013 at 2:45 PM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
> > On Mon, 2013-10-07 at 17:32 -0500, Xianghua Xiao wrote:
> >>
> >
> > <SNIP>
> >
> >>
> >> sorry it took me a while to get back to this.
> >>
> >> The test is based on the default LIO-iSCSI from 3.8.13 kernel, and yes
> >> it's a 64 bit PPC box which has 12 cores (24 when hyper-threading is
> >> on).
> >>
> >
> > Great, thanks for the requested information.
> >
> >> I used 12 LIO endpoints(build them from 4 SSDS, each SSD has 12
> >> partitions, I made 12 RAID5s arrays, for each array, I used 4
> >> partitions from each SSD). I will try 24 LIO endpoints once I get
> >> another HBA card that can hook another 4 SSDs.
> >
> > So to confirm, you've got 4 SSDs total, connected to an mpt2sas HBA,
> > right..?
> >
> > Can you please give me the make/models of the 4 SSDs..?
> >
> 
> The 4 SSDs are OCZ Vertex SATA III 256GB drives running at PCIe 2.5 x4 lanes.
> We're using an MPT2SAS HBA. The same configuration was used for the SCST
> testing, which gave me 2x~3x the performance (READ at wire speed, i.e.
> 10G; WRITE at least 700 MB/s), so I don't think the HBA/SSDs are the
> bottleneck.
> 
> > Are they SAS or SATA SSDs..?  What link speed are they running at..?
> >
> SATA III SSDs.
> 
> >> I set the SMP affinity as below:
> >> echo 0 > /proc/irq/410/smp_affinity_list
> >> echo 1 > /proc/irq/408/smp_affinity_list
> >> echo 2 > /proc/irq/406/smp_affinity_list
> >> echo 3 > /proc/irq/404/smp_affinity_list
> >> echo 4 > /proc/irq/402/smp_affinity_list
> >> echo 5 > /proc/irq/400/smp_affinity_list
> >> echo 6 > /proc/irq/398/smp_affinity_list
> >> echo 7 > /proc/irq/396/smp_affinity_list
> >> echo 8 > /proc/irq/394/smp_affinity_list
> >> echo 9 > /proc/irq/392/smp_affinity_list
> >> echo 10 > /proc/irq/390/smp_affinity_list
> >> echo 11 > /proc/irq/388/smp_affinity_list
> >> echo 12 > /proc/irq/386/smp_affinity_list
> >> echo 13 > /proc/irq/384/smp_affinity_list
> >> echo 14 > /proc/irq/174/smp_affinity_list
> >> echo 15 > /proc/irq/172/smp_affinity_list
> >> echo 16 > /proc/irq/170/smp_affinity_list
> >> echo 17 > /proc/irq/168/smp_affinity_list
> >> echo 18 > /proc/irq/166/smp_affinity_list
> >> echo 19 > /proc/irq/164/smp_affinity_list
> >> echo 20 > /proc/irq/162/smp_affinity_list
> >> echo 21 > /proc/irq/160/smp_affinity_list
> >> echo 22 > /proc/irq/158/smp_affinity_list
> >> echo 23 > /proc/irq/156/smp_affinity_list
> >>
> >> I ran the iometer READ tests first, then the iometer WRITE tests
> >> after that. For READ, CPU0 is at 0% idle while the rest of the cores
> >> are relatively idle; CPU0 also gets the fewest interrupts, probably
> >> because it's too busy to process them, hence the very poor
> >> performance (the rest of the cores are just watching).
> >>
> >
> > So based upon your /proc/interrupts, there is a single MSI-X interrupt
> > vector for the mpt2sas driver running on CPU0.
> >
> > Getting a 'perf top -t $PID' of the kworker/0 thread would be helpful to
> > see exactly what's happening on CPU0.
> >
> 
> I will try to get that. It's interesting that WRITE does not have
> similar behaviour, in that all cores are loaded.
> 

Thanks.  Obtaining 'perf top -t $PID' output for the kworker/0 thread
pegged at 100%, plus 'perf top' output for the entire system, would be
extremely useful.
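
Something like the following should capture it (just a sketch; the
kworker thread ID is a placeholder, use whichever thread top shows
pegged on CPU0):

  # find the PID/TID of the busy kworker bound to CPU0 (e.g. kworker/0:1)
  ps -eLo pid,tid,psr,comm | grep 'kworker/0'

  # per-thread profile of that kworker (substitute the TID from above)
  perf top -t $TID

  # system-wide profile for comparison
  perf top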

> >> I then ran the iometer WRITE test immediately afterwards; all cores
> >> are equally busy now but still not heavily loaded, and all cores see
> >> very similar interrupt increments from the WRITE.
> >>
> >> somehow it looks like WRITE is doing the "right" thing, but READ is CPU0 bound.
> >>
> >> For either READ or WRITE, the performance (500 MB/s for READ, 300 MB/s
> >> for WRITE) is about 1/2 to 1/3 of what I got with SCST, and with SCST
> >> I did not need the smp_affinity_list settings.
> >>
> >
> > Are you sure that you weren't running in writeback mode with your other
> > tests..?
> >
> > The reason I mention this is because having 12 different partitions on
> > each of the same four SSDs, and running RAID5 across them all, means
> > you're effectively doing lots of large block random I/O.
> >
> > Given that sequential READ performance for a single SATA 3 Gb/sec SSD is
> > on the order of ~250 MB/sec, best case locally you would be seeing 1000
> > MB/sec (250MB * 4) with purely sequential access.  Cutting this up 12
> > ways per SSD would likely have some effect on performance when operating
> > in write-through mode.
> >
> > So all that said, I'd really like to get some fio large block sequential
> > numbers for the 12 MD RAIDs locally in order to set a baseline for
> > performance.
> >
> 
> I'm certain I was not using writeback mode with SCST; in fact, I don't
> know how I would set it up as writeback when I'm formatting all the
> iSCSI disks as NTFS under iometer/Windows (I have only used iozone with
> ext4 mounted as writeback).

Note that writeback mode may optionally be enabled by the client, but
typically it is not.  For targets, it usually depends upon the backend
driver in use.  For example, SCST's FILEIO will run in writeback mode
(eg: without O_DSYNC), where incoming writes hit the buffer cache first,
and dirty blocks are flushed in the background, or explicitly flushed by
the client using SYNCHRONIZE_CACHE or FUA WRITE.

target_core_file.ko has a similar mode for doing this, but is disabled
by default to favor data integrity over performance.
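
If you want to see what that difference looks like on the backend, here
is a quick local sketch using dd against a scratch file (path and sizes
are arbitrary):

  # buffered (writeback-style): data lands in page cache, flushed later
  dd if=/dev/zero of=/tmp/wb_test bs=1M count=1024

  # O_DSYNC (write-through-style): each write must reach media first
  dd if=/dev/zero of=/tmp/wt_test bs=1M count=1024 oflag=dsync

  # an explicit flush, roughly what a SYNCHRONIZE_CACHE from the
  # initiator ends up triggering on the backend
  sync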

>  Plus, under iometer I'm always driving 2*RAM worth of disk
> space (i.e. 2*12GB=24GB) to make sure we're writing to the disks
> rather than just to RAM.

A quick way to verify this is to watch 'iostat -xm 3' output on the
target, and compare it against the disk / network bandwidth generated by
the iometer workload on the Windows client side.

With writeback disabled, these values should be close to identical.
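
As a rough sketch (assuming sysstat is installed for sar), running both
of these on the target side while the iometer workload is active makes
the comparison easy:

  # per-device disk throughput on the target
  iostat -xm 3

  # per-interface network throughput, to compare against the portal NIC
  sar -n DEV 3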

> 
> >> four log files are attached.
> >>
> >> Thanks a lot for the help,
> >>
> >
> > So here's one small patch that changes the completion workqueue from
> > bounded to unbounded, so that work items queued from CPU0 in the
> > mpt2sas interrupt handler can run on other CPUs.  It essentially
> > sacrifices cache locality (and hence efficiency) in order to run more
> > completions in parallel on different CPUs.
> >
> > diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> > index 81e945e..b950b92 100644
> > --- a/drivers/target/target_core_transport.c
> > +++ b/drivers/target/target_core_transport.c
> > @@ -131,7 +131,7 @@ int init_se_kmem_caches(void)
> >         }
> >
> >         target_completion_wq = alloc_workqueue("target_completion",
> > -                                              WQ_MEM_RECLAIM, 0);
> > +                                              WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
> >         if (!target_completion_wq)
> >                 goto out_free_tg_pt_gp_mem_cache;
> >
> >
> > In my testing, it ends up using more CPU for roughly the same (or less)
> > small block random I/O performance.  For large block I/O, it makes no
> > noticeable difference.
> >
> > So all that said, what I'd still like to see from you before applying
> > this patch is:
> >
> >    - confirmation of writeback mode setting with other tests
> >    - perf top output for the kworker/0 thread
> >    - local fio performance for the 4x SSDs with 12 partitions + 12 MD
> >      RAIDs, for a local performance baseline.
> >
> > Please send along this information at your earliest convenience.
> 
> Will work on this.
> 

Great, looking forward to those.
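
For the local fio baseline, something along these lines would do (just a
sketch; it assumes fio is installed and the arrays show up as /dev/md0
through /dev/md11; adjust device names and runtime as needed):

  # sequential READ baseline against one MD array
  fio --name=seqread --filename=/dev/md0 --rw=read --bs=1M \
      --direct=1 --ioengine=libaio --iodepth=32 \
      --runtime=60 --time_based --group_reporting

  # repeat with --rw=write for the sequential WRITE baseline and for the
  # remaining /dev/mdN arrays (note: writing to the raw MD device is
  # destructive to any data on it)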

--nab




