Re: SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]

On Fri, 2012-07-06 at 17:49 +0400, James Bottomley wrote:
> On Fri, 2012-07-06 at 02:13 -0700, Nicholas A. Bellinger wrote:
> > On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
> > > On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
> > > 

<SNIP>

> > > > This bottleneck has been mentioned by various people (including myself)
> > > > on linux-scsi over the last 18 months, and I've proposed that it be
> > > > discussed at KS-2012 so we can start making some forward progress:
> > > 
> > > Well, no, it hasn't.  You randomly drop things like this into unrelated
> > > email (I suppose that is a mention in strict English construction) but
> > > it's not really enough to get anyone to pay attention since they mostly
> > > stopped reading at the top, if they got that far: most people just go by
> > > subject when wading through threads initially.
> > > 
> > 
> > It most certainly has been made clear to me, numerous times and by many
> > people in the Linux/SCSI community, that there is a bottleneck for small
> > block random I/O in SCSI core vs. raw Linux/Block, as well as vs.
> > non-Linux-based SCSI subsystems.
> > 
> > My apologies if mentioning this issue to you privately at LC 2011 last
> > year did not come across as serious enough, or if proposing a topic for
> > LSF-2012 this year was not a clear enough indication of a problem with
> > SCSI small block random I/O performance.
> > 
> > > But even if anyone noticed, a statement that RHEL6.2 (on a 2.6.32
> > > kernel, which is now nearly three years old) is 25% slower than W2k8R2
> > > on infiniband isn't really going to get anyone excited either
> > > (particularly when you mention OFED, which usually means a stack
> > > replacement on Linux anyway).
> > > 
> > 
> > The specific issue was first raised for .38, where we were able to get
> > most of the interesting high performance LLDs converted to using
> > internal locking methods so that host_lock did not have to be obtained
> > during each ->queuecommand() I/O dispatch, right..?
> > 
> > This has helped a good deal for large multi-lun scsi_host configs that
> > are now running in host-lock-less mode, but there is still a large
> > discrepancy between single LUN access and raw struct block_device access,
> > even with LLD host-lock-less mode enabled.
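
For readers not following the host_lock changes: host-lock-less mode means
the mid-layer no longer takes shost->host_lock around ->queuecommand(), so
any serialization is the LLD's own.  A minimal sketch of that shape, using
made-up demo_* names rather than code from any in-tree driver:

#include <linux/spinlock.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

struct demo_host {
	spinlock_t ring_lock;	/* protects only this LLD's submit ring */
};

/* Called without shost->host_lock held after the host-lock push-down. */
static int demo_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
{
	struct demo_host *dh = shost_priv(shost);
	unsigned long flags;

	/* Driver-internal lock instead of the global per-host host_lock. */
	spin_lock_irqsave(&dh->ring_lock, flags);
	/* ... post cmd to the hardware submit ring here ... */
	spin_unlock_irqrestore(&dh->ring_lock, flags);

	return 0;
}

static struct scsi_host_template demo_sht = {
	.name		= "demo",
	.queuecommand	= demo_queuecommand,
	/* .can_queue, .cmd_per_lun, etc. omitted for brevity */
};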
> > 
> > Now I think the virtio-blk client performance is demonstrating this
> > issue pretty vividly, along with this week's tcm_vhost IBLOCK raw block
> > flash benchmarks, which demonstrate some other yet-to-be-determined
> > limitations for virtio-scsi-raw vs. tcm_vhost for this particular fio
> > randrw workload.
> > 
> > > What people might pay attention to is evidence that there's a problem in
> > > 3.5-rc6 (without any OFED crap).  If you're not going to bother
> > > investigating, it has to be in an environment they can reproduce (so
> > > ordinary hardware, not infiniband) otherwise it gets ignored as an
> > > esoteric hardware issue.
> > > 
> > 
> > It's really quite simple for anyone to demonstrate the bottleneck
> > locally on any machine using tcm_loop with raw block flash.  Take a
> > struct block_device backend (like a Fusion-IO /dev/fio*), wrap it in
> > IBLOCK, and export locally accessible SCSI LUNs via tcm_loop.
> > 
> > Using fio, there is a significant drop in randrw 4k performance between
> > tcm_loop <-> IBLOCK and raw struct block_device backends.  And no, it's
> > not some type of target IBLOCK or tcm_loop bottleneck; it's a per-SCSI-LUN
> > limitation for small block random I/Os, on the order of ~75K IOPs for
> > each SCSI LUN.
> 
> Here, you're saying that the end-to-end SCSI stack tops out at
> around 75k iops, which is reasonably respectable if you don't employ any
> mitigation like queue steering and interrupt polling ... what were the
> mitigation techniques in the test you employed by the way?
> 

~75K per SCSI LUN in a multi-lun per host setup is optimistic, btw.
On the other side of the coin, the same pure block device can easily go
~200K IOPs per backend.
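
For anyone who wants a rough feel for the raw-block side of that comparison
without setting up fio, here is a hypothetical, quick-and-dirty userspace
sketch of the 4k random read/write access pattern.  It is synchronous
O_DIRECT at queue depth 1, so it will not come anywhere near the ~200K
numbers above (those need fio/libaio at real queue depths); it only
illustrates the workload shape, and the write half is destructive to
whatever scratch device you point it at:

/* randrw4k.c: gcc -O2 -o randrw4k randrw4k.c  (add -lrt on older glibc) */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BS 4096

int main(int argc, char **argv)
{
	struct timespec t0, t1;
	double elapsed = 0.0;
	long blocks, ios = 0;
	void *buf;
	int fd, secs;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <scratch blockdev> [seconds]\n", argv[0]);
		return 1;
	}
	secs = (argc > 2) ? atoi(argv[2]) : 10;

	fd = open(argv[1], O_RDWR | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	blocks = lseek(fd, 0, SEEK_END) / BS;	/* device size in 4k blocks */
	if (blocks <= 0 || posix_memalign(&buf, BS, BS))
		return 1;			/* O_DIRECT needs an aligned buffer */
	memset(buf, 0, BS);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	do {
		/* random 4k-aligned offset; rand() limits coverage on >8 TiB devices */
		off_t off = (off_t)(rand() % blocks) * BS;
		/* alternate reads and writes for a rough 50/50 randrw mix */
		ssize_t ret = (ios & 1) ? pwrite(fd, buf, BS, off)
					: pread(fd, buf, BS, off);
		if (ret != BS) {
			perror("io");
			break;
		}
		ios++;
		clock_gettime(CLOCK_MONOTONIC, &t1);
		elapsed = (t1.tv_sec - t0.tv_sec) +
			  (t1.tv_nsec - t0.tv_nsec) / 1e9;
	} while (elapsed < secs);

	printf("%ld IOs in %.1f s => %.0f IOPS (sync, qd=1)\n",
	       ios, elapsed, ios / elapsed);
	close(fd);
	return 0;
}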

For the simplest case with tcm_loop, a struct scsi_cmnd is queued via
cmwq to execute in process context and submit the backend I/O.  Once
completed by IBLOCK, the I/O is run through a target completion wq and
completed back to SCSI.

There is no fancy queue steering or interrupt polling going on (at least
not in tcm_loop) because it's a simple virtual SCSI LLD similar to
scsi_debug.
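
To make that path a bit more concrete, here is a rough, hypothetical sketch
of the dispatch pattern described above (vdemo_* names are made up; the real
tcm_loop queues into the target core and completes from the backend's
completion path rather than inline):

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

struct vdemo_cmd {
	struct scsi_cmnd	*sc;
	struct work_struct	work;
};

static struct workqueue_struct *vdemo_wq;	/* alloc_workqueue() at module init */

/* Runs in process context via cmwq: submit backend I/O, then complete. */
static void vdemo_submit_work(struct work_struct *work)
{
	struct vdemo_cmd *vc = container_of(work, struct vdemo_cmd, work);
	struct scsi_cmnd *sc = vc->sc;

	/*
	 * A real driver would map sc's scatterlist and submit bios to the
	 * struct block_device backend here, completing sc from the backend's
	 * completion callback instead of inline like this.
	 */
	sc->result = DID_OK << 16;
	sc->scsi_done(sc);
	kfree(vc);
}

/* ->queuecommand(): just package the scsi_cmnd and punt to the workqueue. */
static int vdemo_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
{
	struct vdemo_cmd *vc = kzalloc(sizeof(*vc), GFP_ATOMIC);

	if (!vc)
		return SCSI_MLQUEUE_HOST_BUSY;

	vc->sc = sc;
	INIT_WORK(&vc->work, vdemo_submit_work);
	queue_work(vdemo_wq, &vc->work);
	return 0;
}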

> But previously, you ascribed a performance drop of around 75% on
> virtio-scsi (topping out around 15-20k iops) to this same problem ...
> that doesn't really seem likely.
> 

No.  I ascribed the performance difference between virtio-scsi+tcm_vhost
and bare-metal raw block flash to this bottleneck in Linux/SCSI.

It's obvious that virtio-scsi-raw going through QEMU SCSI / block is
hitting some other shortcomings.

> Here are the rough ranges of concern:
> 
> 10K iops: standard arrays
> 100K iops: modern expensive fast flash drives on 6Gb links
> 1M iops: PCIe NVMexpress like devices
> 
> SCSI should do arrays with no problem at all, so I'd be really concerned
> that it can't make 0-20k iops.  If you push the system and fine tune it,
> SCSI can just about get to 100k iops.  1M iops is still a stretch goal
> for pure block drivers.
> 

1M iops is not a stretch for pure block drivers anymore on commodity
hardware.  5 Fusion-IO HBAs + Romley HW can easily go 1M random 4k IOPs
using a pure block driver.

The point is that it would currently take at least 2x the number of SCSI
LUNs (at ~75K IOPs per LUN vs. ~200K per raw block backend) in order to
even get close to 1M IOPs with a single LLD.  And from the feedback from
everyone I've talked to, no one has been able to make Linux/SCSI go 1M
IOPs with any kernel.

--nab
