On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
> On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
> >
> > So I'm pretty sure this discrepancy is attributed to the small block
> > random I/O bottleneck currently present for all Linux/SCSI core LLDs,
> > regardless of physical or virtual storage fabric.
> >
> > The SCSI-wide host-lock-less conversion that happened in .38 code back
> > in 2010, and subsequently having LLDs like virtio-scsi convert to run
> > in host-lock-less mode, have helped to some extent..  But it's still
> > not enough..
> >
> > Another example where we've been able to prove this bottleneck
> > recently is with the following target setup:
> >
> > *) Intel Romley production machines with 128 GB of DDR-3 memory
> > *) 4x FusionIO ioDrive 2 (1.5 TB @ PCI-e Gen2 x2)
> > *) Mellanox PCI-express Gen3 HCA running at 56 Gb/sec
> > *) InfiniBand SRP target backported to RHEL 6.2 + latest OFED
> >
> > In this setup, using an ib_srpt + IBLOCK w/ emulate_write_cache=1 +
> > iomemory_vsl export, we end up avoiding the SCSI core bottleneck on
> > the target machine, just as with the tcm_vhost example here for host
> > kernel side processing with vhost.
> >
> > Using the Linux IB SRP initiator + the Windows Server 2008 R2
> > SCSI-miniport SRP (OFED) initiator connected to four ib_srpt LUNs,
> > we've observed that MSFT SCSI is currently outperforming RHEL 6.2 on
> > the order of ~285K vs. ~215K IOPS with heavy random 4k WRITE
> > iometer / fio tests.  Note this is with an optimized queue_depth
> > ib_srp client w/ the noop I/O scheduler, but still lacking the
> > host_lock-less patches on RHEL 6.2 OFED..
> >
> > This bottleneck has been mentioned by various people (including
> > myself) on linux-scsi over the last 18 months, and I've proposed that
> > it be discussed at KS-2012 so we can start making some forward
> > progress:
>
> Well, no, it hasn't.  You randomly drop things like this into unrelated
> email (I suppose that is a mention in strict English construction) but
> it's not really enough to get anyone to pay attention since they mostly
> stopped reading at the top, if they got that far: most people just go by
> subject when wading through threads initially.
>

It most certainly has been made clear to me, numerous times and by many
people in the Linux/SCSI community, that there is a bottleneck for small
block random I/O in SCSI core vs. raw Linux/Block, as well as vs.
non-Linux-based SCSI subsystems.

My apologies if mentioning this issue to you privately at LC 2011 last
year did not convey a serious enough tone, or if proposing a topic for
LSF-2012 this year was not a clear enough indication of a problem with
SCSI small block random I/O performance.

> But even if anyone noticed, a statement that RHEL6.2 (on a 2.6.32
> kernel, which is now nearly three years old) is 25% slower than W2k8R2
> on infiniband isn't really going to get anyone excited either
> (particularly when you mention OFED, which usually means a stack
> replacement on Linux anyway).
>

The specific issue was first raised for .38, where we were able to get
most of the interesting high-performance LLDs converted to using
internal locking methods so that host_lock did not have to be obtained
during each ->queuecommand() I/O dispatch, right..?

This has helped a good deal for large multi-LUN scsi_host configs that
are now running in host-lock-less mode, but there is still a large
discrepancy between single SCSI LUN and raw struct block_device access,
even with LLD host_lock-less mode enabled.
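For reference, the kind of fio workload I mean for showing this
single-LUN vs. raw block device gap looks roughly like the following.
This is an illustrative sketch rather than the exact job file from the
runs above, and the device paths (/dev/sdX for the exported SCSI LUN,
/dev/fioa for the raw backend) are placeholders for whatever your setup
uses:

  # 4k random read/write against the SCSI LUN exported by the LLD
  fio --name=lun-randrw --filename=/dev/sdX --rw=randrw --bs=4k \
      --ioengine=libaio --direct=1 --iodepth=32 --numjobs=4 \
      --runtime=60 --time_based --group_reporting

  # Identical job against the raw struct block_device backend
  fio --name=raw-randrw --filename=/dev/fioa --rw=randrw --bs=4k \
      --ioengine=libaio --direct=1 --iodepth=32 --numjobs=4 \
      --runtime=60 --time_based --group_reporting

Comparing the IOPS numbers between the two runs on otherwise idle
hardware is enough to show the discrepancy I'm describing.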
Now I think the virtio-blk client performance is demonstrating this
issue pretty vividly, along with this week's tcm_vhost IBLOCK raw block
flash benchmarks that demonstrate some other yet-to-be-determined
limitations of virtio-scsi-raw vs. tcm_vhost for this particular fio
randrw workload.

> What people might pay attention to is evidence that there's a problem
> in 3.5-rc6 (without any OFED crap).  If you're not going to bother
> investigating, it has to be in an environment they can reproduce (so
> ordinary hardware, not infiniband) otherwise it gets ignored as an
> esoteric hardware issue.
>

It's really quite simple for anyone to demonstrate the bottleneck
locally on any machine using tcm_loop with raw block flash.  Take a
struct block_device backend (like a FusionIO /dev/fio*) and export it
via tcm_loop using IBLOCK as locally accessible SCSI LUNs (a rough
recipe is in the P.S. below).  Using fio, there is a significant drop
in randrw 4k performance between the tcm_loop <-> IBLOCK export and the
raw struct block_device backend.

And no, it's not some type of target IBLOCK or tcm_loop bottleneck;
it's a per-SCSI-LUN limitation for small block random I/O on the order
of ~75K IOPS for each SCSI LUN.

If anyone has actually gone faster than this with any single SCSI LUN
on any storage fabric, I would be interested in hearing about your
setup.

Thanks,

--nab
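P.S. For anyone who wants to reproduce this locally without any special
fabric hardware, the tcm_loop export can be set up directly via
configfs, roughly as below.  This is a sketch from memory: treat the
naa. WWNs, the backstore name, and the exact configfs paths as
illustrative, and double-check against Documentation/target/ in the
kernel tree (or just use targetcli) before relying on it.

  # Load the target core and the tcm_loop fabric module
  modprobe target_core_mod
  modprobe tcm_loop

  # Register the raw flash block device as an IBLOCK backstore
  # (/dev/fioa is just an example device path)
  mkdir -p /sys/kernel/config/target/core/iblock_0/fioa
  echo "udev_path=/dev/fioa" > /sys/kernel/config/target/core/iblock_0/fioa/control
  echo 1 > /sys/kernel/config/target/core/iblock_0/fioa/enable

  # Create a tcm_loop target port group and export the backstore as LUN 0
  mkdir -p /sys/kernel/config/target/loopback/naa.6001405123456789/tpgt_1/lun/lun_0
  ln -s /sys/kernel/config/target/core/iblock_0/fioa \
        /sys/kernel/config/target/loopback/naa.6001405123456789/tpgt_1/lun/lun_0/virtual_scsi_port

  # Establish the local I_T nexus so the LUN shows up as a SCSI disk
  echo naa.6001405987654321 > /sys/kernel/config/target/loopback/naa.6001405123456789/tpgt_1/nexus

Once the new /dev/sdX device appears (check dmesg), run the same kind
of fio randrw job shown earlier against it and against the raw backend
device, and compare the IOPS.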