Steve,

Thanks for sharing the details of your problem. Yes, you're right about the test I was talking about. Now I understand what you want to discuss in this thread.

Jack

Right, but if I understand you correctly, you're ganging up 24 device queues and measuring aggregate iops across them all. That is, you have 24 SAS disks all presented individually to the OS, right? (Or did the controller aggregate them all into one logical drive presented to the OS?) I'm talking about one very low latency single device capable of, let's say, 450k iops all by itself.

The problem is that with the scsi mid layer in this case, there can only be a single request queue feeding that one device (unlike your 24 request queues feeding 24 devices). That single request queue is essentially single threaded -- only one cpu can touch it at a time to add or remove a request from it. With the block layer's make_request interface, I can take advantage of parallelism in the low level block driver and get essentially a queue per cpu feeding the single device (a rough sketch contrasting the two submission models appears further down). With the scsi mid layer, the low level driver's queue per cpu is (if I am correct) throttled by the fact that what is feeding those lld queues is one essentially single threaded request queue. It doesn't matter that the scsi LLD has a twelve lane highway leading into it, because the scsi midlayer has a one lane highway feeding into that twelve lane highway.

If I understand you correctly, you get 800k iops by measuring 24 highways going to 24 different towns. I have one town and one highway. The part of my highway that I control can handle several hundred kiops, but the part I don't control seemingly cannot.

That is why the scsi_debug driver can't get very high iops on a single pseudo-device: there's only one request queue, and that queue is protected by a spin lock. perf shows contention on spin locks in scsi_request_fn() -- a large percentage of cpu time spent trying to get spin locks in scsi_request_fn(). I forget the exact number right now, but iirc it was something like 30-40%.

That is sort of the whole problem I'm having, as best I understand it, and why I started this thread. And unfortunately I do not have any very good ideas about what to do about it, other than use the block layer's make_request interface, which is not ideal for a number of reasons (e.g. people and software (grub, etc.) are very much accustomed to dealing with the sd driver, and all other things being equal, using the sd driver interface is very much preferable).

With flash based storage devices, the age old assumptions underlying the design of the linux storage subsystem architecture (that "disks" are glacially slow compared to the cpu(s), and that seek penalties exist and are to be avoided) are starting to become false. That's kind of the "big picture" view of the problem. Part of me thinks what we really ought to do is make the non-volatile storage look like RAM at the hardware level, more or less, then put a ramfs on top of it, and call it done (there are probably myriad reasons it's not that simple of which I'm ignorant).

-- steve

On Thu, Dec 13, 2012 at 7:41 PM, Jack Wang <jack_wang@xxxxxxxxx> wrote:

Maybe, and good to know for real-world scenarios, but scsi-debug with fake_rw=1 isn't even actually doing the i/o. I would think sequential, random, whatever wouldn't matter in that case, because presumably it's not even looking at the LBAs, much less acting on them, nor would I expect the no-op i/o scheduler to be affected by the LBAs.

-- steve
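[Editorial sketch: to make the request_fn vs. make_request contrast above concrete, here is a minimal illustration of the two submission models as they looked in 3.x-era kernels (the era of this thread). This is not code from Steve's driver or from scsi_debug; the names my_dev, my_request_fn, my_make_request and my_dev_init_queue are made up, and real hardware submission and error handling are omitted.]

#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/gfp.h>
#include <linux/smp.h>
#include <linux/spinlock.h>
#include <linux/printk.h>
#include <linux/types.h>

struct my_dev {
	struct request_queue *queue;
	spinlock_t lock;	/* the one lock of the request_fn model */
};

/*
 * Model A: request_fn.  The block layer queues requests on a single
 * request_queue protected by q->queue_lock (scsi_request_fn() is the
 * scsi mid layer's version of this).  Every submitting cpu takes that
 * one lock before the LLD ever sees a request -- the "one lane highway".
 * Called with the queue lock held.
 */
static void my_request_fn(struct request_queue *q)
{
	struct request *req;

	while ((req = blk_fetch_request(q)) != NULL) {
		/* a real driver would hand req to the hardware here */
		__blk_end_request_all(req, 0);	/* complete immediately in this sketch */
	}
}

/*
 * Model B: make_request.  bios reach the driver without going through
 * the shared request_queue, so the driver can feed a per-cpu hardware
 * submit queue and do its own (finer grained) locking.
 */
static void my_make_request(struct request_queue *q, struct bio *bio)
{
	int cpu = get_cpu();

	/*
	 * A real driver would put the bio on the submit queue owned by
	 * this cpu, ring that queue's doorbell, and call bio_endio()
	 * later from the matching reply queue's interrupt handler.
	 */
	pr_debug("bio for sector %llu on cpu %d\n",
		 (unsigned long long)bio->bi_sector, cpu);
	put_cpu();
	bio_endio(bio, 0);	/* complete immediately in this sketch */
}

static int my_dev_init_queue(struct my_dev *dev, bool use_make_request)
{
	if (use_make_request) {
		dev->queue = blk_alloc_queue(GFP_KERNEL);
		if (!dev->queue)
			return -ENOMEM;
		blk_queue_make_request(dev->queue, my_make_request);
	} else {
		spin_lock_init(&dev->lock);
		dev->queue = blk_init_queue(my_request_fn, &dev->lock);
		if (!dev->queue)
			return -ENOMEM;
	}
	return 0;
}

The only point of the contrast is where the serialization lives: in model A every submitting cpu funnels through q->queue_lock before the lld sees anything, while in model B the lld decides how much locking it actually needs in front of a single fast device.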
For real-world hardware: I tested with a next generation PMCS SAS controller with 24 SAS disks and got more than 800K iops for 512-byte sequential reads and more than 500K iops for 512-byte sequential writes, with similar results on Windows 2008, but SATA performance was worse than on Windows. The kernel was 3.2.x, as I remember.

Jack

On Thu, Dec 13, 2012 at 6:22 PM, Jack Wang <jack_wang@xxxxxxxxx> wrote:

On 12/13/12 18:25, scameron@xxxxxxxxxxxxxxxxxx wrote:
> On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote:
>> On 12/11/12 01:00, scameron@xxxxxxxxxxxxxxxxxx wrote:
>>> The driver, like nvme, has a submit and reply queue per cpu.
>>
>> This is interesting. If my interpretation of the POSIX spec is
>> correct, then aio_write() allows one to queue overlapping writes, and
>> all writes submitted by the same thread have to be performed in the
>> order they were submitted by that thread. What if a thread submits a
>> first write via aio_write(), gets rescheduled on another CPU and
>> submits a second overlapping write also via aio_write()? If a block
>> driver uses one queue per CPU, does that mean that such writes that
>> were issued in order can be executed in a different order by the
>> driver and/or hardware than the order in which the writes were
>> submitted?
>>
>> See also the aio_write() man page, The Open Group Base Specifications
>> Issue 7, IEEE Std 1003.1-2008
>> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html).
>
> It is my understanding that the low level driver is free to re-order
> the i/o's any way it wants, as is the hardware. It is up to the
> layers above to enforce any ordering requirements. For a long time
> there was a bug in the cciss driver such that all i/o's submitted to
> the driver got reversed in order -- adding to the head of a list
> instead of to the tail, or vice versa, I forget which -- and it caused
> no real problems (apart from some slight performance issues that were
> mostly masked by the Smart Array's cache). It was caught by firmware
> guys noticing LBAs coming in in weird orders for supposedly sequential
> workloads.
>
> So in your scenario, I think the overlapping writes should not be
> submitted by the block layer to the low level driver concurrently, as
> the block layer is aware that the lld is free to re-order things. (I
> am very certain that this is the case for scsi low level drivers and
> block drivers using a request_fn interface -- less certain about block
> drivers using the make_request interface to submit i/o's, as this
> interface is pretty new to me.)

As far as I know there are basically two choices:

1. Allow the LLD to reorder any pair of write requests. The only way
   for higher layers to ensure the order of (overlapping) writes is then
   to separate these in time. Or in other words, limit write request
   queue depth to one (a userspace sketch of this approach appears at
   the end of the thread).

2. Do not allow the LLD to reorder overlapping write requests. This
   allows higher software layers to queue write requests (queue depth > 1).

From my experience with block and SCSI drivers, option (1) doesn't look
attractive from a performance point of view. From what I have seen,
performance with QD=1 is several times lower than performance with QD > 1.
But maybe I overlooked something?

Bart.

I have seen low queue depth improve sequential performance, and high
queue depth improve random performance.

Jack
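[Editorial sketch: Bart's option (1) above pushes ordering of overlapping writes up to the application. Below is a minimal userspace illustration using plain POSIX AIO, assuming the layers underneath are free to reorder queued writes: the second overlapping write is submitted only after the first has completed, i.e. queue depth one for that byte range. The helper name write_and_wait, the file name "testfile" and the 512-byte writes are made up for illustration; this is about the submission pattern, not a benchmark. Link with -lrt.]

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Submit one asynchronous write and block until it has completed. */
static int write_and_wait(int fd, const void *buf, size_t len, off_t off)
{
	struct aiocb cb;
	const struct aiocb *list[1] = { &cb };

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf    = (void *)buf;
	cb.aio_nbytes = len;
	cb.aio_offset = off;

	if (aio_write(&cb) != 0)
		return -1;

	/*
	 * Wait for completion before the caller is allowed to queue an
	 * overlapping write, since driver and hardware may reorder
	 * anything that is in flight at the same time.
	 */
	while (aio_error(&cb) == EINPROGRESS)
		aio_suspend(list, 1, NULL);

	return aio_return(&cb) == (ssize_t)len ? 0 : -1;
}

int main(void)
{
	char a[512], b[512];
	int fd = open("testfile", O_WRONLY | O_CREAT, 0644);	/* hypothetical target */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(a, 'A', sizeof(a));
	memset(b, 'B', sizeof(b));

	/*
	 * Two overlapping 512-byte writes to offset 0, serialized so the
	 * final on-media contents are deterministic ('B' wins).
	 */
	if (write_and_wait(fd, a, sizeof(a), 0) != 0 ||
	    write_and_wait(fd, b, sizeof(b), 0) != 0) {
		perror("aio write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}

Option (2) would move this responsibility down into the driver and hardware, letting the application keep overlapping writes queued at QD > 1.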