On Wed, Feb 1, 2017 at 1:13 PM, Hannes Reinecke <hare@xxxxxxx> wrote:
>
> On 02/01/2017 08:07 AM, Kashyap Desai wrote:
>>> -----Original Message-----
>>> From: Hannes Reinecke [mailto:hare@xxxxxxx]
>>> Sent: Wednesday, February 01, 2017 12:21 PM
>>> To: Kashyap Desai; Christoph Hellwig
>>> Cc: Martin K. Petersen; James Bottomley; linux-scsi@xxxxxxxxxxxxxxx;
>>> Sathya Prakash Veerichetty; PDL-MPT-FUSIONLINUX; Sreekanth Reddy
>>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>>
>>> On 01/31/2017 06:54 PM, Kashyap Desai wrote:
>>>>> -----Original Message-----
>>>>> From: Hannes Reinecke [mailto:hare@xxxxxxx]
>>>>> Sent: Tuesday, January 31, 2017 4:47 PM
>>>>> To: Christoph Hellwig
>>>>> Cc: Martin K. Petersen; James Bottomley; linux-scsi@xxxxxxxxxxxxxxx;
>>>>> Sathya Prakash; Kashyap Desai; mpt-fusionlinux.pdl@xxxxxxxxxxxx
>>>>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>>>>
>>>>> On 01/31/2017 11:02 AM, Christoph Hellwig wrote:
>>>>>> On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> this is a patchset to enable full multiqueue support for the
>>>>>>> mpt3sas driver.
>>>>>>> While the HBA only has a single mailbox register for submitting
>>>>>>> commands, it does have individual receive queues per MSI-X
>>>>>>> interrupt and as such does benefit from converting it to full
>>>>>>> multiqueue support.
>>>>>>
>>>>>> Explanation and numbers on why this would be beneficial, please.
>>>>>> We should not need multiple submission queues for a single register
>>>>>> to benefit from multiple completion queues.
>>>>>>
>>>>> Well, the actual throughput very strongly depends on the blk-mq-sched
>>>>> patches from Jens.
>>>>> As this is barely finished I didn't post any numbers yet.
>>>>>
>>>>> However:
>>>>> With multiqueue support:
>>>>> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt=60021msec
>>>>> With scsi-mq on 1 queue:
>>>>> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt=60028msec
>>>>> So yes, there _is_ a benefit.

Hannes,

I have created an md raid0 with 4 SAS SSD drives using the below command,

#mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdg /dev/sdh /dev/sdi /dev/sdj

And here is the 'mdadm --detail /dev/md0' command output,
--------------------------------------------------------------------------------
/dev/md0:
        Version : 1.2
  Creation Time : Thu Feb  9 14:38:47 2017
     Raid Level : raid0
     Array Size : 780918784 (744.74 GiB 799.66 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Feb  9 14:38:47 2017
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : host_name
           UUID : b63f9da7:b7de9a25:6a46ca00:42214e22
         Events : 0

    Number   Major   Minor   RaidDevice   State
       0       8       96        0        active sync   /dev/sdg
       1       8      112        1        active sync   /dev/sdh
       2       8      144        2        active sync   /dev/sdj
       3       8      128        3        active sync   /dev/sdi
--------------------------------------------------------------------------------

Then I used the below fio profile to run 4K sequential read operations with the
nr_hw_queues=1 driver and with the nr_hw_queues=24 driver (as my system has two
NUMA nodes, each with 12 CPUs).
-----------------------------------------------------
[global]
ioengine=libaio
group_reporting
direct=1
rw=read
bs=4k
allow_mounted_write=0
iodepth=128
runtime=150s

[job1]
filename=/dev/md0
-----------------------------------------------------

Here are the fio results with nr_hw_queues=1 (i.e. a single request queue) at
various job counts:

1 JOB  4k read : io=213268MB, bw=1421.8MB/s, iops=363975, runt=150001msec
2 JOBs 4k read : io=309605MB, bw=2064.2MB/s, iops=528389, runt=150001msec
4 JOBs 4k read : io=281001MB, bw=1873.4MB/s, iops=479569, runt=150002msec
8 JOBs 4k read : io=236297MB, bw=1575.2MB/s, iops=403236, runt=150016msec

Here are the fio results with nr_hw_queues=24 (i.e. multiple request queues) at
various job counts:

1 JOB  4k read : io=95194MB,  bw=649852KB/s, iops=162463, runt=150001msec
2 JOBs 4k read : io=189343MB, bw=1262.3MB/s, iops=323142, runt=150001msec
4 JOBs 4k read : io=314832MB, bw=2098.9MB/s, iops=537309, runt=150001msec
8 JOBs 4k read : io=277015MB, bw=1846.8MB/s, iops=472769, runt=150001msec

Here we can see that at lower job counts the single request queue
(nr_hw_queues=1) gives more IOPS than the multiple request queues
(nr_hw_queues=24).

Can you please share your fio profile, so that I can try the same thing on my
system?

Thanks,
Sreekanth

>>>>> (Which is actually quite cool, as these tests were done on a SAS3 HBA,
>>>>> so we're getting close to the theoretical maximum of 1.2GB/s).
>>>>> (Unlike the single-queue case :-)
>>>>
>>>> Hannes -
>>>>
>>>> Can you share details about the setup? How many drives do you have, and
>>>> how are they connected (enclosure -> drives?)?
>>>> To me it looks like the current mpt3sas driver might be taking a bigger hit
>>>> on spinlock operations (the penalty on a NUMA arch is higher compared to a
>>>> single core server), unlike the shared blk tag approach we have in the
>>>> megaraid_sas driver.
>>>>
>>> The tests were done with a single LSI SAS3008 connected to a NetApp
>>> E-series (2660), using 4 LUNs under MD-RAID0.
>>>
>>> Megaraid_sas is even worse here; due to the odd nature of the 'fusion'
>>> implementation we're ending up having _two_ sets of tags, making it really
>>> hard to use scsi-mq here.
>>
>> The current megaraid_sas, with a single submission queue exposed to blk-mq,
>> will not encounter a similar performance issue.
>> We may not see a significant performance improvement if we attempt the same
>> for the megaraid_sas driver.
>> We had a similar discussion for megaraid_sas and hpsa:
>> http://www.spinics.net/lists/linux-scsi/msg101838.html
>>
>> I see this patch series as a similar attempt for mpt3sas. Am I missing
>> anything?
>>
> No, you don't. That is precisely the case.
>
> The difference here is that mpt3sas is actually exposing hardware
> capabilities, whereas with megaraid_sas (and hpsa) we're limited by the
> hardware implementation to a single completion queue shared between HBA and OS.
> With mpt3sas we're having per-interrupt completion queues (well, for newer
> firmware :-) so we can take advantage of scsi-mq.
>
> (And if someone had done a _proper_ design of the megaraid_sas_fusion thing by
> exposing several submission and completion queues for megaraid_sas itself,
> instead of bolting the existing megaraid_sas single-queue approach on top of
> the mpt3sas multiqueue design, we could have done the same thing there ... sigh)
>
>> The megaraid_sas driver just does indexing from the blk_tag and fires the I/O
>> quickly enough, unlike mpt3sas where we have lock contention at the driver
>> level as the bottleneck.
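
To make the "indexing from blk_tag" point above concrete, a minimal sketch of
the pattern (with made-up my_cmd/my_adapter names, not the actual megaraid_sas
or mpt3sas code) would look roughly like this: the block-layer tag directly
selects a preallocated per-command descriptor, so the submission path needs no
driver spinlock or free-list search.

/*
 * Minimal sketch of tag-based command lookup in a hypothetical driver:
 * the blk-mq tag indexes a preallocated array of command descriptors,
 * so no driver lock or free-list walk is needed on the submission path.
 * 'struct my_cmd' and 'struct my_adapter' are invented names.
 */
#include <linux/types.h>
#include <linux/blkdev.h>
#include <scsi/scsi_cmnd.h>

struct my_cmd {
	u16 smid;			/* firmware message index */
	struct scsi_cmnd *scmd;		/* back-pointer for completion */
};

struct my_adapter {
	struct my_cmd *cmd_table;	/* sized to shost->can_queue at init */
};

static struct my_cmd *my_cmd_from_tag(struct my_adapter *ioc,
				      struct scsi_cmnd *scmd)
{
	/* with a single hardware queue, the tag is unique within the queue depth */
	return &ioc->cmd_table[scmd->request->tag];
}

With several blk-mq hardware queues the per-queue tag alone is no longer unique
host-wide; blk_mq_unique_tag() encodes the hardware queue number as well (see
the sketch at the end of this mail).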
>>
>>> (Not that I didn't try; but lacking a proper backend it's really hard to
>>> evaluate the benefit of those ... spinning HDDs simply don't cut it here)
>>>
>>>> I mean the "[PATCH 08/10] mpt3sas: lockless command submission for scsi-mq"
>>>> patch is improving performance by removing spinlock overhead and
>>>> attempting to get the request using blk_tags.
>>>> Are you seeing a performance improvement if you hard-code nr_hw_queues=1
>>>> in the code changes that are part of "[PATCH 10/10] mpt3sas: scsi-mq
>>>> interrupt steering"?
>>>>
>>> No. The numbers posted above are generated with exactly that patch; the
>>> first line is running with nr_hw_queues=32 and the second line with
>>> nr_hw_queues=1.
>>
>> Thanks Hannes. That clarifies. Can you share the <fio> script you have used?
>>
>> If my understanding is correct, you will see the theoretical maximum of
>> 1.2GB/s if you restrict your workload to a single NUMA node. This is just to
>> understand whether the <mpt3sas> driver spinlocks are adding overhead. We
>> have seen such overhead on multi-socket servers, and it is reasonable to
>> reduce locking in the mpt3sas driver; my only concern is that exposing the
>> HBA as multiple submission queues to blk-mq is really not required, and I am
>> trying to figure out whether doing that has any side effects.
>>
> Well, the HBA has per-MSIx completion queues, so I don't see any issues with
> exposing them.
> blk-mq is designed to handle per-CPU queues, so exposing all hardware queues
> will be beneficial, especially in a low-latency context; and as the experiments
> show, even when connected to external storage there is a benefit to be had.
>
> But exposing all queues might even reduce or even resolve your FW Fault status
> 0x2100 state; with that patch you now have each queue pulling requests off the
> completion queue and updating the reply post host index in parallel, making
> that situation far more unlikely.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                   zSeries & Storage
> hare@xxxxxxx                          +49 911 74053 688
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
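
For reference, a rough sketch of the two pieces discussed in the quoted thread
above -- advertising one blk-mq hardware queue per MSI-X reply queue, and
steering a command to the matching reply queue from its tag. The 'struct my_ioc'
adapter structure and function names are hypothetical; this is not the actual
mpt3sas patch.

/*
 * Rough sketch of per-MSI-X hardware queue exposure; 'struct my_ioc'
 * is a made-up adapter structure, not the real driver code.
 */
#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

struct my_ioc {
	struct Scsi_Host *shost;
	int reply_queue_count;	/* MSI-X vectors granted to the HBA */
};

/* At probe time, before scsi_add_host(): one hw queue per reply queue. */
static void my_set_hw_queues(struct my_ioc *ioc)
{
	ioc->shost->nr_hw_queues = ioc->reply_queue_count;
}

/*
 * In the submission path: derive the blk-mq hardware queue the command
 * was issued on and use it as the MSI-X (reply post queue) index, so the
 * completion comes back on the queue local to the submitting CPU.
 */
static u16 my_msix_index(struct scsi_cmnd *scmd)
{
	u32 unique_tag = blk_mq_unique_tag(scmd->request);

	return blk_mq_unique_tag_to_hwq(unique_tag);
}

A real implementation would also have to size the tag set and spread the MSI-X
vectors across CPUs via interrupt affinity, which is omitted here.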