On Wed, Feb 1, 2017 at 1:13 PM, Hannes Reinecke <hare@xxxxxxx> wrote:
>
> On 02/01/2017 08:07 AM, Kashyap Desai wrote:
>>> -----Original Message-----
>>> From: Hannes Reinecke [mailto:hare@xxxxxxx]
>>> Sent: Wednesday, February 01, 2017 12:21 PM
>>> To: Kashyap Desai; Christoph Hellwig
>>> Cc: Martin K. Petersen; James Bottomley; linux-scsi@xxxxxxxxxxxxxxx;
>>> Sathya Prakash Veerichetty; PDL-MPT-FUSIONLINUX; Sreekanth Reddy
>>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>>
>>> On 01/31/2017 06:54 PM, Kashyap Desai wrote:
>>>>> -----Original Message-----
>>>>> From: Hannes Reinecke [mailto:hare@xxxxxxx]
>>>>> Sent: Tuesday, January 31, 2017 4:47 PM
>>>>> To: Christoph Hellwig
>>>>> Cc: Martin K. Petersen; James Bottomley; linux-scsi@xxxxxxxxxxxxxxx;
>>>>> Sathya Prakash; Kashyap Desai; mpt-fusionlinux.pdl@xxxxxxxxxxxx
>>>>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>>>>
>>>>> On 01/31/2017 11:02 AM, Christoph Hellwig wrote:
>>>>>> On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> this is a patchset to enable full multiqueue support for the
>>>>>>> mpt3sas driver.
>>>>>>> While the HBA only has a single mailbox register for submitting
>>>>>>> commands, it does have individual receive queues per MSI-X
>>>>>>> interrupt and as such does benefit from converting it to full
>>>>>>> multiqueue support.
>>>>>>
>>>>>> Explanation and numbers on why this would be beneficial, please.
>>>>>> We should not need multiple submission queues for a single register
>>>>>> to benefit from multiple completion queues.
>>>>>>
>>>>> Well, the actual throughput very strongly depends on the blk-mq-sched
>>>>> patches from Jens.
>>>>> As this is barely finished I didn't post any numbers yet.
>>>>>
>>>>> However:
>>>>> With multiqueue support:
>>>>> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt=60021msec
>>>>> With scsi-mq on 1 queue:
>>>>> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt=60028msec
>>>>> So yes, there _is_ a benefit.

Hannes,

I have created an md raid0 with 4 SAS SSD drives using the below command,

#mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdg /dev/sdh /dev/sdi /dev/sdj

And here is the 'mdadm --detail /dev/md0' command output,
--------------------------------------------------------------------------------
/dev/md0:
        Version : 1.2
  Creation Time : Thu Feb  9 14:38:47 2017
     Raid Level : raid0
     Array Size : 780918784 (744.74 GiB 799.66 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Feb  9 14:38:47 2017
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : host_name
           UUID : b63f9da7:b7de9a25:6a46ca00:42214e22
         Events : 0

    Number   Major   Minor   RaidDevice   State
       0       8       96        0        active sync   /dev/sdg
       1       8      112        1        active sync   /dev/sdh
       2       8      144        2        active sync   /dev/sdj
       3       8      128        3        active sync   /dev/sdi
--------------------------------------------------------------------------------

Then I used the below fio profile to run 4K sequential read operations with the
nr_hw_queues=1 driver and with the nr_hw_queues=24 driver (as my system has two
NUMA nodes, each with 12 CPUs).
-----------------------------------------------------
[global]
ioengine=libaio
group_reporting
direct=1
rw=read
bs=4k
allow_mounted_write=0
iodepth=128
runtime=150s

[job1]
filename=/dev/md0
-----------------------------------------------------

Here are the fio results with nr_hw_queues=1 (i.e. a single request queue) at
various job counts:

1 JOB  4k read : io=213268MB, bw=1421.8MB/s, iops=363975, runt=150001msec
2 JOBs 4k read : io=309605MB, bw=2064.2MB/s, iops=528389, runt=150001msec
4 JOBs 4k read : io=281001MB, bw=1873.4MB/s, iops=479569, runt=150002msec
8 JOBs 4k read : io=236297MB, bw=1575.2MB/s, iops=403236, runt=150016msec

Here are the fio results with nr_hw_queues=24 (i.e. multiple request queues) at
various job counts:

1 JOB  4k read : io=95194MB,  bw=649852KB/s, iops=162463, runt=150001msec
2 JOBs 4k read : io=189343MB, bw=1262.3MB/s, iops=323142, runt=150001msec
4 JOBs 4k read : io=314832MB, bw=2098.9MB/s, iops=537309, runt=150001msec
8 JOBs 4k read : io=277015MB, bw=1846.8MB/s, iops=472769, runt=150001msec

Here we can see that at lower job counts the single request queue
(nr_hw_queues=1) gives more IOPS than the multiple request queues
(nr_hw_queues=24).

Can you please share your fio profile, so that I can try the same thing on my
system?

Thanks,
Sreekanth

>>>>> (Which is actually quite cool, as these tests were done on a SAS3 HBA,
>>>>> so we're getting close to the theoretical maximum of 1.2GB/s).
>>>>> (Unlike the single-queue case :-)
>>>>
>>>> Hannes -
>>>>
>>>> Can you share details about the setup? How many drives do you have, and
>>>> how are they connected (enclosure -> drives?)?
>>>> To me it looks like the current mpt3sas driver might be taking a bigger hit
>>>> on spinlock operations (the penalty on a NUMA arch is higher compared to a
>>>> single core server), unlike the shared blk tag approach we have in the
>>>> megaraid_sas driver.
>>>>
>>> The tests were done with a single LSI SAS3008 connected to a NetApp
>>> E-series (2660), using 4 LUNs under MD-RAID0.
>>>
>>> Megaraid_sas is even worse here; due to the odd nature of the 'fusion'
>>> implementation we're ending up having _two_ sets of tags, making it really
>>> hard to use scsi-mq here.
>>
>> The current megaraid_sas, with a single submission queue exposed to blk-mq,
>> will not encounter a similar performance issue.
>> We may not see a significant performance improvement if we attempt the same
>> for the megaraid_sas driver.
>> We had a similar discussion for megaraid_sas and hpsa:
>> http://www.spinics.net/lists/linux-scsi/msg101838.html
>>
>> I see this patch series as a similar attempt for mpt3sas. Am I missing
>> anything?
>>
> No, you don't. That is precisely the case.
>
> The difference here is that mpt3sas is actually exposing hardware
> capabilities, whereas with megaraid_sas (and hpsa) we're limited by the
> hardware implementation to a single completion queue shared between HBA and OS.
> With mpt3sas we're having per-interrupt completion queues (well, for newer
> firmware :-) so we can take advantage of scsi-mq.
>
> (And if someone had done a _proper_ design of the megaraid_sas_fusion thing by
> exposing several submission and completion queues for megaraid_sas itself,
> instead of bolting the existing megaraid_sas single-queue approach on top of
> the mpt3sas multiqueue design, we could have done the same thing there ... sigh)
>
>> The megaraid_sas driver just does indexing from the blk_tag and fires the I/O
>> quickly enough, unlike mpt3sas where we have lock contention at the driver
>> level as the bottleneck.
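
To make the "indexing from blk_tag" point above concrete, a minimal sketch of
the pattern (with made-up my_cmd/my_adapter names, not the actual megaraid_sas
or mpt3sas code) would look roughly like this: the block-layer tag directly
selects a preallocated per-command descriptor, so the submission path needs no
driver spinlock or free-list search.

/*
 * Minimal sketch of tag-based command lookup in a hypothetical driver:
 * the blk-mq tag indexes a preallocated array of command descriptors,
 * so no driver lock or free-list walk is needed on the submission path.
 * 'struct my_cmd' and 'struct my_adapter' are invented names.
 */
#include <linux/types.h>
#include <linux/blkdev.h>
#include <scsi/scsi_cmnd.h>

struct my_cmd {
	u16 smid;			/* firmware message index */
	struct scsi_cmnd *scmd;		/* back-pointer for completion */
};

struct my_adapter {
	struct my_cmd *cmd_table;	/* sized to shost->can_queue at init */
};

static struct my_cmd *my_cmd_from_tag(struct my_adapter *ioc,
				      struct scsi_cmnd *scmd)
{
	/* with a single hardware queue, the tag is unique within the queue depth */
	return &ioc->cmd_table[scmd->request->tag];
}

With several blk-mq hardware queues the per-queue tag alone is no longer unique
host-wide; blk_mq_unique_tag() encodes the hardware queue number as well (see
the sketch at the end of this mail).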
>>
>>> (Not that I didn't try; but lacking a proper backend it's really hard to
>>> evaluate the benefit of those ... spinning HDDs simply don't cut it here)
>>>
>>>> I mean the "[PATCH 08/10] mpt3sas: lockless command submission for scsi-mq"
>>>> patch is improving performance by removing spinlock overhead and
>>>> attempting to get the request using blk_tags.
>>>> Are you seeing a performance improvement if you hard-code nr_hw_queues=1
>>>> in the code changes that are part of "[PATCH 10/10] mpt3sas: scsi-mq
>>>> interrupt steering"?
>>>>
>>> No. The numbers posted above are generated with exactly that patch; the
>>> first line is running with nr_hw_queues=32 and the second line with
>>> nr_hw_queues=1.
>>
>> Thanks Hannes. That clarifies. Can you share the <fio> script you have used?
>>
>> If my understanding is correct, you will see the theoretical maximum of
>> 1.2GB/s if you restrict your workload to a single NUMA node. This is just to
>> understand whether the <mpt3sas> driver spinlocks are adding overhead. We
>> have seen such overhead on multi-socket servers, and it is reasonable to
>> reduce locking in the mpt3sas driver; my only concern is that exposing the
>> HBA as multiple submission queues to blk-mq is really not required, and I am
>> trying to figure out whether doing that has any side effects.
>>
> Well, the HBA has per-MSIx completion queues, so I don't see any issues with
> exposing them.
> blk-mq is designed to handle per-CPU queues, so exposing all hardware queues
> will be beneficial, especially in a low-latency context; and as the experiments
> show, even when connected to external storage there is a benefit to be had.
>
> But exposing all queues might even reduce or even resolve your FW Fault status
> 0x2100 state; with that patch you now have each queue pulling requests off the
> completion queue and updating the reply post host index in parallel, making
> that situation far more unlikely.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                   zSeries & Storage
> hare@xxxxxxx                          +49 911 74053 688
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
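
For reference, a rough sketch of the two pieces discussed in the quoted thread
above -- advertising one blk-mq hardware queue per MSI-X reply queue, and
steering a command to the matching reply queue from its tag. The 'struct my_ioc'
adapter structure and function names are hypothetical; this is not the actual
mpt3sas patch.

/*
 * Rough sketch of per-MSI-X hardware queue exposure; 'struct my_ioc'
 * is a made-up adapter structure, not the real driver code.
 */
#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

struct my_ioc {
	struct Scsi_Host *shost;
	int reply_queue_count;	/* MSI-X vectors granted to the HBA */
};

/* At probe time, before scsi_add_host(): one hw queue per reply queue. */
static void my_set_hw_queues(struct my_ioc *ioc)
{
	ioc->shost->nr_hw_queues = ioc->reply_queue_count;
}

/*
 * In the submission path: derive the blk-mq hardware queue the command
 * was issued on and use it as the MSI-X (reply post queue) index, so the
 * completion comes back on the queue local to the submitting CPU.
 */
static u16 my_msix_index(struct scsi_cmnd *scmd)
{
	u32 unique_tag = blk_mq_unique_tag(scmd->request);

	return blk_mq_unique_tag_to_hwq(unique_tag);
}

A real implementation would also have to size the tag set and spread the MSI-X
vectors across CPUs via interrupt affinity, which is omitted here.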