Re: [PATCH 00/10] mpt3sas: full mq support

On 02/13/2017 07:15 AM, Sreekanth Reddy wrote:
> On Fri, Feb 10, 2017 at 12:29 PM, Hannes Reinecke <hare@xxxxxxx> wrote:
>> On 02/10/2017 05:43 AM, Sreekanth Reddy wrote:
>>> On Thu, Feb 9, 2017 at 6:42 PM, Hannes Reinecke <hare@xxxxxxx> wrote:
>>>> On 02/09/2017 02:03 PM, Sreekanth Reddy wrote:
>> [ .. ]
>>>>>
>>>>>
>>>>> Hannes,
>>>>>
>>>>> I have created a md raid0 with 4 SAS SSD drives using below command,
>>>>> #mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdg /dev/sdh
>>>>> /dev/sdi /dev/sdj
>>>>>
>>>>> And here is 'mdadm --detail /dev/md0' command output,
>>>>> --------------------------------------------------------------------------------------------------------------------------
>>>>> /dev/md0:
>>>>>         Version : 1.2
>>>>>   Creation Time : Thu Feb  9 14:38:47 2017
>>>>>      Raid Level : raid0
>>>>>      Array Size : 780918784 (744.74 GiB 799.66 GB)
>>>>>    Raid Devices : 4
>>>>>   Total Devices : 4
>>>>>     Persistence : Superblock is persistent
>>>>>
>>>>>     Update Time : Thu Feb  9 14:38:47 2017
>>>>>           State : clean
>>>>>  Active Devices : 4
>>>>> Working Devices : 4
>>>>>  Failed Devices : 0
>>>>>   Spare Devices : 0
>>>>>
>>>>>      Chunk Size : 512K
>>>>>
>>>>>            Name : host_name
>>>>>            UUID : b63f9da7:b7de9a25:6a46ca00:42214e22
>>>>>          Events : 0
>>>>>
>>>>>     Number   Major   Minor   RaidDevice State
>>>>>        0       8       96        0      active sync   /dev/sdg
>>>>>        1       8      112        1      active sync   /dev/sdh
>>>>>        2       8      144        2      active sync   /dev/sdj
>>>>>        3       8      128        3      active sync   /dev/sdi
>>>>> ------------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> Then I used the fio profile below to run 4K sequential read operations
>>>>> with the nr_hw_queues=1 driver and with the nr_hw_queues=24 driver (my
>>>>> system has two NUMA nodes, each with 12 CPUs).
>>>>> -----------------------------------------------------
>>>>> [global]
>>>>> ioengine=libaio
>>>>> group_reporting
>>>>> direct=1
>>>>> rw=read
>>>>> bs=4k
>>>>> allow_mounted_write=0
>>>>> iodepth=128
>>>>> runtime=150s
>>>>>
>>>>> [job1]
>>>>> filename=/dev/md0
>>>>> -----------------------------------------------------
>>>>>
>>>>> Here are the fio results for nr_hw_queues=1 (i.e. a single request
>>>>> queue) with various job counts:
>>>>> 1JOB 4k read  : io=213268MB, bw=1421.8MB/s, iops=363975, runt=150001msec
>>>>> 2JOBs 4k read : io=309605MB, bw=2064.2MB/s, iops=528389, runt=150001msec
>>>>> 4JOBs 4k read : io=281001MB, bw=1873.4MB/s, iops=479569, runt=150002msec
>>>>> 8JOBs 4k read : io=236297MB, bw=1575.2MB/s, iops=403236, runt=150016msec
>>>>>
>>>>> Here are the fio results for nr_hw_queues=24 (i.e. multiple request
>>>>> queues) with various job counts:
>>>>> 1JOB 4k read   : io=95194MB, bw=649852KB/s, iops=162463, runt=150001msec
>>>>> 2JOBs 4k read : io=189343MB, bw=1262.3MB/s, iops=323142, runt=150001msec
>>>>> 4JOBs 4k read : io=314832MB, bw=2098.9MB/s, iops=537309, runt=150001msec
>>>>> 8JOBs 4k read : io=277015MB, bw=1846.8MB/s, iops=472769, runt=150001msec
>>>>>
>>>>> Here we can see that at lower job counts the single request queue
>>>>> (nr_hw_queues=1) gives more IOPS than the multiple request queues
>>>>> (nr_hw_queues=24).
>>>>>
>>>>> Can you please share your fio profile, so that I can try the same
>>>>> thing on my system?
>>>>>
>>>> Have you tried with the latest git update from Jens for-4.11/block (or
>>>> for-4.11/next) branch?
>>>
>>> I am using the git repo below,
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/mkp/scsi.git/log/?h=4.11/scsi-queue
>>>
>>> Today I will try with Jens for-4.11/block.
>>>
>> By all means, do.
>>
>>>> I've found that using the mq-deadline scheduler gives a noticeable
>>>> performance boost.
>>>>
>>>> The fio job I'm using is essentially the same; you should just make sure
>>>> to specify a 'numjobs=' statement in there.
>>>> Otherwise fio will just use a single CPU, which of course leads to
>>>> adverse effects in the multiqueue case.
>>>
>>> Yes, I am providing 'numjobs=' on the fio command line as shown below,
>>>
>>> # fio md_fio_profile --numjobs=8 --output=fio_results.txt
>>>
>> Still, it looks as if you're using fewer jobs than you have CPUs,
>> which means you'll be running into a tag-starvation scenario on those
>> CPUs, especially for the small block sizes.
>> What are the results if you set 'numjobs' to the number of CPUs?
>>
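
(Concretely, setting numjobs equal to the number of CPUs amounts to
something like the invocation below, reusing the job file name from this
thread; the output file name is arbitrary, and $(nproc) simply expands to
the CPU count.)

# fio md_fio_profile --numjobs=$(nproc) --output=fio_results_allcpus.txt
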
> 
> Hannes,
> 
> I tried Jens' for-4.11/block kernel repo and also set each block PD's
> scheduler to 'mq-deadline'; here are my results for 4K sequential reads
> on md0 (raid0 with 4 drives). I have 24 CPUs, so I also tried setting
> numjobs=24.
> 
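
For anyone reproducing this: the per-device scheduler can be switched at
runtime via sysfs. A minimal sketch, assuming the array members are still
sdg-sdj as in the mdadm output above:

-----------------------------------------------------
# switch each md member device to the mq-deadline scheduler
for dev in sdg sdh sdi sdj; do
        echo mq-deadline > /sys/block/$dev/queue/scheduler
done
# verify; the active scheduler is shown in brackets
cat /sys/block/sdg/queue/scheduler
-----------------------------------------------------
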
> fio results for nr_hw_queues=1 (i.e. a single request queue) with
> various job counts:
> 
> 4k read when numjobs=1 : io=215553MB, bw=1437.9MB/s, iops=367874,
> runt=150001msec
> 4k read when numjobs=2 : io=307771MB, bw=2051.9MB/s, iops=525258,
> runt=150001msec
> 4k read when numjobs=4 : io=300382MB, bw=2002.6MB/s, iops=512644,
> runt=150002msec
> 4k read when numjobs=8 : io=320609MB, bw=2137.4MB/s, iops=547162,
> runt=150003msec
> 4k read when numjobs=24: io=275701MB, bw=1837.1MB/s, iops=470510,
> runt=150006msec
> 
> fio results for nr_hw_queues=24 (i.e. multiple request queues) with
> various job counts:
> 
> 4k read when numjobs=1 : io=177600MB, bw=1183.2MB/s, iops=303102,
> runt=150001msec
> 4k read when numjobs=2 : io=182416MB, bw=1216.1MB/s, iops=311320,
> runt=150001msec
> 4k read when numjobs=4 : io=347553MB, bw=2316.2MB/s, iops=593149,
> runt=150002msec
> 4k read when numjobs=8 : io=349995MB, bw=2333.3MB/s, iops=597312,
> runt=150003msec
> 4k read when numjobs=24: io=350618MB, bw=2337.4MB/s, iops=598359,
> runt=150007msec
> 
> With fewer jobs the single queue performs better, whereas with more
> jobs multi-queue performs better.
> 
Thank you for these numbers. They fit very well with my results.

So it's as I suspected: with more parallelism we do gain from
multiqueue, and with single-issue processes we suffer a performance
penalty.

However, I strongly suspect that this is an issue with block-mq itself,
and not so much with mpt3sas.
The reason is that block-mq needs to split the tag space into distinct
ranges, one per queue, and hence hits tag starvation far earlier the
more queues are registered.
block-mq _can_ work around this by moving the issuing process onto
another CPU (and thus use the tag space from there), but this involves
calling 'schedule' in the hot path, which might well account for the
performance drop here.
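
As a rough illustration (assuming one of the array members, sdg, and a
blk-mq kernel of this vintage; the exact sysfs attribute names may differ
between versions), the per-queue tag depth and CPU mapping can be read
back like this:

-----------------------------------------------------
# show how the tag space is carved up across the hardware queues of sdg;
# each CPU only ever sees the (much smaller) tag share of its own queue
for q in /sys/block/sdg/mq/*/; do
        echo "$q: $(cat ${q}nr_tags) tags, CPUs $(cat ${q}cpu_list)"
done
-----------------------------------------------------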

I will be doing more tests with a high nr_hw_queues count and a low I/O
issuer count (a sketch of such a run is below); my guess is that it's the
block layer which is performing suboptimally here.
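
Roughly something along these lines, reusing the job file from this
thread, with the driver loaded for its maximum number of hw queues and a
deliberately low numjobs (the output file names are just placeholders):

-----------------------------------------------------
# single- and dual-issuer runs against a many-queue configuration
fio md_fio_profile --numjobs=1 --output=fio_manyq_numjobs1.txt
fio md_fio_profile --numjobs=2 --output=fio_manyq_numjobs2.txt
-----------------------------------------------------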
In any case, we will be discussing blk-mq performance at LSF/MM this
year; I will be bringing up the poor single-queue performance there.

At the end of the day, I strongly suspect that every self-respecting
process doing heavy I/O already _is_ multithreaded, so I would not try
to optimize for the single-queue case.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@xxxxxxx			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


