Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ

On 08/04/2020 10:59, Kashyap Desai wrote:

Hi Kashyap,


We have done some level of testing to understand the performance impact on SAS SSD and HDD setups. Here are my findings. My test setup was a two-socket Intel Skylake/Lewisburg/Purley system. Output of numactl --hardware:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
node 0 size: 31820 MB
node 0 free: 21958 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 32247 MB
node 1 free: 21068 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Do you have other info, like IRQ-CPU affinity dump and controller PCI vendor+device ID? Also /proc/interrupts info would be good after a run, like supplied by Sumit here:

https://lore.kernel.org/linux-scsi/CAL2rwxotoWakFS4DPe85hZ4VAgd_zw8pL+B5ckHR9NwEf+-L=g@xxxxxxxxxxxxxx/

Are you enabling some special driver perf mode?
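
In case it helps with gathering that: below is a small throwaway userspace helper I would use for this sort of thing - not part of the series, and the name is made up - which walks /proc/irq/ and prints each IRQ's effective_affinity_list, so the MSI-X vector to CPU mapping can be checked against the blk-mq queue map:

/* dump_irq_affinity.c - illustrative helper only, not from the patch series */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *d = opendir("/proc/irq");
	struct dirent *de;
	char path[256], line[256];
	FILE *f;

	if (!d)
		return 1;

	while ((de = readdir(d)) != NULL) {
		/* only numeric entries are per-IRQ directories */
		if (de->d_name[0] < '0' || de->d_name[0] > '9')
			continue;

		snprintf(path, sizeof(path),
			 "/proc/irq/%s/effective_affinity_list", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(line, sizeof(line), f))
			printf("irq %s -> cpus %s", de->d_name, line);
		fclose(f);
	}
	closedir(d);
	return 0;
}

Running that alongside a snapshot of /proc/interrupts after the fio run would show whether completions are landing on the submitting CPUs.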



64 HDD setup -

With higher QD
what's QD here?

and io scheduler = mq-deadline, shared host tag is not scaling well. If I use io scheduler = none, I can see consistent 2.0M IOPS.
This issue is seen only with the RFC. Without the RFC, mq-deadline scales up to 2.0M IOPS.

In theory, from this driver's perspective, we should not be seeing a difference. That's after your change to use the sdev->device_busy count rather than the hctx nr_active count, which, as I understand it, is the only change you made.

But I will try an IO scheduler on hisi_sas with SSDs to see if there is any difference.

I didn't try any scheduler. I can have a look at that.
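
For reference, this is roughly the distinction I mean between the two counts - illustrative only, not your actual change, and both helper names are invented:

#include <linux/blk-mq.h>
#include <scsi/scsi_device.h>

/* per-LUN view: commands in flight on this scsi_device, across all hw queues */
static inline bool example_sdev_saturated(struct scsi_device *sdev)
{
	return atomic_read(&sdev->device_busy) >= sdev->queue_depth;
}

/* per-hctx view: only requests dispatched via this particular hw queue */
static inline bool example_hctx_saturated(struct blk_mq_hw_ctx *hctx,
					  unsigned int depth)
{
	return atomic_read(&hctx->nr_active) >= depth;
}

Since the per-device count follows the LUN regardless of which hw queue the request went through, gating on it should behave the same whether we expose one hw queue or one per MSI-X vector, which is why I would not expect the scheduler result to change because of that alone.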

Perf top result with the RFC (1.4M IOPS):

      78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
       1.46%  [kernel]        [k] sbitmap_any_bit_set
       1.14%  [kernel]        [k] blk_mq_run_hw_queue
       0.90%  [kernel]        [k] _mix_pool_bytes
       0.63%  [kernel]        [k] _raw_spin_lock
       0.57%  [kernel]        [k] blk_mq_run_hw_queues
       0.56%  [megaraid_sas]  [k] complete_cmd_fusion
       0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
       0.50%  [kernel]        [k] dd_has_work
       0.38%  [kernel]        [k] _raw_spin_lock_irqsave
       0.36%  [kernel]        [k] gup_pgd_range
       0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
       0.31%  [kernel]        [k] io_submit_one
       0.29%  [kernel]        [k] hctx_lock
       0.26%  [kernel]        [k] try_to_grab_pending
       0.24%  [kernel]        [k] scsi_queue_rq
       0.22%  fio             [.] __fio_gettime
       0.22%  [kernel]        [k] insert_work
       0.20%  [kernel]        [k] native_irq_return_iret

Perf top without the RFC driver (2.0M IOPS):

      58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
       2.06%  [kernel]          [k] _mix_pool_bytes
       1.38%  [kernel]          [k] _raw_spin_lock_irqsave
       0.97%  [kernel]          [k] _raw_spin_lock
       0.91%  [kernel]          [k] scsi_queue_rq
       0.82%  [kernel]          [k] __sbq_wake_up
       0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
       0.74%  [kernel]          [k] scsi_mq_get_budget
       0.61%  [kernel]          [k] gup_pgd_range
       0.58%  [kernel]          [k] aio_complete_rw
       0.52%  [kernel]          [k] elv_rb_add
       0.50%  [kernel]          [k] llist_add_batch
       0.50%  [kernel]          [k] native_irq_return_iret
       0.48%  [kernel]          [k] blk_rq_map_sg
       0.48%  fio               [.] __fio_gettime
       0.47%  [kernel]          [k] blk_mq_get_tag
       0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
       0.40%  fio               [.] io_u_queued_complete
       0.39%  fio               [.] get_io_u


If you want me to test any top-up patch, please let me know. BTW, we also want to provide a module parameter for the user to switch back to the older nr_hw_queues = 1 mode. I will work on that part.
ok, but I would just like to reiterate the point that you will not see the full benefit of blk-mq draining hw queues for cpu hotplug, since you hide hw queues from blk-mq.
Agree. We have done minimal testing using this RFC. We want to ACK this RFC as long as the primary performance goal is achieved.

We have done full testing on nr_hw_queues = 1 (and that is what customers are using), so we at least want to keep that interface available to customers for some time (they may find some performance gap between the two interfaces which we have not encountered during smoke testing). Over a period of time, if nr_hw_queues = N works for all the IO profiles (Broadcom will conduct full performance testing once the RFC is committed upstream), we will share the information with customers about the benefit of using nr_hw_queues = N.

Hopefully you can use nr_hw_queues = N always.
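
If it is useful, something along these lines is roughly what I would picture for that switch - a sketch only, assuming the host_tagset flag from this series, with the parameter name and helper invented for illustration:

#include <linux/module.h>
#include <scsi/scsi_host.h>

static bool host_tagset_enable = true;
module_param(host_tagset_enable, bool, 0444);
MODULE_PARM_DESC(host_tagset_enable,
		 "Enable shared host tagset (nr_hw_queues = N). Default: 1 (enabled)");

/* hypothetical helper called at probe time to pick the queue topology */
static void example_setup_hw_queues(struct Scsi_Host *shost,
				    unsigned int nr_msix_vectors)
{
	if (host_tagset_enable) {
		/* one blk-mq hw queue per MSI-X reply queue, all sharing host tags */
		shost->host_tagset = 1;
		shost->nr_hw_queues = nr_msix_vectors;
	} else {
		/* legacy behaviour: single hw queue, driver steers I/O internally */
		shost->nr_hw_queues = 1;
	}
}

Defaulting it to the shared tagset mode would at least encourage nr_hw_queues = N as the normal case while still giving users a way back during the transition.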



Thanks,
john


