> On 08/04/2020 10:59, Kashyap Desai wrote:
>
> Hi Kashyap,
>
> >
> >>> We have done some level of testing to know performance impact on SAS
> >>> SSDs and HDD setup. Here is my finding -
> >>> My testing used - Two socket Intel Skylake/Lewisburg/Purley
> >>> Output of numactl --hardware
> >>>
> >>> available: 2 nodes (0-1)
> >>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
> >>> node 0 size: 31820 MB
> >>> node 0 free: 21958 MB
> >>> node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> >>> node 1 size: 32247 MB
> >>> node 1 free: 21068 MB
> >>> node distances:
> >>> node   0   1
> >>>   0:  10  21
> >>>   1:  21  10
>
> Do you have other info, like IRQ-CPU affinity dump and controller PCI
> vendor+device ID? Also /proc/interrupts info would be good after a run,
> like supplied by Sumit here:
>
> https://lore.kernel.org/linux-scsi/CAL2rwxotoWakFS4DPe85hZ4VAgd_zw8pL+B5ckHR9NwEf+-L=g@xxxxxxxxxxxxxx/

The controller performance mode is IOPS, which creates 8 extra reply
queues. In this case the driver will create 72 (online CPUs) + 8 = 80
reply queues (MSI-X vectors). The first 8 vectors are non-managed and they
are mapped to the local NUMA node (node 1).
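
To make the vector arithmetic easy to re-check on another box, here is a
minimal sketch in Python. It is only an illustration, not driver code: the
helper names (expand_cpu_list, reply_queue_count) are made up for this
mail, and the CPU ranges are the numactl lists quoted above written in
range form.

```python
def expand_cpu_list(spec):
    """Expand a range-style CPU list such as "18-35,54-71" into sorted ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def reply_queue_count(online_cpus, extra_queues):
    """One reply queue per online CPU plus the extra non-managed ones."""
    return online_cpus + extra_queues


# CPU lists per NUMA node, taken from the numactl --hardware output
# quoted above (assumption: all listed CPUs are online).
NODE_CPUS = {0: "0-17,36-53", 1: "18-35,54-71"}

# Extra reply queues created in IOPS performance mode, as stated above.
EXTRA_QUEUES = 8

online = sum(len(expand_cpu_list(spec)) for spec in NODE_CPUS.values())
total = reply_queue_count(online, EXTRA_QUEUES)

print("online CPUs:", online)
print("reply queues (MSI-X vectors):", total)
```

The same expand_cpu_list() helper can be pointed at any "cpu list" string
from an affinity dump to count how many CPUs a given vector covers.
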
Here is IRQ-CPU affinity -

irq 256, cpu list 18-35,54-71
irq 257, cpu list 18-35,54-71
irq 258, cpu list 18-35,54-71
irq 259, cpu list 18-35,54-71
irq 260, cpu list 18-35,54-71
irq 261, cpu list 18-35,54-71
irq 262, cpu list 18-35,54-71
irq 263, cpu list 18-35,54-71
irq 264, cpu list 18
irq 265, cpu list 19
irq 266, cpu list 20
irq 267, cpu list 21
irq 268, cpu list 22
irq 269, cpu list 23
irq 270, cpu list 24
irq 271, cpu list 25
irq 272, cpu list 26
irq 273, cpu list 27
irq 274, cpu list 28
irq 275, cpu list 29
irq 276, cpu list 30
irq 277, cpu list 31
irq 278, cpu list 32
irq 279, cpu list 33
irq 280, cpu list 34
irq 281, cpu list 35
irq 282, cpu list 54
irq 283, cpu list 55
irq 284, cpu list 56
irq 285, cpu list 57
irq 286, cpu list 58
irq 287, cpu list 59
irq 288, cpu list 60
irq 289, cpu list 61
irq 290, cpu list 62
irq 291, cpu list 63
irq 292, cpu list 64
irq 293, cpu list 65
irq 294, cpu list 66
irq 295, cpu list 67
irq 296, cpu list 68
irq 297, cpu list 69
irq 298, cpu list 70
irq 299, cpu list 71
irq 300, cpu list 0
irq 301, cpu list 1
irq 302, cpu list 2
irq 303, cpu list 3
irq 304, cpu list 4
irq 305, cpu list 5
irq 306, cpu list 6
irq 307, cpu list 7
irq 308, cpu list 8
irq 309, cpu list 9
irq 310, cpu list 10
irq 311, cpu list 11
irq 312, cpu list 12
irq 313, cpu list 13
irq 314, cpu list 14
irq 315, cpu list 15
irq 316, cpu list 16
irq 317, cpu list 17
irq 318, cpu list 36
irq 319, cpu list 37
irq 320, cpu list 38
irq 321, cpu list 39
irq 322, cpu list 40
irq 323, cpu list 41
irq 324, cpu list 42
irq 325, cpu list 43
irq 326, cpu list 44
irq 327, cpu list 45
irq 328, cpu list 46
irq 329, cpu list 47
irq 330, cpu list 48
irq 331, cpu list 49
irq 332, cpu list 50
irq 333, cpu list 51
irq 334, cpu list 52
irq 335, cpu list 53

>
> Are you enabling some special driver perf mode?
>
> >>>
> >>> 64 HDD setup -
> >>>
> >>> With higher QD
> >> what's OD?

I mean higher Queue Depth (QD).
Higher Queue Depth is required because congestion will happen at the
sdev->queue_depth and shost->can_queue level. If outstanding IO is not
hitting the per-sdev queue depth or the shost can_queue depth, we will not
see the performance issue.

> >>
> >>> and io scheduler = mq-deadline, shared host tag is not scaling well.
> >>> If I use ioscheduler = none, I can see consistent 2.0M IOPs.
> >>> This issue is seen only with RFC. Without RFC mq-deadline scales up to
> >>> 2.0M IOPS.
>
> In theory, from this driver perspective, we should not be making a
> difference. That's after your change to use sdev->device_busy count,
> rather than the hctx nr_active count. As I understand, that's the only
> difference you made.
>
> But I will try an IO scheduler on hisi sas for ssd to see if any
> difference.
>
> >> I didn't try any scheduler. I can have a look at that.
> >>
> >>> Perf Top result of RFC - (IOPS = 1.4M IOPS)
> >>>
> >>>   78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
> >>>    1.46%  [kernel]        [k] sbitmap_any_bit_set
> >>>    1.14%  [kernel]        [k] blk_mq_run_hw_queue
> >>>    0.90%  [kernel]        [k] _mix_pool_bytes
> >>>    0.63%  [kernel]        [k] _raw_spin_lock
> >>>    0.57%  [kernel]        [k] blk_mq_run_hw_queues
> >>>    0.56%  [megaraid_sas]  [k] complete_cmd_fusion
> >>>    0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
> >>>    0.50%  [kernel]        [k] dd_has_work
> >>>    0.38%  [kernel]        [k] _raw_spin_lock_irqsave
> >>>    0.36%  [kernel]        [k] gup_pgd_range
> >>>    0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
> >>>    0.31%  [kernel]        [k] io_submit_one
> >>>    0.29%  [kernel]        [k] hctx_lock
> >>>    0.26%  [kernel]        [k] try_to_grab_pending
> >>>    0.24%  [kernel]        [k] scsi_queue_rq
> >>>    0.22%  fio             [.] __fio_gettime
> >>>    0.22%  [kernel]        [k] insert_work
> >>>    0.20%  [kernel]        [k] native_irq_return_iret
> >>>
> >>> Perf top without RFC driver - (IOPS = 2.0 M IOPS)
> >>>
> >>>   58.40%  [kernel]        [k] native_queued_spin_lock_slowpath
> >>>    2.06%  [kernel]        [k] _mix_pool_bytes
> >>>    1.38%  [kernel]        [k] _raw_spin_lock_irqsave
> >>>    0.97%  [kernel]        [k] _raw_spin_lock
> >>>    0.91%  [kernel]        [k] scsi_queue_rq
> >>>    0.82%  [kernel]        [k] __sbq_wake_up
> >>>    0.77%  [kernel]        [k] _raw_spin_unlock_irqrestore
> >>>    0.74%  [kernel]        [k] scsi_mq_get_budget
> >>>    0.61%  [kernel]        [k] gup_pgd_range
> >>>    0.58%  [kernel]        [k] aio_complete_rw
> >>>    0.52%  [kernel]        [k] elv_rb_add
> >>>    0.50%  [kernel]        [k] llist_add_batch
> >>>    0.50%  [kernel]        [k] native_irq_return_iret
> >>>    0.48%  [kernel]        [k] blk_rq_map_sg
> >>>    0.48%  fio             [.] __fio_gettime
> >>>    0.47%  [kernel]        [k] blk_mq_get_tag
> >>>    0.44%  [kernel]        [k] blk_mq_dispatch_rq_list
> >>>    0.40%  fio             [.] io_u_queued_complete
> >>>    0.39%  fio             [.] get_io_u
> >>>
> >>>
> >>> If you want me to test any top up patch, please let me know. BTW, we
> >>> also want to provide a module parameter for the user to switch back
> >>> to the older nr_hw_queue = 1 mode. I will work on that part.
> >> ok, but I would just like to reiterate the point that you will not see
> >> the full benefit of blk-mq draining hw queues for cpu hotplug since
> >> you hide hw queues from blk-mq.
> > Agree. We have done minimal testing using this RFC. We want to ACK this
> > RFC as long as primary performance goal is achieved.
> >
> > We have done full testing on nr_hw_queue = 1 (and that is what customer
> > is using) so we at least want to give that interface available for
> > customer for some time (assuming they may find some performance gap
> > between two interfaces which we may not have encountered during smoke
> > testing).
> >
> > Over a period of time, if nr_hw_queue = N works for (Broadcom will
> > conduct full performance once RFC is committed in upstream) all the IO
> > profiles, we will share the information with customer about benefit of
> > using nr_hw_queues = N.
>
> Hopefully you can use nr_hw_queues = N always.
>
> Thanks,
> john