On Sun, Jun 02, 2019 at 02:42:02PM +0800, Ming Lei wrote:
> Hi Kashyap,
>
> Thanks for your test.
>
> On Sun, Jun 02, 2019 at 03:11:26AM +0530, Kashyap Desai wrote:
> > > SCSI's reply queue is very similar to blk-mq's hw queue: both are
> > > assigned by IRQ vector, so map the private reply queue into blk-mq's
> > > hw queue via .host_tagset.
> > >
> > > Then the private reply mapping can be removed.
> > >
> > > Another benefit is that the request/irq lost issue may be solved in a
> > > generic way, because managed IRQs may be shut down during CPU hotplug.
> >
> > Ming,
> >
> > I quickly tested this patch series on a MegaRaid Aero controller. Without
> > this patch I can get 3.0M IOPS, but once I apply this patch I see only
> > 1.2M IOPS (performance drops to 40% of the baseline).
> > The HBA supports a can_queue of 5089.
> >
> > <perf top> output without the patch:
> >
> >     3.39%  [megaraid_sas]  [k] complete_cmd_fusion
> >     3.36%  [kernel]        [k] scsi_queue_rq
> >     3.26%  [kernel]        [k] entry_SYSCALL_64
> >     2.57%  [kernel]        [k] syscall_return_via_sysret
> >     1.95%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
> >     1.88%  [kernel]        [k] _raw_spin_lock_irqsave
> >     1.79%  [kernel]        [k] gup_pmd_range
> >     1.73%  [kernel]        [k] _raw_spin_lock
> >     1.68%  [kernel]        [k] __sched_text_start
> >     1.19%  [kernel]        [k] irq_entries_start
> >     1.13%  [kernel]        [k] scsi_dec_host_busy
> >     1.08%  [kernel]        [k] aio_complete
> >     1.07%  [kernel]        [k] read_tsc
> >     1.01%  [kernel]        [k] blk_mq_get_request
> >     0.93%  [kernel]        [k] __update_load_avg_cfs_rq
> >     0.92%  [kernel]        [k] aio_read_events
> >     0.91%  [kernel]        [k] lookup_ioctx
> >     0.91%  fio             [.] fio_gettime
> >     0.87%  [kernel]        [k] set_next_entity
> >     0.87%  [megaraid_sas]  [k] megasas_build_ldio_fusion
> >
> > <perf top> output with the patch:
> >
> >    11.30%  [kernel]        [k] native_queued_spin_lock_slowpath
>
> I guess there must be a global lock taken in the megaraid submission path.
> Could you run 'perf record -g -a' to see which lock it is and what the
> stack trace looks like?

Meanwhile, please try the following patch and see if it makes a difference.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 49d73d979cb3..d2abec3b0f60 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -589,7 +589,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	 * So complete IO reqeust in softirq context in case of single queue
 	 * for not degrading IO performance by irqsoff latency.
 	 */
-	if (q->nr_hw_queues == 1) {
+	if (q->nr_hw_queues == 1 || (rq->mq_hctx->flags & BLK_MQ_F_HOST_TAGS)) {
 		__blk_complete_request(rq);
 		return;
 	}
@@ -1977,7 +1977,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		/* bypass scheduler for flush rq */
 		blk_insert_flush(rq);
 		blk_mq_run_hw_queue(data.hctx, true);
-	} else if (plug && (q->nr_hw_queues == 1 || q->mq_ops->commit_rqs)) {
+	} else if (plug && (q->nr_hw_queues == 1 || q->mq_ops->commit_rqs ||
+			    (data.hctx->flags & BLK_MQ_F_HOST_TAGS))) {
 		/*
 		 * Use plugging if we have a ->commit_rqs() hook as well, as
 		 * we know the driver uses bd->last in a smart fashion.

thanks,
Ming
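
For reference, a minimal sketch of how a driver could opt into the host-wide
tags that both hunks above test. It assumes BLK_MQ_F_HOST_TAGS is set in the
tag set's flags like the other BLK_MQ_F_* flags and then propagated into
hctx->flags; the ops structure and function names are hypothetical, not the
actual megaraid_sas conversion.

#include <linux/blk-mq.h>
#include <linux/numa.h>
#include <linux/string.h>

/* Hypothetical ops; .queue_rq etc. elided for brevity. */
static const struct blk_mq_ops example_mq_ops;

/*
 * Sketch: expose one hw queue per reply queue / IRQ vector, but keep a
 * single host-wide tag space sized to the HBA's can_queue (5089 above).
 * BLK_MQ_F_HOST_TAGS in set->flags is what ends up in hctx->flags and is
 * what the two hunks in the patch above check.
 */
static int example_init_tag_set(struct blk_mq_tag_set *set,
				unsigned int nr_reply_queues,
				unsigned int can_queue)
{
	memset(set, 0, sizeof(*set));
	set->ops		= &example_mq_ops;
	set->nr_hw_queues	= nr_reply_queues;	/* one per IRQ vector */
	set->queue_depth	= can_queue;		/* shared across hw queues */
	set->numa_node		= NUMA_NO_NODE;
	set->flags		= BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_HOST_TAGS;

	return blk_mq_alloc_tag_set(set);
}

The point of the sketch is only that nr_hw_queues still matches the number of
reply queues / IRQ vectors, while queue_depth stays at the host-wide can_queue
instead of being split per hw queue.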