On Tue, 2017-08-01 at 18:50 +0800, Ming Lei wrote:
> On Tue, Aug 01, 2017 at 06:17:18PM +0800, Ming Lei wrote:
> > How can we get the accurate 'number of requests in progress' efficiently?

Hello Ming,

How about counting the number of bits that have been set in the tag set?
I am aware that these bits can be set and/or cleared concurrently with the
dispatch code, but that count is probably a good starting point.

> > From my test data of mq-deadline on lpfc, the performance is good,
> > please see it in cover letter.
>
> Forget to mention, ctx->list is one per-cpu list and the lock is percpu
> lock, so changing to this way shouldn't be a performance issue.

Sorry, but I don't consider this reply sufficient. The latency of IB HCAs
is significantly lower than that of any FC hardware I have run performance
measurements on myself. The fact that this patch series improves
performance for lpfc does not guarantee that there won't be a performance
regression for ib_srp, ib_iser or any other low-latency initiator driver
for which q->depth != 0.

Additionally, patch 03/14 most likely introduces a fairness problem.
Shouldn't blk_mq_dispatch_rq_from_ctxs() dequeue requests from the per-CPU
queues in a round-robin fashion instead of always starting at the first
per-CPU queue in hctx->ctx_map?

Thanks,

Bart.
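
P.S. To make the tag-counting suggestion above more concrete, here is a
rough, untested sketch. It relies on the existing blk_mq_tagset_busy_iter()
helper; the function names are made up for illustration, and the result is
only approximate because tags can be allocated and freed while the
iteration runs:

#include <linux/blk-mq.h>

/* Invoked once for every request that currently has a tag allocated. */
static void count_busy_tag(struct request *rq, void *priv, bool reserved)
{
	unsigned int *count = priv;

	(*count)++;
}

/* Approximate number of requests in progress for a tag set. */
static unsigned int approx_tags_in_use(struct blk_mq_tag_set *set)
{
	unsigned int count = 0;

	blk_mq_tagset_busy_iter(set, count_busy_tag, &count);
	return count;
}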
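
And for the fairness concern, one possible shape of a round-robin dequeue,
again untested and not meant as an actual patch. It assumes a start-aware
variant of the sbitmap iterator (something like __sbitmap_for_each_set(),
which may have to be added) and a new 'dispatch_cursor' field in struct
blk_mq_hw_ctx that records where the previous scan found a request:

struct dispatch_rq_data {
	struct blk_mq_hw_ctx *hctx;
	struct request *rq;
};

static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
				 void *data)
{
	struct dispatch_rq_data *d = data;
	struct blk_mq_ctx *ctx = d->hctx->ctxs[bitnr];

	spin_lock(&ctx->lock);
	if (!list_empty(&ctx->rq_list)) {
		d->rq = list_first_entry(&ctx->rq_list, struct request,
					 queuelist);
		list_del_init(&d->rq->queuelist);
		if (list_empty(&ctx->rq_list))
			sbitmap_clear_bit(sb, bitnr);
	}
	spin_unlock(&ctx->lock);

	/* Returning false stops the iteration. */
	return !d->rq;
}

static struct request *dispatch_rq_round_robin(struct blk_mq_hw_ctx *hctx)
{
	struct dispatch_rq_data data = { .hctx = hctx };

	/*
	 * Start scanning just past the software queue we dispatched from
	 * last time so that no per-CPU queue can starve the others.
	 */
	__sbitmap_for_each_set(&hctx->ctx_map, hctx->dispatch_cursor,
			       dispatch_rq_from_ctx, &data);
	if (data.rq)
		hctx->dispatch_cursor =
			(data.rq->mq_ctx->index_hw + 1) % hctx->nr_ctx;
	return data.rq;
}

The cursor update is racy if multiple CPUs run the dispatch code for the
same hctx concurrently, but for fairness purposes an occasionally stale
cursor should be harmless.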