On Thu, Oct 11, 2018 at 03:07:33PM +0100, John Garry wrote:
> On 11/10/2018 14:32, Ming Lei wrote:
> > On Thu, Oct 11, 2018 at 02:12:11PM +0100, John Garry wrote:
> > > On 11/10/2018 11:15, Christoph Hellwig wrote:
> > > > On Thu, Oct 11, 2018 at 10:59:11AM +0100, John Garry wrote:
> > > > > > blk-mq tags are always per-host (which has actually caused
> > > > > > problems for ATA, which is now using its own per-device tags).
> > > > > >
> > > > > So, for example, if Scsi_host.can_queue = 2048 and
> > > > > Scsi_host.nr_hw_queues = 16, then rq tags are still in the range
> > > > > [0, 2048) for that HBA, i.e. invariant on queue count?
> > > >
> > > > Yes, if can_queue is 2048 you will get tags from 0..2047.
> > > >
> > > I should be clear about some things before discussing this further.
> > > Our device has 16 hw queues, and each command we send to any queue in
> > > the device must have a tag that is unique across all hw queues for
> > > that device, in the range [0, 2048) - it's called an IPTT. So
> > > Scsi_host.can_queue = 2048.
> >
> > Could you describe a bit about IPTT?
> >
> IPTT is an "Initiator Tag". It is a tag that maps to the context of a hw
> queue command. It is related to the SAS protocol Initiator Port tag; I
> think that most SAS HBAs have a similar concept.
>
> IPTTs are limited and must be recycled when an IO completes. Our hw
> supports up to 2048, so we have a limit of 2048 commands issued at any
> point in time.
>
> Previously we had been managing the IPTT in the LLDD, but we found that
> the rq tag can be used as the IPTT (as in 6/7), which gave a good
> performance boost.
>
> > Looks like the 16 hw queues are like the reply queues in other drivers,
> > such as megaraid_sas, but given that all 16 reply queues share one
> > tagset, the hw queue number has to be 1 from blk-mq's view.
> >
> > > However, today we only expose a single queue to the upper layer (due
> > > to an unrelated LLDD error handling restriction). We hope to expose
> > > all 16 queues in the future, which is what I meant by "enabling SCSI
> > > MQ in the driver". However, with 6/7, this creates a problem, below.
> >
> > If the tag of each request from all hw queues has to be unique, you
> > can't expose all 16 queues.
>
> Well, we can if we generate and manage the IPTT in the LLDD, as we had
> been doing. If we want to use the rq tag - which 6/7 is for - then we
> can't.

In theory, you could still generate and manage the IPTT in the LLDD by
simply ignoring rq->tag while enabling SCSI_MQ with 16 hw queues. However,
I am not sure how much that would improve performance, and it may even
degrade IO perf: once 16 hw queues are exposed to blk-mq, up to
16 * .can_queue requests may be queued to the driver, and allocation & free
on the single IPTT pool will become a bottleneck.

Per my experiments with a host-wide tagset, it can be a good tradeoff to
allocate one hw queue per NUMA node to avoid remote access to the dispatch
data/request structures. But your IPTT pool would still be shared across
all CPUs, so maybe you can try the smart sbitmap (a rough sketch is in the
P.S. below):

https://www.spinics.net/lists/linux-scsi/msg117920.html

> > > >
> > > > IFF your device needs different tags for different queues it can
> > > > use the blk_mq_unique_tag helper to generate a unique global tag.
> > >
> > > So this helper can't help, as fundamentally the issue is that the tag
> > > field in struct request is unique per hardware queue but not across
> > > all hw queues. Indeed, blk_mq_unique_tag() does give a unique global
> > > tag, but it cannot be used for the IPTT.
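Right - for reference, blk_mq_unique_tag() just packs the hw queue index
into the high 16 bits of a u32 and the per-queue tag into the low 16 bits,
so with 16 hw queues the values are sparse in a 32-bit space rather than a
dense [0, 2048) range. A minimal sketch of how an LLDD would decode it (the
function name here is hypothetical):

	#include <linux/blk-mq.h>
	#include <scsi/scsi_cmnd.h>

	/* Hypothetical example: split the global tag back into (hwq, tag). */
	static void example_decode_unique_tag(struct scsi_cmnd *cmd)
	{
		u32 unique = blk_mq_unique_tag(cmd->request);
		u16 hwq = blk_mq_unique_tag_to_hwq(unique);	/* high 16 bits */
		u16 tag = blk_mq_unique_tag_to_tag(unique);	/* low 16 bits */

		/*
		 * hwq is in [0, nr_hw_queues) and tag is in [0, per-queue
		 * depth); the combined value is globally unique but is not
		 * a dense per-host index, hence unusable as an IPTT.
		 */
		pr_debug("hwq=%u tag=%u\n", hwq, tag);
	}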
> > > OTOH, we could expose 16 queues to the upper layer and drop 6/7, but
> > > we found that it performs worse.
> >
> > We discussed this issue before, but have not found a good solution yet
> > for exposing multiple hw queues to blk-mq.
>
> I just think that it's unfortunate that enabling blk-mq means that the
> LLDD loses this unique tag across all queues in the range
> [0, Scsi_host.can_queue), so much so that we found performance better by
> not exposing multiple queues and continuing to use the single rq tag...

It isn't a new problem; we discussed it a lot for megaraid_sas, which is in
the same situation as yours. You can find the thread in the linux-block
list archives. Kashyap Desai ran lots of tests on this case.

> > However, we still get good performance in the case of the none
> > scheduler with the following patches:
> >
> > 8824f62246be blk-mq: fail the request in case issue failure
> > 6ce3dd6eec11 blk-mq: issue directly if hw queue isn't busy in case of 'none'
> >
> I think that these patches would have been included in our testing. I
> need to check.

Please switch to the none io scheduler in your test; with these patches,
IO perf was observed to become good on megaraid_sas.

Thanks,
Ming
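P.S. To make the single-IPTT-pool point above concrete, here is a rough
sketch of what a per-host IPTT allocator built on the existing sbitmap code
could look like (the hisi_* names are hypothetical and error handling is
trimmed; see include/linux/sbitmap.h):

	#include <linux/sbitmap.h>

	#define HISI_MAX_IPTT	2048	/* hw limit of in-flight commands */

	static struct sbitmap_queue iptt_pool;

	/*
	 * Once at probe time: shift = -1 lets sbitmap pick a default word
	 * granularity, and allocating on the HBA's node keeps the bitmap
	 * local to it.
	 */
	static int hisi_iptt_init(int node)
	{
		return sbitmap_queue_init_node(&iptt_pool, HISI_MAX_IPTT, -1,
					       false, GFP_KERNEL, node);
	}

	/*
	 * Per command: allocate a free IPTT in [0, HISI_MAX_IPTT), or -1
	 * if the pool is exhausted. *cpu is recorded for the matching
	 * clear below.
	 */
	static int hisi_iptt_get(unsigned int *cpu)
	{
		return sbitmap_queue_get(&iptt_pool, cpu);
	}

	/* On completion: recycle the IPTT and wake up any waiters. */
	static void hisi_iptt_put(int iptt, unsigned int cpu)
	{
		sbitmap_queue_clear(&iptt_pool, iptt, cpu);
	}

The pool is still one structure shared by all CPUs, which is exactly the
contention the "smart sbitmap" work linked above tries to reduce.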