Hi Martin,

On Thu, Nov 21, 2019 at 09:59:53PM -0500, Martin K. Petersen wrote:
>
> Ming,
>
> > I don't understand the motivation of ramp-up/ramp-down, maybe it is
> > just for fairness among LUNs.
>
> Congestion control. Devices have actual, physical limitations that
> are different from the tag context limitations on the HBA. You don't
> have that problem on NVMe because (at least for PCIe) the storage
> device and the controller are one and the same.
>
> If you submit 100000 concurrent requests to a SCSI drive that does
> 100 IOPS, some requests will time out before they get serviced.
> Consequently we have the ability to raise and lower the queue depth
> to constrain the amount of requests in flight to a given device at
> any point in time.

blk-mq already puts a limit on each LUN: the number is
host_queue_depth / nr_active_LUNs, see hctx_may_queue(). (A rough
sketch of that check is appended at the end of this mail.)

This seems to work for NVMe, which is why I am trying to bypass
.device_busy for SSDs, since that counter is too expensive on fast
storage. Hannes even wants to kill it completely.

>
> Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to
> back off. We frequently see issues where the host can submit burst
> I/O much faster than the device can de-stage from cache. In that
> scenario the device reports BUSY/QF/TSF and we will back off so the
> device gets a chance to recover. If we just let the application
> submit new I/O without bounds, the system would never actually
> recover.
>
> Note that the actual, physical limitations for how many commands a
> target can handle are typically much, much lower than the number of
> tags the HBA can manage. SATA devices can only express 32 concurrent
> commands. SAS devices typically 128 concurrent commands per
> port. Arrays differ.

I understand that SATA's host queue depth is set to 32. But a SAS
HBA's queue depth is often big, so do we rely on .device_busy to
throttle requests to SAS devices?

>
> If we ignore the RAID controller use case where the controller
> internally queues and arbitrates commands between many devices, how
> is submitting 1000 concurrent requests to a device which only has
> 128 command slots going to work?

For SSDs I guess it might be fine, given that NVMe usually sets the
per-hw-queue depth to 1023. That means the number of concurrent
requests can be as many as 1023 * nr_hw_queues in the single-namespace
case.

>
> Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
> blindly stop restricting the number of I/Os in flight in the ML, we
> may exceed either the capabilities of what the transport protocol can
> express or internal device resources.

OK, one conservative approach may be to bypass .device_busy only for
SSDs attached to some high-end HBAs.

Or maybe we can wire up sdev->queue_depth with the block layer's
scheduler queue depth? One issue is that sdev->queue_depth may be
updated at runtime. (A hypothetical sketch of that wiring is appended
below as well.)

Thanks,
Ming
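
For reference, here is roughly what the shared-tag fairness check in
hctx_may_queue() looks like (simplified and quoted from memory, so it
may not match the exact code in any given tree):

/*
 * Rough sketch of the shared-tag fairness check: when a tag set is
 * shared by several request queues (LUNs), each active queue only
 * gets about its fair share of the host-wide tag space.
 */
static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
				  struct sbitmap_queue *bt)
{
	unsigned int depth, users;

	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
		return true;
	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
		return true;

	users = atomic_read(&hctx->tags->active_queues);
	if (!users)
		return true;

	/*
	 * Allow each active queue roughly
	 * host_queue_depth / nr_active_LUNs tags, with a small minimum
	 * so nobody starves.
	 */
	depth = max((bt->sb.depth + users - 1) / users, 4U);

	return atomic_read(&hctx->nr_active) < depth;
}

So the per-LUN share shrinks as more LUNs become active, but it is
derived from the host tag space, not from what the target itself can
actually handle.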
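
And for the sdev->queue_depth idea, something along these lines is
what I mean -- purely a hypothetical, untested sketch of how
scsi_change_queue_depth() could be extended; note that
blk_mq_update_nr_requests() is internal to the block layer today, so
it would need to be exported (or wrapped) for this to build:

int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
{
	if (depth > 0) {
		sdev->queue_depth = depth;
		wmb();
	}

	if (sdev->request_queue) {
		/* existing: advertise the depth for write-back throttling */
		blk_set_queue_depth(sdev->request_queue, depth);

		/*
		 * hypothetical: cap the blk-mq scheduler queue depth at
		 * the LUN queue depth, so the scheduler tags throttle
		 * per-LUN submissions instead of sdev->device_busy
		 * (error handling omitted in this sketch)
		 */
		blk_mq_update_nr_requests(sdev->request_queue, depth);
	}

	return sdev->queue_depth;
}

That would still need to cover later ramp-up/ramp-down changes, since
sdev->queue_depth can be updated at runtime as mentioned above.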