Ming,

> I don't understand the motivation of ramp-up/ramp-down, maybe it is
> just for fairness among LUNs.

Congestion control. Devices have actual, physical limitations that are
different from the tag context limitations on the HBA. You don't have
that problem on NVMe because (at least for PCIe) the storage device and
the controller are one and the same.

If you submit 100000 concurrent requests to a SCSI drive that does 100
IOPS, some requests will time out before they get serviced.
Consequently we have the ability to raise and lower the queue depth to
constrain the number of requests in flight to a given device at any
point in time.

Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
off. We frequently see issues where the host can submit burst I/O much
faster than the device can de-stage from cache. In that scenario the
device reports BUSY/QF/TSF and we will back off so the device gets a
chance to recover. If we just let the application submit new I/O
without bounds, the system would never actually recover.

Note that the actual, physical limitations for how many commands a
target can handle are typically much, much lower than the number of
tags the HBA can manage. SATA devices can only express 32 concurrent
commands. SAS devices typically 128 concurrent commands per port.
Arrays differ.

If we ignore the RAID controller use case where the controller
internally queues and arbitrates commands between many devices, how is
submitting 1000 concurrent requests to a device which only has 128
command slots going to work?

Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
blindly stop restricting the number of I/Os in flight in the ML, we may
exceed either the capabilities of what the transport protocol can
express or internal device resources.

--
Martin K. Petersen      Oracle Linux Engineering
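
[Editor's note: to make the back-off behaviour concrete, below is a
minimal, hypothetical user-space sketch of the ramp-down/ramp-up idea
described above. It is not the midlayer's actual implementation (the
kernel exposes this through helpers such as scsi_track_queue_full() and
scsi_change_queue_depth()); the struct, function names, and thresholds
here are invented purely for illustration.]

/*
 * Hypothetical model of per-LUN queue depth congestion control:
 * halve the depth when the device reports BUSY/QUEUE_FULL/TASK_SET_FULL,
 * then creep it back up after a run of clean completions.
 */
#include <stdio.h>

#define LUN_MAX_DEPTH   128   /* e.g. SAS: ~128 commands per port */
#define LUN_MIN_DEPTH   1

struct lun_state {
	int queue_depth;       /* current cap on commands in flight */
	int good_completions;  /* completions since the last QUEUE_FULL */
};

/* Device returned BUSY/QUEUE_FULL/TASK_SET_FULL: back off hard. */
static void ramp_down(struct lun_state *lun, int outstanding)
{
	int new_depth = outstanding / 2;

	if (new_depth < LUN_MIN_DEPTH)
		new_depth = LUN_MIN_DEPTH;
	lun->queue_depth = new_depth;
	lun->good_completions = 0;
}

/* Command completed cleanly: raise the depth again, slowly. */
static void ramp_up(struct lun_state *lun)
{
	if (++lun->good_completions >= 64 &&
	    lun->queue_depth < LUN_MAX_DEPTH) {
		lun->queue_depth++;
		lun->good_completions = 0;
	}
}

int main(void)
{
	struct lun_state lun = { .queue_depth = LUN_MAX_DEPTH };

	/* Device chokes with 128 commands outstanding... */
	ramp_down(&lun, 128);
	printf("after QUEUE_FULL: depth=%d\n", lun.queue_depth);

	/* ...then recovers; 64 clean completions earn one extra slot. */
	for (int i = 0; i < 64; i++)
		ramp_up(&lun);
	printf("after recovery:   depth=%d\n", lun.queue_depth);

	return 0;
}

[Halving on QUEUE_FULL and adding back one slot per run of clean
completions gives the usual multiplicative-decrease/additive-increase
shape of congestion control schemes, so the host converges on a depth
the device can actually sustain.]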