On 8/12/21 7:18 PM, Damien Le Moal wrote:
Let me throw in more information related to this. Command duration limits (CDL) and Sequestered commands features are being drafted in SPC/SBC and ACS to give the device better hints than just an on/off high priority bit. I am currently prototyping these features and I am reusing the ioprio interface for that. Here is how this works:
1) The drive exposes a set of command duration limit descriptors (up to 7 for reads and 7 for writes) that define duration limits for a command execution: overall processing time, queuing time and execution time. Each duration limit has a policy associated with it that is applied if the processing of a command exceeds one of the defined time limits: continue, continue but signal limit exceeded, abort.
2) Users can change the drive command duration limits to whatever they need (e.g. change the policies for the limits to get a fast-fail behavior for commands instead of having the drive retry for a long time).
3) When issuing IOs, users (or FSes) can apply a command duration limit descriptor by specifying the IOPRIO_CLASS_DL priority class. The priority level for that class indicates the descriptor to apply to the command.
4) At SCSI/ATA level, read and write commands have 3 bits defined to specify the command descriptor to apply to the command (1 to 7, or 0 for no limit).
With that in place, the disk firmware can now make more intelligent decisions on command scheduling to keep performance high at high queue depth without increasing latency for commands that have low duration limits. And based on the policy defined for a limit, this can be a "soft" best-effort optimization by the disk, or a hard one with aborts if the drive decides that what the user is asking for is not possible.
CDL can completely replace the existing binary on/off NCQ priority in a more flexible manner, as the user can set different duration limits for high vs low priority. E.g. high priority is equivalent to a very short limit while low priority is equivalent to longer or no limits.
I think that CDL has the potential for better interactions with cgroups, as cgroup controllers can install a set of limits on the drive that fits the controller target policy. E.g., the latency controller can set duration limits and use the IOPRIO_CLASS_DL class to tell the drive the exact latency target to use.
In my implementation, I have not yet looked into cgroups integration for CDL though. I am still wondering what the best approach is: defining a new controller or integrating into existing controllers. The former is likely easier than the latter, but having hardware support for existing controllers has the potential to improve them seamlessly without forcing the user to change anything in their application setup.
CDL is still in draft state in the specs though. So I will not be sending this yet.
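
To make steps 3) and 4) above concrete, here is a minimal sketch of how a descriptor index could travel through the ioprio value and end up in the 3-bit command field. The 13-bit class shift matches the existing Linux ioprio layout, but the numeric value of IOPRIO_CLASS_DL is an assumption (the prototype is unpublished), and the helper names dl_ioprio() and command_dl_field() are hypothetical.

```python
# Existing Linux ioprio layout: class in the upper bits, level in the low 13 bits.
IOPRIO_CLASS_SHIFT = 13
IOPRIO_PRIO_MASK = (1 << IOPRIO_CLASS_SHIFT) - 1
IOPRIO_CLASS_RT = 1  # existing real-time class, used here only for contrast

# ASSUMPTION: the value of the proposed class is not stated in this thread;
# 4 is a placeholder for illustration.
IOPRIO_CLASS_DL = 4

# From the description: descriptors 1 to 7, with 0 meaning "no limit".
CDL_MAX_DESCRIPTOR = 7


def ioprio_value(ioprio_class, level):
    """Pack a class and a level into one ioprio value."""
    return (ioprio_class << IOPRIO_CLASS_SHIFT) | (level & IOPRIO_PRIO_MASK)


def dl_ioprio(descriptor):
    """Hypothetical helper: ioprio value selecting a duration limit descriptor."""
    if not 1 <= descriptor <= CDL_MAX_DESCRIPTOR:
        raise ValueError("descriptor index must be in 1..7")
    return ioprio_value(IOPRIO_CLASS_DL, descriptor)


def command_dl_field(ioprio):
    """Hypothetical helper: 3-bit field the SCSI/ATA layer would put in the command.

    Returns the descriptor index (1..7), or 0 (no limit) when the IO does
    not use the duration limit class or carries an out-of-range level.
    """
    if (ioprio >> IOPRIO_CLASS_SHIFT) != IOPRIO_CLASS_DL:
        return 0
    level = ioprio & IOPRIO_PRIO_MASK
    return level if 1 <= level <= CDL_MAX_DESCRIPTOR else 0


if __name__ == "__main__":
    prio = dl_ioprio(3)
    print(hex(prio), command_dl_field(prio))
    print(command_dl_field(ioprio_value(IOPRIO_CLASS_RT, 0)))
```

An IO submitted with any other priority class maps to field value 0 in this sketch, i.e. no duration limit is requested for that command.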
Thanks Damien for having provided this additional information. This is very helpful. I see this as a welcome evolution since the disk firmware has more information than the CPU (e.g. about the disk head position) and hence can make a better decision than an I/O scheduler or cgroup policy.
For the cloud use case, are all disks used to implement disaggregated storage? I'm asking this because in a disaggregated storage setup the I/O submitter runs on a different server than the one to which the disks are connected. In such a setup I expect that the I/O priority will be provided from user space instead of being provided by a cgroup.
Bart.