On 1/28/23 02:23, Bart Van Assche wrote:
> A summary of my concerns is as follows:
> * The current I/O priority levels (RT, BE, IDLE) apply to all block
> devices. IOPRIO_CLASS_DL is only supported by certain block devices
> (some but not all SCSI harddisks). This forces applications to check the
> capabilities of the storage device before it can be decided whether or
> not IOPRIO_CLASS_DL can be used. This is not something applications
> should do but something the kernel should do. Additionally, if multiple
> dm devices are stacked on top of the block device driver, like in
> Android, it becomes even more cumbersome to check whether or not the
> block device supports CDL.

Yes, RT, BE and IDLE apply to all block devices. And so does CDL, in the sense that if a user specifies the CDL class for IOs to a device that does not support CDL, then nothing special will happen: there will be no differentiation of the IOs. That is *exactly* what happens when using RT, BE or IDLE with the none scheduler (e.g. the default nvme setup). The same remark applies to the RT class mapping to the ATA NCQ priority feature: the user needs to check the device to know if that will happen, *and* also needs to turn on that feature for that mapping to be effective.

The levels of the CDL priority class are also very well defined: they map to the CDL descriptors defined on the drive, which the user can consult through sysfs (no special tools needed), so they are easily discoverable.

As for DM devices, these have no scheduler, so any processing of a priority class by a DM target driver is that driver's responsibility. Initially, all that happens is the block layer passing that information down the stack with the BIOs. That's it. Real action may happen once the IO reaches the physical block device, with the IO scheduler for that device, if one is set. At that level, the none scheduler is of no concern: nothing will happen. Kyber also ignores priorities. We are left with only bfq and mq-deadline.
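To make the mechanics concrete, here is a simplified Python model (not kernel code) of the value an application passes along with its IOs. It packs class and level the way the classic IOPRIO_PRIO_VALUE() macro does, with the class above bit 13 (IOPRIO_CLASS_SHIFT) and the level below it, and uses 4 for IOPRIO_CLASS_DL as proposed in this patch set. The scheduler_view() helper and its return shapes are hypothetical, chosen only to illustrate which part of the value each scheduler acts on as described in this reply; keeping the level unchanged for bfq after the DL-to-RT mapping is an assumption.

```python
# Simplified model of the ioprio value, not kernel code.
# The shift and class numbers 0-3 follow include/uapi/linux/ioprio.h;
# IOPRIO_CLASS_DL = 4 is the value proposed by this patch set.
from typing import Optional, Tuple

IOPRIO_CLASS_SHIFT = 13
IOPRIO_PRIO_MASK = (1 << IOPRIO_CLASS_SHIFT) - 1

IOPRIO_CLASS_NONE = 0
IOPRIO_CLASS_RT = 1
IOPRIO_CLASS_BE = 2
IOPRIO_CLASS_IDLE = 3
IOPRIO_CLASS_DL = 4  # proposed CDL class


def ioprio_value(cls: int, level: int) -> int:
    """Pack a class and a level, like the classic IOPRIO_PRIO_VALUE() macro."""
    return (cls << IOPRIO_CLASS_SHIFT) | (level & IOPRIO_PRIO_MASK)


def ioprio_class(ioprio: int) -> int:
    """Extract the priority class (bits above IOPRIO_CLASS_SHIFT)."""
    return ioprio >> IOPRIO_CLASS_SHIFT


def ioprio_level(ioprio: int) -> int:
    """Extract the level; for the DL class this is a CDL descriptor index."""
    return ioprio & IOPRIO_PRIO_MASK


def scheduler_view(scheduler: str, ioprio: int) -> Optional[Tuple[int, ...]]:
    """Hypothetical helper: the part of ioprio a scheduler acts on.

    The value itself travels down the stack unchanged (DM targets included);
    only the scheduler of the physical device may interpret it.
    """
    cls, level = ioprio_class(ioprio), ioprio_level(ioprio)
    if cls == IOPRIO_CLASS_DL:
        # Current patch set: schedulers treat the CDL class like RT.
        cls = IOPRIO_CLASS_RT
    if scheduler in ("none", "kyber"):
        return None  # priorities are ignored
    if scheduler == "mq-deadline":
        return (cls,)  # class only, level ignored
    if scheduler == "bfq":
        return (cls, level)  # class and level (level kept as-is: assumption)
    raise ValueError(f"unknown scheduler: {scheduler}")
```

For example, ioprio_value(IOPRIO_CLASS_DL, 3) gives 32771, that is (4 << 13) | 3. For that value, scheduler_view() returns None for "none" and "kyber", (IOPRIO_CLASS_RT,) for "mq-deadline" and (IOPRIO_CLASS_RT, 3) for "bfq". In this model, only bfq and mq-deadline ever act on the value.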
The latter only cares about the priority class, ignoring levels; bfq acts on both class and level. IOPRIO_CLASS_DL is equal to 4, so strictly speaking, it is of lower priority than the IDLE class if you want to consider it as part of that ordering. But we defined it as a different class precisely so that we do *not* have to do that. IO schedulers can be modified to ignore that priority class for now, mapping it to, say, the default BE class. Our current patch set maps the CDL class to the RT class for the schedulers, as that made the most sense given the time-sensitive nature of CDL workloads. But we can change that to let the scheduler decide, if you want.

There are no other changes in the block layer that have or need special handling of the CDL class. All very clean in my opinion: no special conditions for that feature, no additional "if" in the hot path, no overhead added.

> * For the RT, BE and IDLE classes, it is well defined which priority
> number represents a high priority and which priority number represents a
> low priority. For CDL, only the drive knows the priority details. I
> think that application software should be able to select a DL priority
> without having to read the CDL configuration first.

The levels of the CDL priority class are also very well defined: they map to the CDL descriptors defined on the drive, which the user can consult through sysfs (no special tools needed), so they are easily discoverable. And unless we restrict how CDL descriptors can be defined, which, as I explained in my previous email, is not desirable at all, we cannot and should not try to order levels according to some sort of priority semantic. CDL semantic does not directly define a priority level, only time limits, which may or may not be ordered, depending on the limit definitions.

As Niklas pointed out, this is not a "generic" feature that any random application can magically use without modifications.
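Since an application using CDL knows which limits it was designed for, checking the drive amounts to reading the descriptors exposed through sysfs and comparing them. The Python sketch below assumes a hypothetical layout with one numbered directory per descriptor and one integer file per limit; the root directory, the layout and the attribute name duration_limit_ms are placeholders, not the actual sysfs interface added by the patch set.

```python
# Sketch of a CDL descriptor check against a HYPOTHETICAL sysfs layout:
#   <root>/<descriptor index>/<limit attribute>  (one integer per file)
from pathlib import Path
from typing import Dict, Mapping


def read_descriptors(root: Path) -> Dict[int, Dict[str, int]]:
    """Read every <root>/<index>/<attr> file into {index: {attr: value}}."""
    descs: Dict[int, Dict[str, int]] = {}
    for d in sorted(root.iterdir()):
        if d.is_dir() and d.name.isdigit():
            descs[int(d.name)] = {
                f.name: int(f.read_text().strip())
                for f in d.iterdir()
                if f.is_file()
            }
    return descs


def descriptors_match(root: Path,
                      expected: Mapping[int, Mapping[str, int]]) -> bool:
    """True if every expected descriptor limit is present with that value."""
    actual = read_descriptors(root)
    return all(
        actual.get(idx, {}).get(attr) == value
        for idx, limits in expected.items()
        for attr, value in limits.items()
    )
```

An application would run this check once, with its own expected table, before issuing CDL-class IOs; no probing of the device itself is involved.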
The application must be aware of what CDL is and of how the descriptors are defined. And for 99.99% of the use cases, the CDL descriptors will be defined in a way that is useful for that application. There is no magic generic set of descriptors defined by default, though a simple set of increasing time limits can be cleanly mapped to priority levels. A system administrator is free to set that up for the system drives if that is what the running applications expect. CDL is a very flexible feature that can cover a lot of use cases. Trying to shoehorn it into the legacy/classic priority semantic framework would only restrict its usefulness.

> I hope that I have it made it clear that I think that the proposed user
> space API will be very painful to use for application developers.

I completely disagree. Reusing the prio class/level API made it easy to allow applications to use the feature. fio support for CDL requires exactly *one line* of change, to allow for the CDL class number 4. That's it. From there, one can use the --cmdprio_class=4 and --cmdprio=idx options to exercise a drive. The value of "idx" here of course depends on how the descriptors are set on the drive.

But back to the point above: this depends on the application goals, and the descriptors are set accordingly for that goal. There is no real discovery needed by the application. The application expects a certain set of CDL limits for its use case, and checking that this set is the one currently defined on the drive is easy to do from an application with the sysfs interface we added.

Many users out there have deployed and are using applications that take advantage of the ATA NCQ priority feature, using class RT for high priority IOs. The new CDL class does not require many application changes to be enabled for next-gen drives that will have CDL.

>
> Bart.

--
Damien Le Moal
Western Digital Research