On 1/27/23 02:33, Bart Van Assche wrote:
> On 1/26/23 05:53, Niklas Cassel wrote:
>> On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote:
>>> But again, the difficulty with this overloading is that we *cannot*
>>> implement a solid level-based scheduling in IO schedulers because
>>> ordering the CDLs in a meaningful way is impossible. So BFQ handling of
>>> the RT class would likely not result in the most ideal scheduling (that
>>> would depend heavily on how the CDL descriptors are defined on the
>>> drive). Hence my reluctance to overload the RT class for CDL.
>>
>> Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either
>> have to disable the IO scheduler, so that lower classdata levels wouldn't
>> be prioritized over higher classdata levels, or simply use an IO scheduler
>> that does not care about the classdata level, e.g. mq-deadline.
>
> How about making the information about whether or not CDL has been
> enabled available to the scheduler such that the scheduler can include
> that information in its decisions?

Sure, that is easy to do. But as I mentioned before, I think that is
something we can do after this initial support series.

>> However, for CDL, things are not as simple as setting a single bit in the
>> command, because of all the different descriptors, so we must let the
>> classdata represent the device side priority level, and not the host side
>> priority level (as we cannot have both, and I agree with you, it is very
>> hard to define an order between the descriptors... e.g. should a 20 ms
>> policy 0xf descriptor be ranked higher or lower than a 20 ms policy 0xd
>> descriptor?).
>
> How about only supporting a subset of the standard such that it becomes
> easy to map CDLs to host side priority levels?

I am opposed to this, for several reasons:

1) We are seeing different use cases from users that cover a wide range of
   uses of CDL descriptors with various definitions.

2) Passthrough commands can be used by a user to change a drive's CDL
   descriptors without the kernel knowing about it, unless we spend our
   time revalidating the CDL descriptor log page(s)...

3) The CDL standard as-is is actually very sensible and not overloaded with
   stuff that is only useful in niche use cases. For each CDL descriptor,
   you have:

   * The active time limit, which is a clean way to specify how much time
     you allow a drive to spend dealing with bad sectors (mostly the read
     case). A typical HDD will always try very hard to recover data from a
     sector. As a result, the HDD may spend up to several seconds reading a
     sector again and again, applying different signal processing
     techniques, until the sector ECC checks out and valid data can be
     returned. That can of course hugely increase the IO latency seen by
     the host. In applications such as erasure-coded distributed object
     stores, the maximum latency for an object access can thus be kept low
     using this limit without compromising the data, since the object can
     always be rebuilt from the erasure codes if one HDD is slow to
     respond. This limit is also interesting for video streaming/playback
     to avoid video buffer underflows (at the expense of maybe some block
     noise, depending on the codec).

   * The inactive time limit can be used to tell the drive how long it is
     allowed to let a command sit in its internal queue before processing
     it. This is thus a parameter that allows a host to tune the drive's
     RPO (rotational positioning optimization, i.e. the HDD internal
     command scheduling based on the angular sector position on tracks with
     respect to the current head position). This is a neat way to control
     max IOPS vs tail latency, since drives tend to privilege maximizing
     IOPS over lowering max tail latency.

   * The duration guideline limit defines an overall time limit for a
     command without distinguishing between active and inactive time. It is
     the easiest one to use (and the easiest to understand from a beginner
     user's point of view). This is in fact a neat way to define
     intelligent IO prioritization, way better than RT class scheduling on
     the host or the use of ATA NCQ high priority, as it provides more
     information to the drive about the urgency of a particular command.
     That allows the drive to still perform RPO to maximize IOPS without
     long tail latencies. Chaining such a limit with an active+inactive
     time limit descriptor using the "next limit" policy (policy 0x1) can
     also finely define what the drive should do if the guideline limit is
     exceeded (as the next descriptor can define what to do based on the
     reason for the limit being exceeded: long internal queueing vs a bad
     sector's long access time).

> If users really need the ability to use all standardized CDL features
> and if there is no easy way to map CDL levels to an I/O priority, is the
> I/O priority mechanism really the best basis for a user space interface
> for CDLs?

As you can see above, yes, we need everything and should not attempt to
restrict CDL use.

The IO priority interface is a perfect fit for CDL in the sense that all we
need to pass along from user to device is one number: the CDL index to use
for a command. So creating a different interface for this while the IO
priority interface does exactly that sounds silly to me.
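To be concrete, here is a minimal sketch of what "passing one number" looks
like from user space through the existing ioprio_set() syscall. The
IOPRIO_CLASS_DL name and its numeric value below are placeholders for
whatever dedicated class the series ends up defining (not the final ABI),
and descriptor index 3 is just an example of a slot configured on the
drive:

  #include <unistd.h>
  #include <sys/syscall.h>
  #include <stdio.h>

  /* Illustrative values only, not the final ABI */
  #define IOPRIO_WHO_PROCESS             1
  #define IOPRIO_CLASS_SHIFT             13
  #define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))
  #define IOPRIO_CLASS_DL                4  /* hypothetical dedicated CDL class */

  int main(void)
  {
          /*
           * Tag all subsequent IOs of the calling thread with CDL
           * descriptor index 3, i.e. a descriptor that the admin
           * configured on the drive.
           */
          if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                      IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, 3)) < 0) {
                  perror("ioprio_set");
                  return 1;
          }

          /* Any read()/write() issued from here on carries CDL index 3 */
          return 0;
  }

Per-IO tagging can work the same way through the per-request priority
fields that aio and io_uring already expose, so applications that already
use IO priorities do not need a new code path.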
One compromise we could do is: have the IO schedulers completely ignore the
CDL prio class for now, that is, have them assume that no IO prio
class/level was specified. Given that they are not tuned to handle CDL well
anyway, this is probably the best thing to do for now. We still need the
block layer to prevent merging of requests with different CDL descriptors
though, which is another reason to reuse the IO prio interface, as the
block layer already does this. Less code, which is always a good thing.

>
> Thanks,
>
> Bart.

--
Damien Le Moal
Western Digital Research