On 1/27/23 02:33, Bart Van Assche wrote:
> On 1/26/23 05:53, Niklas Cassel wrote:
>> On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote:
>>> But again, the difficulty with this overloading is that we *cannot*
>>> implement a solid level-based scheduling in IO schedulers because
>>> ordering the CDLs in a meaningful way is impossible. So BFQ handling of
>>> the RT class would likely not result in the most ideal scheduling (that
>>> would depend heavily on how the CDL descriptors are defined on the
>>> drive). Hence my reluctance to overload the RT class for CDL.
>>
>> Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either
>> have to disable the IO scheduler, so that lower classdata levels wouldn't
>> be prioritized over higher classdata levels, or simply use an IO scheduler
>> that does not care about the classdata level, e.g. mq-deadline.
>
> How about making the information about whether or not CDL has been
> enabled available to the scheduler such that the scheduler can include
> that information in its decisions?

Sure, that is easy to do. But as I mentioned before, I think that is
something we can do after this initial support series.

>> However, for CDL, things are not as simple as setting a single bit in the
>> command, because of all the different descriptors, so we must let the
>> classdata represent the device side priority level, and not the host side
>> priority level (as we cannot have both, and I agree with you, it is very
>> hard to define an order between the descriptors... e.g. should a 20 ms
>> policy 0xf descriptor be ranked higher or lower than a 20 ms policy 0xd
>> descriptor?).
>
> How about only supporting a subset of the standard such that it becomes
> easy to map CDLs to host side priority levels?

I am opposed to this, for several reasons:

1) We are seeing different use cases from users that cover a wide range of
   uses of CDL descriptors with various definitions.

2) Passthrough commands can be used by a user to change a drive's CDL
   descriptors without the kernel knowing about it, unless we spend our
   time revalidating the CDL descriptor log page(s)...

3) The CDL standard as-is is actually very sensible and not overloaded with
   stuff that is only useful in niche use cases. For each CDL descriptor,
   you have:

   * The active time limit, which is a clean way to specify how much time
     you allow a drive to spend dealing with bad sectors (mostly the read
     case). A typical HDD will always try very hard to recover data from a
     sector. As a result, the HDD may spend up to several seconds reading a
     sector again and again, applying different signal processing
     techniques, until the sector ECC checks out and valid data can be
     returned. That can of course hugely increase the IO latency seen by
     the host. In applications such as erasure-coded distributed object
     stores, the maximum latency for an object access can thus be kept low
     using this limit without compromising the data, since the object can
     always be rebuilt from the erasure codes if one HDD is slow to
     respond. This limit is also interesting for video streaming/playback
     to avoid video buffer underflows (at the expense of maybe some block
     noise, depending on the codec).

   * The inactive time limit can be used to tell the drive how long it is
     allowed to let a command sit in its internal queue before processing
     it. This is thus a parameter that allows a host to tune the drive's
     RPO (rotational positioning optimization, i.e. the HDD internal
     command scheduling based on the angular sector position on tracks with
     respect to the current head position). This is a neat way to control
     max IOPS vs tail latency, since drives tend to privilege maximizing
     IOPS over lowering max tail latency.

   * The duration guideline limit defines an overall time limit for a
     command without distinguishing between active and inactive time. It is
     the easiest one to use (and the easiest to understand from a beginner
     user's point of view). This is in fact a neat way to define
     intelligent IO prioritization, way better than RT class scheduling on
     the host or the use of ATA NCQ high priority, as it provides more
     information to the drive about the urgency of a particular command.
     That allows the drive to still perform RPO to maximize IOPS without
     long tail latencies. Chaining such a limit with an active+inactive
     time limit descriptor using the "next limit" policy (policy 0x1) can
     also finely define what the drive should do if the guideline limit is
     exceeded (as the next descriptor can define what to do based on the
     reason for the limit being exceeded: long internal queueing vs a bad
     sector's long access time).

> If users really need the ability to use all standardized CDL features
> and if there is no easy way to map CDL levels to an I/O priority, is the
> I/O priority mechanism really the best basis for a user space interface
> for CDLs?

As you can see above, yes, we need everything and should not attempt to
restrict CDL use.

The IO priority interface is a perfect fit for CDL in the sense that all we
need to pass along from user to device is one number: the CDL index to use
for a command. So creating a different interface for this while the IO
priority interface does exactly that sounds silly to me.
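To be concrete, here is a minimal sketch of what "passing one number" looks
like from user space through the existing ioprio_set() syscall. The
IOPRIO_CLASS_DL name and its numeric value below are placeholders for
whatever dedicated class the series ends up defining (not the final ABI),
and descriptor index 3 is just an example of a slot configured on the
drive:

  #include <unistd.h>
  #include <sys/syscall.h>
  #include <stdio.h>

  /* Illustrative values only, not the final ABI */
  #define IOPRIO_WHO_PROCESS             1
  #define IOPRIO_CLASS_SHIFT             13
  #define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))
  #define IOPRIO_CLASS_DL                4  /* hypothetical dedicated CDL class */

  int main(void)
  {
          /*
           * Tag all subsequent IOs of the calling thread with CDL
           * descriptor index 3, i.e. a descriptor that the admin
           * configured on the drive.
           */
          if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                      IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, 3)) < 0) {
                  perror("ioprio_set");
                  return 1;
          }

          /* Any read()/write() issued from here on carries CDL index 3 */
          return 0;
  }

Per-IO tagging can work the same way through the per-request priority
fields that aio and io_uring already expose, so applications that already
use IO priorities do not need a new code path.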
One compromise we could do is: have the IO schedulers completely ignore the
CDL prio class for now, that is, have them assume that no IO prio
class/level was specified. Given that they are not tuned to handle CDL well
anyway, this is probably the best thing to do for now. We still need the
block layer to prevent merging of requests with different CDL descriptors
though, which is another reason to reuse the IO prio interface, as the
block layer already does this. Less code, which is always a good thing.

>
> Thanks,
>
> Bart.

--
Damien Le Moal
Western Digital Research