On 1/29/23 05:25, Martin K. Petersen wrote:
> 
> Damien,
> 
> Finally had a window where I could sit down and read this extremely long
> thread.
> 
>> I am not denying anything. I simply keep telling you that CDL is not a
>> generic feature for random applications to use, including those that
>> already use RT/BE/IDLE. It is for applications that know and expect
>> it, and so have a setup suited for CDL use down to the drive CDL
>> descriptors. That includes DM setups.
>>
>> Thinking about CDL in a generic setup for any random application to
>> use is nonsense. And even if that happens and a user not knowing about
>> it still tries it, then as mentioned, nothing bad will happen. Using
>> CDL in a setup that does not support it is a NOP. That would be the
>> same as not using it.
> 
> My observations:
> 
> - Wrt. ioprio as conduit, I personally really dislike the idea of
>   conflating priority (relative performance wrt. other I/O) with CDL
>   (which is a QoS concept). I would really prefer those things to be
>   separate. However, I do think that the ioprio *interface* is a good
>   fit. A tool like ionice seems like a reasonable approach to letting
>   generic applications set their CDL.

The definition of IOPRIO_CLASS_CDL was more about reusing the ioprio
*interface* rather than having CDL support defined as a fully functional
IO priority class. As I argued in this thread, and I think you agree,
the CDL semantic is more than the simple priority class/level ordering.

>   If bio space wasn't a premium, I'd say just keep things separate.
>   But given the inherent incompatibility between kernel I/O scheduling
>   and CDL, it's probably not worth the hassle to separate them. As much
>   as it pains me to mix two concepts which should be completely
>   orthogonal.
> 
>   I wish we could let applications specify both a priority and a CDL at
>   the same time, though. Even if the kernel plumbing in the short term
>   ends up using bi_ioprio as conduit. It's much harder to make changes
>   in the application interface at a later date.

See below. There may be a solution for that.

> - Wrt. "CDL is not a generic feature", I think you are underestimating
>   how many applications actually want something like this. We have
>   many.
> 
>   I don't think we should design for "special interest only, needs
>   custom device tweaking to be usable". We have been down that path
>   before (streams, etc.). And with poor results.

OK.

>   I/O hints also tanked but at least we tried to pre-define performance
>   classes that made sense in an abstract fashion. And programmed the
>   mode page on discovered devices so that the classes were identical
>   across all disks, regardless of whether they were SSDs or million
>   dollar arrays. This allowed filesystems to communicate "this is
>   metadata" regardless of the device the I/O was going to. Instead of
>   "index 5 on this device" but "index 42 on the mirror".
> 
>   As such, I don't like the "just customize your settings with
>   cdltools" approach. I'd much rather see us try to define a few QoS
>   classes that make sense that would apply to every app and use those
>   to define the application interface. And then have the kernel program
>   those CDL classes into SCSI/ATA devices by default.

Makes sense. Though I think it will be hard to define a set of QoS hints
that are useful for a wide range of applications, and even harder to
convert the defined hint classes to CDL descriptors. I fear that we may
end up with the same issues as IO hints/streams.
>   Having the kernel provide an abstract interface for bio QoS and
>   configuring a new disk with a sane handful of classes does not
>   prevent $CLOUD_VENDOR from overriding what Linux configured. But at
>   least we'd have a generic approach to block QoS in Linux. Similar to
>   the existing I/O priority infrastructure which is also not tied to
>   any particular hardware feature.

OK. See below about this.

>   A generic implementation also allows us to do fancy things in the
>   hypervisor where we would like to be able to do QoS across multiple
>   devices as well. Without having ATA or SCSI with CDL involved. Or
>   whatever things might look like in NVMe.

Fair point, especially given that virtio already forwards a guest
ioprio to the host through the virtio block command.

Thinking about that point together with what you said, I came up with
the change shown below as a replacement for this patch 1/18. It changes
the 13 bits of ioprio data into a 3-bit QOS hint plus a 3-bit IO
priority level. This is consistent with the IO priority interface since
IO priority levels have to be between 0 and 7 (otherwise an error is
returned). So in fact, the upper 10 bits of the ioprio data are ignored
and we can safely use 3 of these bits for an IO hint.

This hint applies to all priority classes and levels. That is, for the
CDL case, we can enrich any priority with a hint that specifies the CDL
index to use for an IO.

This falls short of actually defining generic IO hints, but it has the
advantage of not breaking anything for current applications that use IO
priorities and of not requiring any change to the existing IO
schedulers, while still allowing CDL indexes for IOs to be passed down
to the SCSI and ATA layers (which for now would be the only layers in
the kernel acting on the ioprio QOS hints).

I think that this approach still allows us to enable CDL support and,
on top of it, to go further and define generic QOS hints that IO
schedulers can use and that also potentially map to CDL for SCSI and
ATA (similarly to RT class IOs mapping to the NCQ priority feature when
the user enables that feature). As mentioned above, I think that
defining generic IO hint classes will be difficult. But the change
below is, I think, a good starting point that should not prevent
working on that.

Thoughts ?

Bart,

Given that you did not like IOPRIO_CLASS_CDL, what do you think of this
approach ?
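To make the proposed layout concrete, here is a small, hypothetical
application-side sketch (not part of the kernel patch below) showing how
a process could set both a BE priority level and a QOS hint through the
existing ioprio_set() syscall. The ioprio_value() helper and the choice
of hint-to-CDL-descriptor mapping are made up for illustration only;
the bit positions assume the macros added in ioprio.h below.

/*
 * Hypothetical example: set a best-effort priority level together with
 * a QOS hint for the calling process, assuming the proposed layout
 * (level in bits 0-2, QOS hint in bits 10-12, class in bits 13-15).
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT	13	/* existing: class in bits 13-15 */
#define IOPRIO_CLASS_BE		2	/* existing: best-effort class */
#define IOPRIO_WHO_PROCESS	1	/* existing: target is a process */
#define IOPRIO_QOS_HINT_SHIFT	10	/* proposed: hint in bits 10-12 */

/* Helper made up for this example: combine class, level and QOS hint */
static int ioprio_value(int class, int level, int qos_hint)
{
	return (class << IOPRIO_CLASS_SHIFT) |
	       (qos_hint << IOPRIO_QOS_HINT_SHIFT) | level;
}

int main(void)
{
	/* BE level 4 with QOS hint 2 (which could e.g. select a CDL descriptor) */
	int ioprio = ioprio_value(IOPRIO_CLASS_BE, 4, 2);

	/* ioprio_set() has no glibc wrapper, so go through syscall() */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio) < 0) {
		perror("ioprio_set");
		return 1;
	}
	printf("ioprio set to 0x%x\n", ioprio);
	return 0;
}

Since the hint lives in bits that the existing code ignores, an
application doing only the classic IOPRIO_PRIO_VALUE(class, level)
composition keeps working unchanged.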
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ccf2204477a5..9b3c8fb806f1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5378,11 +5378,11 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
 		bfqq->new_ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 		bfqq->new_ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
@@ -5671,7 +5671,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 				       struct bfq_io_cq *bic,
 				       bool respawn)
 {
-	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+	const int ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
 	struct bfq_queue **async_bfqq = NULL;
 	struct bfq_queue *bfqq;
diff --git a/block/ioprio.c b/block/ioprio.c
index 32a456b45804..33f327a10811 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -33,7 +33,7 @@ int ioprio_check_cap(int ioprio)
 {
 	int class = IOPRIO_PRIO_CLASS(ioprio);
-	int data = IOPRIO_PRIO_DATA(ioprio);
+	int level = IOPRIO_PRIO_LEVEL(ioprio);
 
 	switch (class) {
 	case IOPRIO_CLASS_RT:
@@ -49,13 +49,13 @@ int ioprio_check_cap(int ioprio)
 		fallthrough;
 		/* rt has prio field too */
 	case IOPRIO_CLASS_BE:
-		if (data >= IOPRIO_NR_LEVELS || data < 0)
+		if (level >= IOPRIO_NR_LEVELS || level < 0)
 			return -EINVAL;
 		break;
 	case IOPRIO_CLASS_IDLE:
 		break;
 	case IOPRIO_CLASS_NONE:
-		if (data)
+		if (level)
 			return -EINVAL;
 		break;
 	default:
diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index 83a366f3ee80..eba23e6a7bf6 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -314,7 +314,7 @@ static ssize_t f2fs_sbi_show(struct f2fs_attr *a,
 	struct ckpt_req_control *cprc = &sbi->cprc_info;
 	int len = 0;
 	int class = IOPRIO_PRIO_CLASS(cprc->ckpt_thread_ioprio);
-	int data = IOPRIO_PRIO_DATA(cprc->ckpt_thread_ioprio);
+	int level = IOPRIO_PRIO_LEVEL(cprc->ckpt_thread_ioprio);
 
 	if (class == IOPRIO_CLASS_RT)
 		len += scnprintf(buf + len, PAGE_SIZE - len, "rt,");
@@ -323,7 +323,7 @@ static ssize_t f2fs_sbi_show(struct f2fs_attr *a,
 	else
 		return -EINVAL;
 
-	len += scnprintf(buf + len, PAGE_SIZE - len, "%d\n", data);
+	len += scnprintf(buf + len, PAGE_SIZE - len, "%d\n", level);
 
 	return len;
 }
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index f70f2596a6bf..1d90349a19c9 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -37,6 +37,18 @@ enum {
 #define IOPRIO_NR_LEVELS	8
 #define IOPRIO_BE_NR		IOPRIO_NR_LEVELS
 
+/*
+ * The 13-bits of ioprio data for each class provide up to 8 QOS hints and
+ * up to 8 priority levels.
+ */
+#define IOPRIO_PRIO_LEVEL_MASK	(IOPRIO_NR_LEVELS - 1)
+#define IOPRIO_QOS_HINT_SHIFT	10
+#define IOPRIO_NR_QOS_HINTS	8
+#define IOPRIO_QOS_HINT_MASK	(IOPRIO_NR_QOS_HINTS - 1)
+#define IOPRIO_PRIO_LEVEL(ioprio)	((ioprio) & IOPRIO_PRIO_LEVEL_MASK)
+#define IOPRIO_QOS_HINT(ioprio)	\
+	(((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK)
+
 enum {
 	IOPRIO_WHO_PROCESS = 1,
 	IOPRIO_WHO_PGRP,

-- 
Damien Le Moal
Western Digital Research