Hi,
I see an important interface problem.  Userspace has been waiting for
io.weight to eventually become the file name for setting the weight of
the proportional-share policy [1,2].  If you use that name, how will we
solve this?

Thanks,
Paolo

[1] https://github.com/systemd/systemd/issues/7057#issuecomment-522747575
[2] https://github.com/systemd/systemd/pull/13335#issuecomment-523035303

> On 29 Aug 2019, at 00:05, Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Changes from v2[2]:
>
> * Fixed a divide-by-zero bug in current_hweight().
>
> * pre_start_time and friends were renamed to alloc_time and now have
>   their own CONFIG option which is selected by IOCOST.
>
> Changes from v1[1]:
>
> * Prerequisite patchsets had cosmetic changes and were merged.
>   Refreshed on top.
>
> * Renamed from ioweight to iocost.  All source code and tools are
>   updated accordingly.  Control knobs io.weight.qos and
>   io.weight.cost_model are renamed to io.cost.qos and io.cost.model
>   respectively.  This is a more fitting name which won't become a
>   misnomer when, for example, cost based io.max is added.
>
> * Various bug fixes and improvements.  A few bugs were discovered
>   while testing against a high-iops nvme device.  Auto parameter
>   selection improved and verified across different classes of SSDs.
>
> * Dropped bpf iocost support for now.
>
> * Added coef generation script.
>
> * Verified on a high-iops nvme device.  Results are included below.
>
> One challenge of controlling IO resources is the lack of a trivially
> observable cost metric.  This is distinguished from CPU and memory
> where wallclock time and the number of bytes can serve as accurate
> enough approximations.
>
> Bandwidth and iops are the most commonly used metrics for IO devices
> but, depending on the type and specifics of the device, different IO
> patterns easily lead to multiple orders of magnitude variations,
> rendering them useless for the purpose of IO capacity distribution.
> While on-device time, with a lot of crutches, could serve as a useful
> approximation for non-queued rotational devices, this is no longer
> viable with modern devices, even the rotational ones.
>
> While there is no cost metric we can trivially observe, it isn't a
> complete mystery.  For example, on a rotational device, seek cost
> dominates while a contiguous transfer contributes a smaller amount
> proportional to the size.  If we can characterize at least the
> relative costs of these different types of IOs, it should be possible
> to implement a reasonable work-conserving proportional IO resource
> distribution.
>
> This patchset implements an IO cost model based work-conserving
> proportional controller.  It currently has a simple linear cost model
> builtin where each IO is classified as sequential or random and given
> a base cost accordingly, with an additional size-proportional cost
> added on top.  Each IO is given a cost based on the model and the
> controller issues IOs for each cgroup according to their hierarchical
> weight.
>
> By default, the controller adapts its overall IO rate so that it
> doesn't build up buffer bloat in the request_queue layer, which
> guarantees that the controller doesn't lose a significant amount of
> total work.  However, this may not provide sufficient differentiation
> as the underlying device may have a deep queue and not be fair in how
> the queued IOs are serviced.  The controller provides extra QoS
> control knobs which allow tightening the control feedback loop as
> necessary.
>
> For more details on the control mechanism, implementation and
> interface, please refer to the comment at the top of
> block/blk-iocost.c and Documentation/admin-guide/cgroup-v2.rst changes
> in the "blkcg: implement blk-iocost" patch.
>
> Here are some test results.  Each test run goes through the following
> combinations with each combination running for a minute.  All tests
> are performed against regular files on btrfs w/ deadline as the IO
> scheduler.
> Random IOs are direct w/ queue depth of 64.  Sequential IOs are
> normal buffered IOs.
>
>         high priority (weight=500)    low priority (weight=100)
>
>         Rand read                     None
>         ditto                         Rand read
>         ditto                         Seq read
>         ditto                         Rand write
>         ditto                         Seq write
>         Seq read                      None
>         ditto                         Rand read
>         ditto                         Seq read
>         ditto                         Rand write
>         ditto                         Seq write
>         Rand write                    None
>         ditto                         Rand read
>         ditto                         Seq read
>         ditto                         Rand write
>         ditto                         Seq write
>         Seq write                     None
>         ditto                         Rand read
>         ditto                         Seq read
>         ditto                         Rand write
>         ditto                         Seq write
>
> * 7200RPM SATA hard disk
>   * No IO control
>     https://photos.app.goo.gl/1KBHn7ykpC1LXRkB8
>   * iocost, QoS: None
>     https://photos.app.goo.gl/MLNQGxCtBQ8wAmjm7
>   * iocost, QoS: rpct=95.00 rlat=40000 wpct=95.00 wlat=40000 min=25.00 max=200.00
>     https://photos.app.goo.gl/XqXHm3Mkbm9w6Db46
> * NCQ-blacklisted SATA SSD (QD==1)
>   * No IO control
>     https://photos.app.goo.gl/wCTXeu2uJ6LYL4pk8
>   * iocost, QoS: None
>     https://photos.app.goo.gl/T2HedKD2sywQgj7R9
>   * iocost, QoS: rpct=95.00 rlat=20000 wpct=95.00 wlat=20000 min=50.00 max=200.00
>     https://photos.app.goo.gl/urBTV8XQc1UqPJJw7
> * SATA SSD (QD==32)
>   * No IO control
>     https://photos.app.goo.gl/TjEVykuVudSQcryh6
>   * iocost, QoS: None
>     https://photos.app.goo.gl/iyQBsky7bmM54Xiq7
>   * iocost, QoS: rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=400.00
>     https://photos.app.goo.gl/q1a6URLDxPLMrnHy5
> * NVME SSD (ran with 8 concurrent fio jobs to achieve saturation)
>   * No IO control
>     https://photos.app.goo.gl/S6xjEVTJzcfb3w1j7
>   * iocost, QoS: None
>     https://photos.app.goo.gl/SjQUUotJBAGr7vqz7
>   * iocost, QoS: rpct=95.00 rlat=5000 wpct=95.00 wlat=5000 min=1.00 max=10000.00
>     https://photos.app.goo.gl/RsaYBd2muX7CegoN7
>
> Even without explicit QoS configuration, read-heavy scenarios can
> obtain acceptable differentiation.  However, when write-heavy, the
> deep buffering on the device side makes it difficult to maintain
> control.  With QoS parameters set, the differentiation is acceptable
> across all combinations.
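
For anyone wanting to reproduce the QoS settings listed above, here is
a minimal sketch that only builds the configuration string for
io.cost.qos; it does not write anything.  The "MAJ:MIN" device prefix
and the enable= and ctrl= keys are my assumption based on the cgroup-v2
documentation that comes with the "blkcg: implement blk-iocost" patch,
not something stated in this mail.  format_qos_line() and the 8:16
device number are hypothetical, for illustration only.

```python
def format_qos_line(dev, rpct, rlat, wpct, wlat, vmin, vmax):
    """Build one io.cost.qos configuration line (string only, no I/O).

    dev   -- "MAJ:MIN" of the block device (assumed syntax)
    rpct  -- read latency percentile, 0..100
    rlat  -- read latency threshold, same unit as in the results above
    wpct  -- write latency percentile, 0..100
    wlat  -- write latency threshold
    vmin  -- minimum scaling percentage
    vmax  -- maximum scaling percentage
    """
    for name, pct in (("rpct", rpct), ("wpct", wpct)):
        if not 0 <= pct <= 100:
            raise ValueError(f"{name} must be within [0, 100], got {pct}")
    if vmin > vmax:
        raise ValueError("min must not exceed max")
    # ctrl=user is assumed here so that the supplied values are kept
    # rather than being replaced by automatically selected ones.
    return (f"{dev} enable=1 ctrl=user "
            f"rpct={rpct:.2f} rlat={rlat} wpct={wpct:.2f} wlat={wlat} "
            f"min={vmin:.2f} max={vmax:.2f}")


if __name__ == "__main__":
    # Parameters of the 7200RPM SATA hard disk run; 8:16 is a made-up device.
    print(format_qos_line("8:16", 95, 40000, 95, 40000, 25, 200))
```

If I read the documentation right, the resulting line would then be
written to io.cost.qos in the root cgroup.
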
>
> The implementation comes with default cost model parameters which are
> selected automatically and should provide acceptable behavior across
> most common devices.  The parameters for hdd and consumer-grade SSDs
> seem pretty robust.  The default parameter set and selection criteria
> for high-end SSDs might need further adjustments.
>
> It is fairly easy to configure the QoS parameters and, if needed, cost
> model coefficients.  We'll follow up with tooling and further
> documentation.  Also, the last RFC patch in the series implements
> support for a bpf-based custom cost function.  Originally we thought
> that we'd need per-device-type cost functions but the simple linear
> model now seems good enough to cover all common device classes.  In
> case custom cost functions become necessary, we can fully develop the
> bpf based extension and also easily add different builtin cost models.
>
> Andy Newell did the heavy lifting of analyzing IO workloads and device
> characteristics, exploring various cost models, and determining the
> default model and parameters to use.
>
> Josef Bacik implemented a prototype which explored the use of
> different types of cost metrics including on-device time and Andy's
> linear model.
>
> This patchset is on top of the current block/for-next 53fc55c817c3
> ("Merge branch 'for-5.4/block' into for-next") and contains the
> following 10 patches.
>
>  0001-blkcg-pass-q-and-blkcg-into-blkcg_pol_alloc_pd_fn.patch
>  0002-blkcg-make-cpd_init_fn-optional.patch
>  0003-blkcg-separate-blkcg_conf_get_disk-out-of-blkg_conf_.patch
>  0004-block-rq_qos-add-rq_qos_merge.patch
>  0005-block-rq_qos-implement-rq_qos_ops-queue_depth_change.patch
>  0006-blkcg-s-RQ_QOS_CGROUP-RQ_QOS_LATENCY.patch
>  0007-blk-mq-add-optional-request-alloc_time_ns.patch
>  0008-blkcg-implement-blk-iocost.patch
>  0009-blkcg-add-tools-cgroup-iocost_monitor.py.patch
>  0010-blkcg-add-tools-cgroup-iocost_coef_gen.py.patch
>
> 0001-0007 are prep patches.
> 0008 implements blk-iocost.
> 0009 adds monitoring script.
> 0010 adds linear cost model coefficient generation script.
>
> The patchset is also available in the following git branch.
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-iow-v2
>
> diffstat follows, Thanks.
>
>  Documentation/admin-guide/cgroup-v2.rst |   97 +
>  block/Kconfig                           |   13
>  block/Makefile                          |    1
>  block/bfq-cgroup.c                      |    5
>  block/blk-cgroup.c                      |   71
>  block/blk-core.c                        |    4
>  block/blk-iocost.c                      | 2395 ++++++++++++++++++++++++++++++++
>  block/blk-iolatency.c                   |    8
>  block/blk-mq.c                          |   13
>  block/blk-rq-qos.c                      |   18
>  block/blk-rq-qos.h                      |   28
>  block/blk-settings.c                    |    2
>  block/blk-throttle.c                    |    6
>  block/blk-wbt.c                         |   18
>  block/blk-wbt.h                         |    4
>  include/linux/blk-cgroup.h              |    4
>  include/linux/blk_types.h               |    3
>  include/linux/blkdev.h                  |   13
>  include/trace/events/iocost.h           |  174 ++
>  tools/cgroup/iocost_coef_gen.py         |  178 ++
>  tools/cgroup/iocost_monitor.py          |  270 +++
>  21 files changed, 3272 insertions(+), 53 deletions(-)
>
> --
> tejun
>
> [1] http://lkml.kernel.org/r/20190614015620.1587672-1-tj@xxxxxxxxxx
> [2] http://lkml.kernel.org/r/20190710205128.1316483-1-tj@xxxxxxxxxx
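
P.S. The linear cost model and the hierarchical weights described in
the cover letter can be sketched roughly as below.  This is a
simplified illustration under my own assumptions, not the code in
block/blk-iocost.c: the coefficient values are made up, the class and
function names are hypothetical, and the sketch ignores any
distinction between reads and writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCostModel:
    """Toy linear IO cost model: base cost plus size-proportional cost."""
    seq_base: float   # base cost charged to a sequential IO (made-up unit)
    rand_base: float  # base cost charged to a random IO
    per_byte: float   # additional cost per byte transferred

    def cost(self, nbytes, sequential):
        """Return the cost of one IO of nbytes bytes."""
        base = self.seq_base if sequential else self.rand_base
        return base + self.per_byte * nbytes


def hierarchical_share(path):
    """Fraction of the device a cgroup is entitled to.

    path is a list of (own_weight, sum_of_sibling_weights_including_own)
    pairs, one per level, from the top of the hierarchy down to the cgroup.
    The share is the product of the per-level ratios.
    """
    share = 1.0
    for own, total in path:
        if own <= 0 or total < own:
            raise ValueError("weights must be positive and total >= own")
        share *= own / total
    return share


if __name__ == "__main__":
    model = LinearCostModel(seq_base=1.0, rand_base=10.0, per_byte=0.001)
    print("4k random IO cost:    ", model.cost(4096, sequential=False))
    print("4k sequential IO cost:", model.cost(4096, sequential=True))
    # The test setup in the cover letter: weight=500 against weight=100.
    print("high priority share:  ", hierarchical_share([(500, 600)]))
    print("low priority share:   ", hierarchical_share([(100, 600)]))
```

With the 500 vs 100 weights from the test matrix, the high priority
cgroup would be entitled to 500/600 of the total cost budget whenever
both sides are active.
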