Re: [PATCHSET v3 block/for-linus] IO cost model based work-conserving proportional controller

Paolo Valente <paolo.valente@xxxxxxxxxx> · Thu, 29 Aug 2019 17:54:38 +0200

Hi,
I see an important interface problem.  Userspace has been waiting for
io.weight to eventually become the file name for setting the weight
for the proportional-share policy [1,2].  If you use that name, how
will we solve this?

Thanks,
Paolo

[1] https://github.com/systemd/systemd/issues/7057#issuecomment-522747575
[2] https://github.com/systemd/systemd/pull/13335#issuecomment-523035303

> On 29 Aug 2019, at 00:05, Tejun Heo <tj@xxxxxxxxxx> wrote:
> 
> Changes from v2[2]:
> 
> * Fixed a divide-by-zero bug in current_hweight().
> 
> * pre_start_time and friends were renamed to alloc_time and now have
>  their own CONFIG option, which is selected by IOCOST.
> 
> Changes from v1[1]:
> 
> * Prerequisite patchsets had cosmetic changes and were merged.
>  Refreshed on top.
> 
> * Renamed from ioweight to iocost.  All source code and tools are
>  updated accordingly.  Control knobs io.weight.qos and
>  io.weight.cost_model are renamed to io.cost.qos and io.cost.model
>  respectively.  This is a more fitting name which won't become a
>  misnomer when, for example, cost based io.max is added.
> 
> * Various bug fixes and improvements.  A few bugs were discovered
>  while testing against high-iops nvme device.  Auto parameter
>  selection improved and verified across different classes of SSDs.
> 
> * Dropped bpf iocost support for now.
> 
> * Added coef generation script.
> 
> * Verified on high-iops nvme device.  Result is included below.
> 
> One challenge of controlling IO resources is the lack of a trivially
> observable cost metric.  This is distinguished from CPU and memory
> where wallclock time and the number of bytes can serve as accurate
> enough approximations.
> 
> Bandwidth and iops are the most commonly used metrics for IO devices
> but depending on the type and specifics of the device, different IO
> patterns easily lead to multiple orders of magnitude variations,
> rendering them useless for the purpose of IO capacity distribution.
> While on-device time, with a lot of crutches, could serve as a useful
> approximation for non-queued rotational devices, this is no longer
> viable with modern devices, even the rotational ones.
> 
> While there is no cost metric we can trivially observe, it isn't a
> complete mystery.  For example, on a rotational device, seek cost
> dominates while a contiguous transfer contributes a smaller amount
> proportional to the size.  If we can characterize at least the
> relative costs of these different types of IOs, it should be possible
> to implement a reasonable work-conserving proportional IO resource
> distribution.
> 
> This patchset implements an IO cost model based work-conserving
> proportional controller.  It currently has a simple linear cost model
> built in, where each IO is classified as sequential or random and
> given a base cost accordingly, with an additional size-proportional
> cost added on top.  Each IO is given a cost based on the model and the
> controller issues IOs for each cgroup according to its hierarchical
> weight.
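
To make the linear model above concrete, here is a minimal sketch in
Python.  It is not the code in block/blk-iocost.c: the class name, the
coefficient names, the 4 KiB accounting unit and all numbers are
assumptions made up for illustration.  It only shows the shape
described above, a base cost picked by sequential vs. random plus a
size-proportional term.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed accounting unit for the size-proportional term


@dataclass(frozen=True)
class LinearCostModel:
    """Hypothetical coefficients, in arbitrary cost units."""
    seq_base: float   # base cost of a sequential IO
    rand_base: float  # base cost of a random IO
    per_page: float   # additional cost per 4 KiB page transferred

    def cost(self, nbytes: int, sequential: bool) -> float:
        """Base cost chosen by IO type plus a cost proportional to size."""
        base = self.seq_base if sequential else self.rand_base
        pages = -(-nbytes // PAGE_SIZE)  # ceiling division
        return base + self.per_page * pages


# Made-up numbers for a rotational disk, where the seek dominates.
hdd = LinearCostModel(seq_base=1.0, rand_base=100.0, per_page=0.5)
seq_cost = hdd.cost(4096, sequential=True)
rand_cost = hdd.cost(4096, sequential=False)
```

With these invented coefficients a 4 KiB random IO costs far more than
a 4 KiB sequential one, which is the kind of relative difference the
model is meant to capture.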
> 
> By default, the controller adapts its overall IO rate so that it
> doesn't build up buffer bloat in the request_queue layer, which
> guarantees that the controller doesn't lose a significant amount of
> total work.  However, this may not provide sufficient differentiation
> as the underlying device may have a deep queue and not be fair in how
> the queued IOs are serviced.  The controller provides extra QoS
> control knobs which allow tightening the control feedback loop as
> necessary.
> 
> For more details on the control mechanism, implementation and
> interface, please refer to the comment at the top of
> block/blk-iocost.c and Documentation/admin-guide/cgroup-v2.rst changes
> in the "blkcg: implement blk-iocost" patch.
> 
> Here are some test results.  Each test run goes through the following
> combinations with each combination running for a minute.  All tests
> are performed against regular files on btrfs w/ deadline as the IO
> scheduler.  Random IOs are direct w/ queue depth of 64.  Sequential
> IOs are normal buffered IOs.
> 
>        high priority (weight=500)      low priority (weight=100)
> 
>        Rand read                       None
>        ditto                           Rand read
>        ditto                           Seq  read
>        ditto                           Rand write
>        ditto                           Seq  write
>        Seq  read                       None
>        ditto                           Rand read
>        ditto                           Seq  read
>        ditto                           Rand write
>        ditto                           Seq  write
>        Rand write                      None
>        ditto                           Rand read
>        ditto                           Seq  read
>        ditto                           Rand write
>        ditto                           Seq  write
>        Seq  write                      None
>        ditto                           Rand read
>        ditto                           Seq  read
>        ditto                           Rand write
>        ditto                           Seq  write
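
As a worked example of what weights 500 and 100 mean when both groups
are active, the sketch below computes hierarchical shares.  It assumes
the usual cgroup v2 convention that a group's share is its weight
divided by the sum of its active siblings' weights, multiplied down
the tree.  The function name and the dict-based tree are hypothetical,
and this is not the kernel's current_hweight().

```python
def hierarchical_shares(tree, weights, root="root"):
    """Return each cgroup's fraction of the device.

    tree:    maps a parent name to the list of its active children
    weights: maps a cgroup name to its configured weight
    """
    shares = {root: 1.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        children = tree.get(parent, [])
        total = sum(weights[c] for c in children)
        for child in children:
            # Guard against a zero total so we never divide by zero.
            frac = weights[child] / total if total else 0.0
            shares[child] = shares[parent] * frac
            stack.append(child)
    return shares


# The two groups used in the test matrix above.
shares = hierarchical_shares(
    {"root": ["high", "low"]},
    {"high": 500, "low": 100},
)
```

Here "high" ends up with 5/6 of the device and "low" with 1/6.  In the
rows where the low priority side is "None", the high priority group is
the only active one, and a work-conserving controller lets it use the
whole device.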
> 
> * 7200RPM SATA hard disk
>  * No IO control
>    https://photos.app.goo.gl/1KBHn7ykpC1LXRkB8
>  * iocost, QoS: None
>    https://photos.app.goo.gl/MLNQGxCtBQ8wAmjm7
>  * iocost, QoS: rpct=95.00 rlat=40000 wpct=95.00 wlat=40000 min=25.00 max=200.00
>    https://photos.app.goo.gl/XqXHm3Mkbm9w6Db46
> * NCQ-blacklisted SATA SSD (QD==1)
>  * No IO control
>    https://photos.app.goo.gl/wCTXeu2uJ6LYL4pk8
>  * iocost, QoS: None
>    https://photos.app.goo.gl/T2HedKD2sywQgj7R9
>  * iocost, QoS: rpct=95.00 rlat=20000 wpct=95.00 wlat=20000 min=50.00 max=200.00
>    https://photos.app.goo.gl/urBTV8XQc1UqPJJw7
> * SATA SSD (QD==32)
>  * No IO control
>    https://photos.app.goo.gl/TjEVykuVudSQcryh6
>  * iocost, QoS: None
>    https://photos.app.goo.gl/iyQBsky7bmM54Xiq7
>  * iocost, QoS: rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=400.00
>    https://photos.app.goo.gl/q1a6URLDxPLMrnHy5
> * NVME SSD (ran with 8 concurrent fio jobs to achieve saturation)
>  * No IO control
>    https://photos.app.goo.gl/S6xjEVTJzcfb3w1j7
>  * iocost, QoS: None
>    https://photos.app.goo.gl/SjQUUotJBAGr7vqz7
>  * iocost, QoS: rpct=95.00 rlat=5000 wpct=95.00 wlat=5000 min=1.00 max=10000.00
>    https://photos.app.goo.gl/RsaYBd2muX7CegoN7
> 
> Even without explicit QoS configuration, read-heavy scenarios can
> obtain acceptable differentiation.  However, in write-heavy
> scenarios, the deep buffering on the device side makes it difficult
> to maintain control.  With QoS parameters set, the differentiation is
> acceptable across all combinations.
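
For comparing the QoS settings listed with the results above, a small
helper that turns one of those strings into a dict can be handy.  This
is only a sketch: parse_qos() is a hypothetical name, it handles just
the key=value tokens exactly as quoted in this mail, and it makes no
claim about the full syntax of the io.cost.qos file.

```python
def parse_qos(params: str) -> dict:
    """Split 'key=value' tokens into a dict of floats."""
    result = {}
    for token in params.split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        result[key] = float(value)
    return result


# The settings quoted above for the 7200RPM disk and the NVME SSD.
hdd_qos = parse_qos(
    "rpct=95.00 rlat=40000 wpct=95.00 wlat=40000 min=25.00 max=200.00"
)
nvme_qos = parse_qos(
    "rpct=95.00 rlat=5000 wpct=95.00 wlat=5000 min=1.00 max=10000.00"
)
```

Going by the names, rpct/rlat and wpct/wlat look like a latency
percentile and a latency target for reads and writes, and min/max
look like bounds on how far the controller may scale its overall
rate.  The authoritative definitions are in the cgroup-v2.rst changes
of the "blkcg: implement blk-iocost" patch.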
> 
> The implementation comes with default cost model parameters, which
> are selected automatically and should provide acceptable behavior
> across most common devices.  The parameters for hdd and
> consumer-grade SSDs seem pretty robust.  The default parameter set
> and selection criteria for high-end SSDs might need further
> adjustments.
> 
> It is fairly easy to configure the QoS parameters and, if needed, cost
> model coefficients.  We'll follow up with tooling and further
> documentation.  Also, the last RFC patch in the series implements
> support for a bpf-based custom cost function.  Originally we thought
> that we'd need per-device-type cost functions but the simple linear
> model now seems good enough to cover all common device classes.  In
> case custom cost functions become necessary, we can fully develop the
> bpf based extension and also easily add different builtin cost models.
> 
> Andy Newell did the heavy lifting of analyzing IO workloads and device
> characteristics, exploring various cost models, and determining the
> default model and parameters to use.
> 
> Josef Bacik implemented a prototype which explored the use of
> different types of cost metrics including on-device time and Andy's
> linear model.
> 
> This patchset is on top of the current block/for-next 53fc55c817c3
> ("Merge branch 'for-5.4/block' into for-next") and contains the
> following 10 patches.
> 
> 0001-blkcg-pass-q-and-blkcg-into-blkcg_pol_alloc_pd_fn.patch
> 0002-blkcg-make-cpd_init_fn-optional.patch
> 0003-blkcg-separate-blkcg_conf_get_disk-out-of-blkg_conf_.patch
> 0004-block-rq_qos-add-rq_qos_merge.patch
> 0005-block-rq_qos-implement-rq_qos_ops-queue_depth_change.patch
> 0006-blkcg-s-RQ_QOS_CGROUP-RQ_QOS_LATENCY.patch
> 0007-blk-mq-add-optional-request-alloc_time_ns.patch
> 0008-blkcg-implement-blk-iocost.patch
> 0009-blkcg-add-tools-cgroup-iocost_monitor.py.patch
> 0010-blkcg-add-tools-cgroup-iocost_coef_gen.py.patch
> 
> 0001-0007 are prep patches.
> 0008 implements blk-iocost.
> 0009 adds monitoring script.
> 0010 adds linear cost model coefficient generation script.
> 
> The patchset is also available in the following git branch.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-iow-v2
> 
> diffstat follows.  Thanks.
> 
> Documentation/admin-guide/cgroup-v2.rst |   97 +
> block/Kconfig                           |   13 
> block/Makefile                          |    1 
> block/bfq-cgroup.c                      |    5 
> block/blk-cgroup.c                      |   71 
> block/blk-core.c                        |    4 
> block/blk-iocost.c                      | 2395 ++++++++++++++++++++++++++++++++
> block/blk-iolatency.c                   |    8 
> block/blk-mq.c                          |   13 
> block/blk-rq-qos.c                      |   18 
> block/blk-rq-qos.h                      |   28 
> block/blk-settings.c                    |    2 
> block/blk-throttle.c                    |    6 
> block/blk-wbt.c                         |   18 
> block/blk-wbt.h                         |    4 
> include/linux/blk-cgroup.h              |    4 
> include/linux/blk_types.h               |    3 
> include/linux/blkdev.h                  |   13 
> include/trace/events/iocost.h           |  174 ++
> tools/cgroup/iocost_coef_gen.py         |  178 ++
> tools/cgroup/iocost_monitor.py          |  270 +++
> 21 files changed, 3272 insertions(+), 53 deletions(-)
> 
> --
> tejun
> 
> [1] http://lkml.kernel.org/r/20190614015620.1587672-1-tj@xxxxxxxxxx
> [2] http://lkml.kernel.org/r/20190710205128.1316483-1-tj@xxxxxxxxxx
>