Hi,

Currently we have two IO controllers: blk-throttling is bandwidth/IOPS based, CFQ is weight based. It would be great to have a unified IO controller for the two. And blk-mq doesn't support an IO scheduler, leaving blk-throttling as the only option for blk-mq. It's time to have a scalable IO controller that supports both bandwidth/weight based control and works with blk-mq.

blk-throttling is a good candidate: it works for both blk-mq and the legacy queue. It has a global lock, which is worrisome for scalability, but it's not terrible in practice. In my test, NVMe IOPS can reach 1M/s with all CPUs issuing IO, and enabling blk-throttle costs around 2~3% IOPS and 10% CPU utilization. I'd expect this isn't a big problem for today's workloads.

This patchset then tries to build a unified IO controller on top of blk-throttling. The idea is pretty simple: if we know the disk's total capability, we can divide that capability among cgroups according to their weight, and blk-throttling can account IO cost and use the calculated capability to throttle each cgroup. The problem is how to estimate the disk's total capability and the IO cost. We can use bandwidth (capability is bandwidth, IO cost is request size), IOPS (capability is IOPS, IO cost is 1), or anything better. There have been a lot of discussions about this topic, but we can't find the best option right now. Bandwidth/IOPS isn't optimal, but it's the only option at hand and should work in general. This patch set tries to create the framework and make bandwidth/IOPS based proportional throttling work; later we can add other options for capability/cost measurement. I'll focus on the bandwidth based approach here.

The first 9 patches demonstrate the ideas and should be pretty straightforward. The problem is that we don't know the max bandwidth a disk can provide for a specific workload; it depends on the device and the IO pattern. The bandwidth estimated by patch 1 will never be accurate unless the disk is already running at max bandwidth. To solve this, we always over-estimate the bandwidth. With an over-estimated bandwidth, the workload dispatches more IO, the estimated bandwidth becomes higher, and even more IO gets dispatched. The loop runs until we enter a stable state, in which the disk delivers its max bandwidth. This 'slightly adjust and run into a stable state' is the core algorithm of the patch series; we also use it to detect inactive cgroups. Over-estimating the bandwidth can introduce a fairness issue, because some cgroups may be able to use the extra bandwidth while others can't, and a cgroup using the extra bandwidth gets more share than expected. On the other hand, a smaller extra bandwidth means the disk reaches max bandwidth more slowly. We assign 1/8 extra bandwidth in the patches.

The tricky part is that a cgroup might not use its share fully, either because it is not willing to (for example, a fio job with the '--rate' option) or not able to (for example, a limited IO depth caps the bandwidth) dispatch enough IO. Let's look at some examples. Assume cg1 has 20% share and cg2 has 80% share.

1. disk bandwidth 200M/s, neither cgroup has a rate limit
   cg1 bps = 40M/s, cg2 bps = 160M/s

2. disk bandwidth 200M/s, cg2 has a 10M/s rate limit
   cg1 bps = 190M/s, cg2 bps = 10M/s
   Note cg1's bps isn't 10/4 = 2.5M/s; cg1 takes over the bandwidth cg2 can't use.

3. disk bandwidth 200M/s, cg2 has a 10M/s rate limit, cg1 has a 100M/s rate limit
   cg1 bps = 100M/s, cg2 bps = 10M/s

We must detect a cgroup which has a big share but can't use it; otherwise we can't drive the disk to max bandwidth. To solve this, if a cgroup doesn't use its share, we adjust its weight/share; a rough sketch of the idea follows.
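To make the direction of the algorithm easier to follow, here is a small C sketch of the capability split, the 1/8 over-estimate and the share shrink/restore. Everything here (the names, the 1/16 adjustment step, the exact comparisons) is made up purely for illustration; it is not the code in blk-throttle.c:

    /* Hypothetical sketch only -- not the actual blk-throttle.c code. */
    struct tg_sketch {
            unsigned int weight;            /* user configured weight */
            unsigned int share;             /* effective share; starts at weight, may be shrunk */
            unsigned long long bps_target;  /* throttle target for the next slice */
            unsigned long long bps_seen;    /* bandwidth measured in the last slice */
    };

    static void tg_update_share(struct tg_sketch *tg,
                                unsigned long long disk_bps,
                                unsigned int total_share)
    {
            /* split the estimated disk capability according to effective share */
            tg->bps_target = disk_bps * tg->share / total_share;
            /* over-estimate by 1/8 so the workload can push the disk higher */
            tg->bps_target += tg->bps_target >> 3;

            if (tg->bps_seen + (tg->bps_seen >> 3) < tg->bps_target) {
                    /* cgroup doesn't use its share, slightly shrink it */
                    tg->share -= tg->share >> 4;
            } else if (tg->share < tg->weight) {
                    /* it uses the shrunk share fully again, slowly restore it */
                    tg->share += tg->share >> 4;
                    if (tg->share > tg->weight)
                            tg->share = tg->weight;
            }
    }

The important property is only that each step is small, so repeated adjustments converge to a stable state instead of oscillating.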
With this adjustment, cg2 in example 2 will end up with about 5% share even though the user sets its share to 80%. Each adjustment is slight to avoid spikes, but we eventually reach a stable state. It's possible the adjustment is wrong; if a cgroup whose share was shrunk hits its bandwidth limit, we restore its share.

There is another tricky adjustment case:

4. disk bandwidth 100M/s, each cgroup can dispatch 80M/s
   cg1 bps = 20M/s, cg2 bps = 80M/s

This is the ideal state. cg1 uses its share fully, but since we assign 1/8 extra bandwidth, cg2 doesn't use its over-estimated share fully. The adjustment algorithm will shrink cg2's share, say by 5M/s; cg1 then gets 25M/s, and cg2 can only get 75M/s because the disk's max bandwidth is 100M/s. If this adjustment continued, both cg1 and cg2 would end up with 50M/s. To mitigate this, if a cgroup's bandwidth drops after its share is shrunk, we restore its share. This can be controversial: in the above case, when cg2 gets 75M/s, cg1 might get 40M/s, because the IO pattern changes and with it the max bandwidth. Somebody might think 40M/75M is better, but this patch series chooses 20M/80M.

We don't bias read/sync IO in the patch set yet. Ideally we should divide a cgroup's share between read and write IO and give read IO a bigger share. The problem is that some cgroups might only do reads or only writes, and a fixed read/write ratio would make such cgroups waste their share. This can be fixed if we introduce a sub service queue for the read and write IO of a cgroup in the future.

I have tested the patches in different setups. Test setup, scripts and results are uploaded at:

https://github.com/shligit/iocontroller-test

There are still some tests where we don't get optimal performance yet, and some calculations must be revised, but I think the code is good enough to demonstrate the idea and the different issues. Comments and benchmarks are warmly welcome!

-----------------------------------------------------------------

Shaohua Li (13):
  block: estimate disk performance
  blk-throttle: cleanup io cost related stuff
  blk-throttle: add abstract to index data
  blk-throttle: weight based throttling
  blk-throttling: detect inactive cgroup
  blk-throttle: add per-cgroup data
  blk-throttle: add interface for proporation based throttle
  blk-throttle: add cgroup2 interface
  blk-throttle: add trace for new proporation throttle
  blk-throttle: over estimate bandwidth
  blk-throttle: shrink cgroup share if its target is overestimated
  blk-throttle: restore shrinked cgroup share
  blk-throttle: detect wrong shrink

 block/blk-core.c           |   56 ++
 block/blk-sysfs.c          |   13 +
 block/blk-throttle.c       | 1217 ++++++++++++++++++++++++++++++++++++++------
 include/linux/blk-cgroup.h |   10 +
 include/linux/blkdev.h     |    7 +
 5 files changed, 1158 insertions(+), 145 deletions(-)

--
2.6.5