Hi,

Currently we have two IO controllers: blk-throttling is bandwidth/IOPS based, CFQ is weight based. It would be great to have a unified IO controller for the two. And blk-mq doesn't support an IO scheduler, leaving blk-throttling as the only option for blk-mq. It's time to have a scalable IO controller that supports both bandwidth/weight based control and works with blk-mq.

blk-throttling is a good candidate: it works for both blk-mq and the legacy queue. It has a global lock, which is worrisome for scalability, but it's not terrible in practice. In my test, NVMe IOPS can reach 1M/s with all CPUs issuing IO, and enabling blk-throttle costs around 2~3% IOPS and 10% CPU utilization. I'd expect this isn't a big problem for today's workloads.

This patchset then tries to build a unified IO controller on top of blk-throttling. The idea is pretty simple: if we know the disk's total capability, we can divide that capability among cgroups according to their weight, and blk-throttling can account IO cost and use the calculated capability to throttle each cgroup. The problem is how to estimate the disk's total capability and the IO cost. We can use bandwidth (capability is bandwidth, IO cost is request size), IOPS (capability is IOPS, IO cost is 1), or anything better. There have been a lot of discussions about this topic, but we can't find the best option right now. Bandwidth/IOPS isn't optimal, but it's the only option at hand and should work in general. This patch set tries to create the framework and make bandwidth/IOPS based proportional throttling work; later we can add other options for capability/cost measurement. I'll focus on the bandwidth based approach here.

The first 9 patches demonstrate the ideas and should be pretty straightforward. The problem is that we don't know the max bandwidth a disk can provide for a specific workload; it depends on the device and the IO pattern. The bandwidth estimated by patch 1 will never be accurate unless the disk is already running at max bandwidth. To solve this, we always over-estimate the bandwidth. With an over-estimated bandwidth, the workload dispatches more IO, the estimated bandwidth becomes higher, and even more IO gets dispatched. The loop runs until we enter a stable state, in which the disk delivers its max bandwidth. This 'slightly adjust and run into a stable state' is the core algorithm of the patch series; we also use it to detect inactive cgroups. Over-estimating the bandwidth can introduce a fairness issue, because some cgroups may be able to use the extra bandwidth while others can't, and a cgroup using the extra bandwidth gets more share than expected. On the other hand, a smaller extra bandwidth means the disk reaches max bandwidth more slowly. We assign 1/8 extra bandwidth in the patches.

The tricky part is that a cgroup might not use its share fully, either because it is not willing to (for example, a fio job with the '--rate' option) or not able to (for example, a limited IO depth caps the bandwidth) dispatch enough IO. Let's look at some examples. Assume cg1 has 20% share and cg2 has 80% share.

1. disk bandwidth 200M/s, neither cgroup has a rate limit
   cg1 bps = 40M/s, cg2 bps = 160M/s

2. disk bandwidth 200M/s, cg2 has a 10M/s rate limit
   cg1 bps = 190M/s, cg2 bps = 10M/s
   Note cg1's bps isn't 10/4 = 2.5M/s; cg1 takes over the bandwidth cg2 can't use.

3. disk bandwidth 200M/s, cg2 has a 10M/s rate limit, cg1 has a 100M/s rate limit
   cg1 bps = 100M/s, cg2 bps = 10M/s

We must detect a cgroup which has a big share but can't use it; otherwise we can't drive the disk to max bandwidth. To solve this, if a cgroup doesn't use its share, we adjust its weight/share; a rough sketch of the idea follows.
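To make the direction of the algorithm easier to follow, here is a small C sketch of the capability split, the 1/8 over-estimate and the share shrink/restore. Everything here (the names, the 1/16 adjustment step, the exact comparisons) is made up purely for illustration; it is not the code in blk-throttle.c:

    /* Hypothetical sketch only -- not the actual blk-throttle.c code. */
    struct tg_sketch {
            unsigned int weight;            /* user configured weight */
            unsigned int share;             /* effective share; starts at weight, may be shrunk */
            unsigned long long bps_target;  /* throttle target for the next slice */
            unsigned long long bps_seen;    /* bandwidth measured in the last slice */
    };

    static void tg_update_share(struct tg_sketch *tg,
                                unsigned long long disk_bps,
                                unsigned int total_share)
    {
            /* split the estimated disk capability according to effective share */
            tg->bps_target = disk_bps * tg->share / total_share;
            /* over-estimate by 1/8 so the workload can push the disk higher */
            tg->bps_target += tg->bps_target >> 3;

            if (tg->bps_seen + (tg->bps_seen >> 3) < tg->bps_target) {
                    /* cgroup doesn't use its share, slightly shrink it */
                    tg->share -= tg->share >> 4;
            } else if (tg->share < tg->weight) {
                    /* it uses the shrunk share fully again, slowly restore it */
                    tg->share += tg->share >> 4;
                    if (tg->share > tg->weight)
                            tg->share = tg->weight;
            }
    }

The important property is only that each step is small, so repeated adjustments converge to a stable state instead of oscillating.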
With this adjustment, cg2 in example 2 will end up with about 5% share even though the user sets its share to 80%. Each adjustment is slight to avoid spikes, but we eventually reach a stable state. It's possible the adjustment is wrong; if a cgroup whose share was shrunk hits its bandwidth limit, we restore its share.

There is another tricky adjustment case:

4. disk bandwidth 100M/s, each cgroup can dispatch 80M/s
   cg1 bps = 20M/s, cg2 bps = 80M/s

This is the ideal state. cg1 uses its share fully, but since we assign 1/8 extra bandwidth, cg2 doesn't use its over-estimated share fully. The adjustment algorithm will shrink cg2's share, say by 5M/s; cg1 then gets 25M/s, and cg2 can only get 75M/s because the disk's max bandwidth is 100M/s. If this adjustment continued, both cg1 and cg2 would end up with 50M/s. To mitigate this, if a cgroup's bandwidth drops after its share is shrunk, we restore its share. This can be controversial: in the above case, when cg2 gets 75M/s, cg1 might get 40M/s, because the IO pattern changes and with it the max bandwidth. Somebody might think 40M/75M is better, but this patch series chooses 20M/80M.

We don't bias read/sync IO in the patch set yet. Ideally we should divide a cgroup's share between read and write IO and give read IO a bigger share. The problem is that some cgroups might only do reads or only writes, and a fixed read/write ratio would make such cgroups waste their share. This can be fixed if we introduce a sub service queue for the read and write IO of a cgroup in the future.

I have tested the patches in different setups. Test setup, scripts and results are uploaded at:

https://github.com/shligit/iocontroller-test

There are still some tests where we don't get optimal performance yet, and some calculations must be revised, but I think the code is good enough to demonstrate the idea and the different issues. Comments and benchmarks are warmly welcome!

-----------------------------------------------------------------

Shaohua Li (13):
  block: estimate disk performance
  blk-throttle: cleanup io cost related stuff
  blk-throttle: add abstract to index data
  blk-throttle: weight based throttling
  blk-throttling: detect inactive cgroup
  blk-throttle: add per-cgroup data
  blk-throttle: add interface for proporation based throttle
  blk-throttle: add cgroup2 interface
  blk-throttle: add trace for new proporation throttle
  blk-throttle: over estimate bandwidth
  blk-throttle: shrink cgroup share if its target is overestimated
  blk-throttle: restore shrinked cgroup share
  blk-throttle: detect wrong shrink

 block/blk-core.c           |   56 ++
 block/blk-sysfs.c          |   13 +
 block/blk-throttle.c       | 1217 ++++++++++++++++++++++++++++++++++++++------
 include/linux/blk-cgroup.h |   10 +
 include/linux/blkdev.h     |    7 +
 5 files changed, 1158 insertions(+), 145 deletions(-)

--
2.6.5