Hi,

This patch set adds low/high limits to the blk-throttle cgroup. The interface is io.low and io.high.

The low limit implements best-effort bandwidth/iops protection. If one cgroup doesn't reach its low limit, no other cgroup can use more bandwidth/iops than its own low limit. A cgroup without a low limit is not protected. If there is a cgroup with a low limit that hasn't reached it yet, cgroups without a low limit will be throttled to very low bandwidth/iops.

The high limit implements best-effort limitation. A cgroup with a high limit can use more than its high limit bandwidth/iops if all cgroups use at least their high limit bandwidth/iops. If one cgroup is below its high limit, no cgroup can use more bandwidth/iops than its high limit. If some cgroups have a high limit and others don't, the cgroups without a high limit will use their max limit as their high limit.

The disk queue has a state machine with 3 states: LIMIT_LOW, LIMIT_HIGH and LIMIT_MAX. In each state, we throttle cgroups according to the limit for that state: the low limit in LIMIT_LOW, the high limit in LIMIT_HIGH and the max limit in LIMIT_MAX. When a condition is met, the queue can upgrade to a higher state or downgrade to a lower state. For example, if the queue is in LIMIT_LOW and all cgroups reach their low limit, the queue will be upgraded to LIMIT_HIGH. In another example, if the queue is in LIMIT_MAX but one cgroup is below its high limit, the queue will be downgraded to LIMIT_HIGH. If no cgroup has a limit for a specific state, that state is invalid, and we skip invalid states when upgrading/downgrading. Initially the queue state is LIMIT_MAX until some cgroup gets a low/high limit set, so this maintains backward compatibility for users with only max limits set.

If upgrade/downgrade happened only according to the limits, we would have a performance issue.
For example, if one cgroup has a low limit set but never dispatches enough IO to reach it, the queue state will remain LIMIT_LOW. Other cgroups will be throttled and overall disk utilization will be low. To solve this issue, if a cgroup is below its limit for a long time, we treat the cgroup as idle and ignore its corresponding limit in the upgrade/downgrade logic.

The idle-based upgrade could introduce a dilemma though, since we downgrade if a cgroup is below its limit (e.g. idle). For example, if a cgroup is below its low limit for a long time, the queue is upgraded to the HIGH state. The cgroup continues to be below its low limit, so the queue is downgraded to the LOW state. In this example, the queue keeps switching between LOW and HIGH.

The key to avoiding unnecessary state switching is to detect whether a cgroup is truly idle, which unfortunately is a hard problem. There are two kinds of idle. In the first, the cgroup intentionally doesn't dispatch enough IO (real idle). In this case, we should upgrade quickly and not downgrade. In the second, other cgroups dispatch too much IO and use all the bandwidth, so the cgroup can't dispatch enough IO and looks idle (fake idle). In this case, we should downgrade quickly and never upgrade.

Distinguishing the two kinds of idle is impossible for a high queue depth disk as far as I can tell. This patch set doesn't try to detect idle precisely. Instead we record the history of upgrades. If the queue upgraded because cgroups hit their limits, a future downgrade is likely caused by fake idle, hence future upgrades should run slowly and future downgrades should run quickly. Otherwise a future downgrade is likely caused by real idle, hence future upgrades should run quickly and future downgrades should run slowly. The adaptive upgrade/downgrade time means that disk downgrade in real idle happens rarely and disk upgrade in fake idle happens rarely. This doesn't avoid repeated state switching though. Please see patch 6 for details.
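
To make the upgrade/downgrade rules above concrete, here is a rough standalone sketch in userspace C. It is not the code from the patches: only the state names LIMIT_LOW, LIMIT_HIGH and LIMIT_MAX come from this description, while struct cg_sketch, next_state() and the helpers are hypothetical names. It also assumes that a limit of 0 means "not configured" and that LIMIT_MAX is always a valid state, and it leaves out the adaptive upgrade/downgrade timing and the fallback of a missing high limit to the max limit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* State names follow the cover letter; everything else is hypothetical. */
enum limit_state { LIMIT_LOW, LIMIT_HIGH, LIMIT_MAX, LIMIT_CNT };

struct cg_sketch {
	uint64_t limit[LIMIT_CNT];	/* per-state limit, 0 = not configured (assumption) */
	uint64_t rate;			/* bandwidth the cgroup currently gets */
	bool idle;			/* below its limit for a long time */
};

/* A state is valid if at least one cgroup has a limit configured for it. */
static bool state_valid(const struct cg_sketch *cgs, size_t n,
			enum limit_state s)
{
	size_t i;

	if (s == LIMIT_MAX)
		return true;	/* assumption: max is always usable */
	for (i = 0; i < n; i++)
		if (cgs[i].limit[s])
			return true;
	return false;
}

/* True if some non-idle cgroup with a limit for state s is below it. */
static bool someone_below(const struct cg_sketch *cgs, size_t n,
			  enum limit_state s)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (cgs[i].limit[s] && !cgs[i].idle &&
		    cgs[i].rate < cgs[i].limit[s])
			return true;
	return false;
}

/* One evaluation step of the queue state machine. */
static enum limit_state next_state(const struct cg_sketch *cgs, size_t n,
				   enum limit_state cur)
{
	int s;

	/* Downgrade: check the nearest valid lower state, skipping invalid ones. */
	for (s = (int)cur - 1; s >= LIMIT_LOW; s--) {
		if (!state_valid(cgs, n, (enum limit_state)s))
			continue;
		if (someone_below(cgs, n, (enum limit_state)s))
			return (enum limit_state)s;
		break;
	}

	/* Upgrade: every non-idle cgroup reached its limit for this state. */
	if (cur != LIMIT_MAX && !someone_below(cgs, n, cur)) {
		for (s = (int)cur + 1; s < LIMIT_MAX; s++)
			if (state_valid(cgs, n, (enum limit_state)s))
				return (enum limit_state)s;
		return LIMIT_MAX;
	}
	return cur;
}
```

In this sketch the downgrade check runs before the upgrade check, so a cgroup that misses its protection always wins over a possible upgrade. That ordering is a choice made for the illustration, not something the patches are claimed to do.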

Users must set the limits carefully. Improper settings could be ignored. For example, say the disk's max bandwidth is 100M/s, one cgroup has a low limit of 60M/s and the other 50M/s. When the first cgroup runs at 60M/s, there is only 40M/s of bandwidth remaining. The second cgroup will never reach 50M/s, so it will be treated as idle and its limit will effectively be ignored.

Comments and benchmarks are welcome!

Thanks,
Shaohua

Shaohua Li (10):
  block-throttle: prepare support multiple limits
  block-throttle: add .low interface
  block-throttle: configure bps/iops limit for cgroup in low limit
  block-throttle: add upgrade logic for LIMIT_LOW state
  block-throttle: add downgrade logic
  block-throttle: idle detection
  block-throttle: add .high interface
  block-throttle: handle high limit
  blk-throttle: make sure expire time isn't too big
  blk-throttle: add trace log

 block/blk-throttle.c | 813 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 764 insertions(+), 49 deletions(-)

-- 
2.8.0.rc2