Hi Peter,

> On Apr 8, 2019, at 2:45 PM, Song Liu <songliubraving@xxxxxx> wrote:
>
> Servers running latency sensitive workloads usually aren't fully loaded for
> various reasons including disaster readiness. The machines running our
> interactive workloads (referred to as the main workload) have a lot of spare
> CPU cycles that we would like to use for opportunistic side jobs like video
> encoding. However, our experiments show that the side workload has a strong
> impact on the latency of the main workload:
>
>   side-job   main-load-level   main-avg-latency
>   none       1.0               1.00
>   none       1.1               1.10
>   none       1.2               1.10
>   none       1.3               1.10
>   none       1.4               1.15
>   none       1.5               1.24
>   none       1.6               1.74
>
>   ffmpeg     1.0               1.82
>   ffmpeg     1.1               2.74
>
> Note: both the main-load-level and the main-avg-latency numbers are
> _normalized_.
>
> In these experiments, ffmpeg is put in a cgroup with cpu.weight of 1
> (lowest priority). However, it consumes all idle CPU cycles in the
> system and causes high latency for the main workload. Further experiments
> and analysis (more details below) show that, for the main workload to meet
> its latency targets, it is necessary to limit the CPU usage of the side
> workload so that there is some _idle_ CPU. There are various reasons
> behind the need for idle CPU time. First, shared CPU resource saturation
> starts to happen way before time-measured utilization reaches 100%.
> Second, scheduling latency starts to impact the main workload as CPU
> reaches full utilization.
>
> Currently, the cpu controller provides two mechanisms to protect the main
> workload: cpu.weight and cpu.max. However, neither of them is sufficient
> in these use cases. As shown in the experiments above, a side workload with
> cpu.weight of 1 (lowest priority) would still consume all idle CPU and add
> unacceptable latency to the main workload. cpu.max can throttle the CPU
> usage of the side workload and preserve some idle CPU. However, cpu.max
> cannot react to changes in load levels.
> For example, when the main
> workload uses 40% of CPU, cpu.max of 30% for the side workload would yield
> good latencies for the main workload. However, when the workload
> experiences higher load levels and uses more CPU, the same setting (cpu.max
> of 30%) would cause the interactive workload to miss its latency target.
>
> These experiments demonstrated the need for a mechanism to effectively
> throttle CPU usage of the side workload and preserve idle CPU cycles.
> The mechanism should be able to adjust the level of throttling based on
> the load level of the main workload.
>
> This patchset introduces a new knob for the cpu controller: cpu.headroom.
> The cgroup of the main workload uses cpu.headroom to ensure that the side
> workload uses only limited CPU cycles. For example, if the main workload has
> a cpu.headroom of 30%, the side workload will be throttled to leave 30% of
> overall CPU idle. If the main workload uses more than 70% of CPU, the side
> workload will only run with configurable minimal cycles. These configurable
> minimal cycles are referred to as the "tolerance" of the main workload.
>
> The following is a detailed example:
>
>   main/cpu.headroom   main-cpu-load   low-pri-cpu-cycle   idle-cpu
>   30%                 30%             40%                 30%
>   30%                 40%             30%                 30%
>   30%                 50%             20%                 30%
>   30%                 60%             10%                 30%
>   30%                 70%             minimal             ~30%
>   30%                 80%             minimal             ~20%
>
> In the example, we use a constant cpu.headroom setting of 30%. As the main
> job experiences different levels of load, the cpu controller adjusts the CPU
> cycles used by the low-pri jobs.
>
> We experimented with a web server as the main workload and ffmpeg as the
> side workload. The following table compares latency impact on the main
> workload under different cpu.headroom settings and load levels. In all
> tests, the side workload cgroup is configured with cpu.weight of 1. When
> throttled, the side workload can only run 1ms per 100ms period.
>
>                                average-latency
>   main-load-level   w/o-side   w/-side-      w/-side-       w/-side-
>                                no-headroom   30%-headroom   20%-headroom
>   1.0               1.00       1.82          1.26           1.14
>   1.1               1.10       2.74          1.26           1.32
>   1.2               1.10                     1.29           1.38
>   1.3               1.10                     1.32           1.49
>   1.4               1.15                     1.29           1.85
>   1.5               1.24                     1.32
>   1.6               1.74                     1.50
>
> Each row of the table shows a normalized load level and average latencies
> for 4 scenarios: w/o side workload; w/ side workload but no headroom; w/
> side workload and 30% headroom; w/ side workload and 20% headroom.
>
> When there is no side workload, average latency of main job falls in the
> 0.7x range, except the very high load scenarios. When there is side
> workload but no headroom, latency of the main job goes very high at
> moderate load levels. With 30% headroom, the average latency falls in the
> 0.8x range. With 20% headroom, the average latency falls in the 0.9x to
> 1.x range. We didn't finish tests in some cases with high load, because
> the latency is too high.
>
> This experiment demonstrated cpu.headroom is an effective and efficient
> knob to control the latency of the main job.
>
> Thanks!

Could you please kindly share your feedback and comments on this work?

Thanks and Regards,
Song

> Song Liu (7):
>   sched: refactor tg_set_cfs_bandwidth()
>   cgroup: introduce hook css_has_tasks_changed
>   cgroup: introduce cgroup_parse_percentage
>   sched, cgroup: add entry cpu.headroom
>   sched/fair: global idleness counter for cpu.headroom
>   sched/fair: throttle task runtime based on cpu.headroom
>   Documentation: cgroup-v2: add information for cpu.headroom
>
>  Documentation/admin-guide/cgroup-v2.rst |  18 +
>  fs/proc/stat.c                          |   4 +-
>  include/linux/cgroup-defs.h             |   2 +
>  include/linux/cgroup.h                  |   1 +
>  include/linux/kernel_stat.h             |   2 +
>  kernel/cgroup/cgroup.c                  |  51 +++
>  kernel/sched/core.c                     | 425 ++++++++++++++++++++++--
>  kernel/sched/fair.c                     | 143 +++++++-
>  kernel/sched/sched.h                    |  30 ++
>  9 files changed, 634 insertions(+), 42 deletions(-)
>
> --
> 2.17.1
>
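
For anyone skimming the thread, the arithmetic in the quoted cpu.headroom example table can be written out as a toy model. This is only a sketch of the percentages in the cover letter, not the kernel implementation in the patches. The function names are hypothetical, and the default tolerance of 1% is an assumption taken from the "1ms per 100ms period" figure quoted above.

```python
# Toy model of the cpu.headroom arithmetic from the cover letter.
# All names here are hypothetical; this is not the kernel code.

def side_cpu_budget(headroom_pct, main_load_pct, tolerance_pct=1.0):
    """CPU share (percent) left for the side workload under cpu.headroom.

    The side workload gets whatever remains after the main load and the
    requested headroom, but never less than the tolerance.
    tolerance_pct=1.0 is an assumption (1ms per 100ms period).
    """
    budget = 100.0 - headroom_pct - main_load_pct
    return max(budget, tolerance_pct)


def idle_with_headroom(headroom_pct, main_load_pct, tolerance_pct=1.0):
    """Idle CPU (percent) when the side workload uses its full budget."""
    side = side_cpu_budget(headroom_pct, main_load_pct, tolerance_pct)
    return max(100.0 - main_load_pct - side, 0.0)


def idle_with_static_cap(cap_pct, main_load_pct):
    """Idle CPU (percent) with a fixed cpu.max-style cap on the side workload.

    Assumes the side workload always consumes its whole cap.
    """
    return max(100.0 - main_load_pct - cap_pct, 0.0)


if __name__ == "__main__":
    for load in (30, 40, 50, 60, 70, 80):
        print(
            f"main={load}%"
            f"  side={side_cpu_budget(30, load):.0f}%"
            f"  idle(headroom=30%)={idle_with_headroom(30, load):.0f}%"
            f"  idle(cap=30%)={idle_with_static_cap(30, load):.0f}%"
        )
```

With a fixed 30% cap the modeled idle share shrinks as the main load grows, whereas with a 30% headroom it stays at 30% until the main load reaches 70%, which is the contrast between cpu.max and cpu.headroom described in the cover letter.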