From: Chunguang Xu <brookxu@xxxxxxxxxxx>

Any suggestions or discussion are welcome, thank you very much.

BACKGROUND:

In container scenarios, in addition to throughput, we also care about
the QoS of each group. Based on hierarchical scheduling, EMQ, IO
injection, bfq.weight and other mechanisms, we can achieve better IO
isolation, better throughput, and better avoidance of priority
inversion. However, there is still room for optimization.

OPTIMIZATION:

We try to make bfq work better in container scenarios.

1. Introduce bfq.ioprio

Tasks in a production environment can be roughly divided into three
categories: emergency, ordinary and offline. Emergency tasks, such as
system agents, need to be scheduled in real time. Offline tasks, such
as background tasks, do not need QoS guarantees, but can improve
resource utilization while the system is idle. At present, we can use
weights to simulate IO preemption, but since weights are more of a
share concept, they cannot simulate preemption well. By using an ioprio
class for each group, we can solve the above problems more easily. In
this way, in hierarchical scheduling, we can ensure that RT and IDLE
groups are scheduled correctly. In addition, we also introduce an
ioprio for each group, so that a weight value can be derived from the
ioprio class and ioprio. In scenarios where only simple weights are
needed, we can achieve IO preemption and weight isolation through
bfq.ioprio alone.

After the introduction of bfq.ioprio, in order to better adapt to
actual container environments, we use the task ioprio class when
scheduling within a group, but the group ioprio class outside of the
group. For example, when counting bfqd->busy_queues[], tasks from a
CLASS_IDLE group are always regarded as CLASS_IDLE, and the ioprio
class of the task is ignored.

2. Introduce better_fairness mode

Better QoS control comes at the cost of throughput, which is not
suitable for all scenarios. For this reason, we added a switch called
better_fairness.
After better_fairness is enabled, the following restrictions apply:

Guarantee group QoS:
1. A cooperator queue can only come from the same group and the same
   class.
2. A waker queue can only come from the same group and the same class.
3. An inject queue can only come from the same group and the same
   class.

Guarantee RT task QoS:
1. async_queue cannot inject into an RT queue.
2. Traverse the upper scheduling domains to determine whether
   in_service_queue needs to be preempted.
3. If in_service_queue is marked prio_expire, disable idling.

Better buffered IO control:
1. Except for CLASS_IDLE queues, all queues are allowed to idle by
   default.

INTERFACE:

The bfq.ioprio interface is now available for cgroup v1 and cgroup v2.
Users can configure the ioprio of a cgroup through this interface, as
shown below:

    echo "1 2" > blkio.bfq.ioprio

The two values represent the ioprio class and the ioprio of the cgroup,
respectively.

EXPERIMENT:

The test process is as follows:

    # prepare data disk
    mount /dev/sdb /data1

    # prepare IO scheduler
    echo bfq > /sys/block/sdb/queue/scheduler
    echo 0 > /sys/block/sdb/queue/iosched/low_latency
    echo 1 > /sys/block/sdb/queue/iosched/better_fairness

It is worth noting that nr_requests limits the number of requests and
is not priority-aware. If nr_requests is too small, it may cause
serious priority inversion. Therefore, nr_requests can be increased
according to the actual situation.
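The nr_requests adjustment mentioned above could look roughly like the
sketch below. It only raises /sys/block/<dev>/queue/nr_requests when the
current value is below a target. The helper name raise_nr_requests, the
SYSFS_ROOT override (useful for dry runs) and the value 256 are
illustrative assumptions, not part of this series; a suitable value
depends on the device and workload.

```shell
#!/bin/sh
# Sketch only: raise nr_requests for a block device if it is below a target.
# raise_nr_requests and SYSFS_ROOT are hypothetical names for illustration.

raise_nr_requests() {
    dev=$1
    target=$2
    attr="${SYSFS_ROOT:-/sys}/block/$dev/queue/nr_requests"

    # Read the current request limit of the queue.
    current=$(cat "$attr") || return 1

    # Only write when the current limit is smaller than the target.
    if [ "$current" -lt "$target" ]; then
        echo "$target" > "$attr" || return 1
    fi

    # Print the value that is in effect after the (possible) write.
    cat "$attr"
}

# Example (as root, 256 is an assumed value):
#   raise_nr_requests sdb 256
```

The function prints the resulting value, so the effective limit can be
checked before starting the test.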
    # create cgroup v1 hierarchy
    cd /sys/fs/cgroup/blkio
    mkdir rt be0 be1 be2 idle

    # prepare cgroup
    echo "1 0" > rt/blkio.bfq.ioprio
    echo "2 0" > be0/blkio.bfq.ioprio
    echo "2 4" > be1/blkio.bfq.ioprio
    echo "2 7" > be2/blkio.bfq.ioprio
    echo "3 0" > idle/blkio.bfq.ioprio

    # run fio test
    fio fio.ini

    # generate svg graph
    fio_generate_plots res

The contents of fio.ini are as follows:

    [global]
    ioengine=libaio
    group_reporting=1
    log_avg_msec=3000
    direct=1
    time_based=1
    iodepth=16
    size=100M
    rw=write
    bs=1M

    [rt]
    name=rt
    write_bw_log=rt
    write_lat_log=rt
    write_iops_log=rt
    filename=/data1/rt.bin
    cgroup=rt
    runtime=30s
    nice=-10

    [be0]
    name=be0
    write_bw_log=be0
    write_lat_log=be0
    write_iops_log=be0
    filename=/data1/be0.bin
    cgroup=be0
    runtime=60s

    [be1]
    name=be1
    write_bw_log=be1
    write_lat_log=be1
    write_iops_log=be1
    filename=/data1/be1.bin
    cgroup=be1
    runtime=60s

    [be2]
    name=be2
    write_bw_log=be2
    write_lat_log=be2
    write_iops_log=be2
    filename=/data1/be2.bin
    cgroup=be2
    runtime=60s

    [idle]
    name=idle
    write_bw_log=idle
    write_lat_log=idle
    write_iops_log=idle
    filename=/data1/idle.bin
    cgroup=idle
    runtime=90s

V3:
1. Introduce prio_expire for bfqq.
2. Introduce better_fairness mode.
3. Optimize the processing of task ioprio and group ioprio.
4. Optimize some small points.

V2:
1. Optimize bfq_select_next_class().
2. Introduce bfq_group [] to track the number of groups for each CLASS.
3. Optimize IO injection, EMQ and the idle mechanism for CLASS_RT.
Chunguang Xu (14):
  bfq: introduce bfq_entity_to_bfqg helper method
  bfq: convert the type of bfq_group.bfqd to bfq_data*
  bfq: introduce bfq.ioprio for cgroup
  bfq: introduce bfq_ioprio_class to get ioprio class
  bfq: limit the IO depth of CLASS_IDLE to 1
  bfq: keep the minimun bandwidth for CLASS_BE
  bfq: introduce better_fairness for container scene
  bfq: introduce prio_expire flag for bfq_queue
  bfq: expire in_serv_queue for prio_expire under better_fairness
  bfq: optimize IO injection under better_fairness
  bfq: disable idle for prio_expire under better_fairness
  bfq: disable merging between different groups under better_fairness
  bfq: remove unnecessary initialization logic
  bfq: optimize the calculation of bfq_weight_to_ioprio()

 block/bfq-cgroup.c  |  99 ++++++++++++++++++++++++++---
 block/bfq-iosched.c | 119 +++++++++++++++++++++++++++++++---
 block/bfq-iosched.h |  36 +++++++++--
 block/bfq-wf2q.c    | 180 ++++++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 376 insertions(+), 58 deletions(-)

--
1.8.3.1