From: Pavel Begunkov <asml.silence@xxxxxxxxx>

Sleeping for (mean / 2) in adaptive polling is often too pessimistic.
Use a variation of the 3-sigma rule (mean - 4 * lmd) and tune it at
runtime using the percentage of missed (i.e. overslept) requests:
1. if more than ~3% of requests are missed, fall back to (mean / 2)
2. if more than ~0.4% of requests are missed, scale the sleep time down

Pitfalls:
1. any missed request increases the mean, and a larger mean in turn
   increases the sleep time, so scale down fast in that case
2. even if the sleep time is predicted well, the sleep loop itself can
   greatly oversleep. Try to detect that and skip the miss accounting.

Tested on an NVMe SSD: {4K,8K} read-only workloads give a similar
latency distribution (up to 7 nines) and halve the CPU load
(50% -> 25%). The new method even slightly outperforms the old one
(in terms of throughput and latency), presumably because it alleviates
the 2nd pitfall. For write-only workloads it falls back to (mean / 2).

Signed-off-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
---
 block/blk-mq.c | 44 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 37 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ec7cde754c2f..efa44a617bea 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3338,10 +3338,21 @@ static void blk_mq_poll_stats_start(struct request_queue *q)
 	blk_stat_activate_msecs(q->poll_cb, 100);
 }
 
+/*
+ * Thresholds are ilog2(nr_requests / nr_misses)
+ * To calculate tolerated miss ratio from it, use
+ * f(x) ~= 2 ^ -(x + 1)
+ *
+ * fallback ~ 3.1%
+ * throttle ~ 0.4%
+ */
+#define BLK_POLL_FALLBACK_THRESHOLD 4
+#define BLK_POLL_THROTTLE_THRESHOLD 7
+
 static void blk_mq_update_poll_info(struct poll_info *pi,
 				    struct blk_rq_stat *stat)
 {
-	u64 sleep_ns;
+	u64 half_mean, indent, sleep_ns;
 	u32 nr_misses, nr_samples;
 
 	nr_samples = stat->nr_samples;
@@ -3349,14 +3360,33 @@ static void blk_mq_update_poll_info(struct poll_info *pi,
 	if (nr_misses > nr_samples)
 		nr_misses = nr_samples;
 
-	if (!nr_samples)
+	half_mean = (stat->mean + 1) / 2;
+	indent = stat->lmd * 4;
+
+	if (!stat->nr_samples) {
 		sleep_ns = 0;
-	else
-		sleep_ns = (stat->mean + 1) / 2;
+	} else if (!stat->lmd || stat->mean <= indent) {
+		sleep_ns = half_mean;
+	} else {
+		int ratio = INT_MAX;
 
-	/*
-	 * Use miss ratio here to adjust sleep time
-	 */
+		sleep_ns = stat->mean - indent;
+
+		/*
+		 * If a completion is overslept, the observable time will
+		 * be greater than the actual, so increasing mean. It
+		 * also increases sleep time estimation, synergistically
+		 * backfiring on mean. Need to scale down / fallback early.
+		 */
+		if (nr_misses)
+			ratio = ilog2(nr_samples / nr_misses);
+		if (ratio <= BLK_POLL_FALLBACK_THRESHOLD)
+			sleep_ns = half_mean;
+		else if (ratio <= BLK_POLL_THROTTLE_THRESHOLD)
+			sleep_ns -= sleep_ns / 4;
+
+		sleep_ns = max(sleep_ns, half_mean);
+	}
 
 	pi->stat = *stat;
 	pi->sleep_ns = sleep_ns;
-- 
2.21.0
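
For anyone who wants to play with the thresholds outside the kernel, the
selection logic above can be approximated in userspace. This is only a
sketch, not part of the patch: poll_sleep_ns() and ilog2_u32() are
hypothetical names, the mean, lmd and miss counters are passed in as
plain arguments instead of being read from struct blk_rq_stat, and
ilog2_u32() is a simple stand-in for the kernel's ilog2().

```c
#include <limits.h>
#include <stdint.h>

/* Same values as in the patch: thresholds on ilog2(nr_samples / nr_misses). */
#define FALLBACK_THRESHOLD 4	/* roughly: more than ~3.1% misses */
#define THROTTLE_THRESHOLD 7	/* roughly: more than ~0.4% misses */

/* Hypothetical stand-in for the kernel's ilog2(); v must be non-zero. */
static int ilog2_u32(uint32_t v)
{
	int log = 0;

	while (v >>= 1)
		log++;
	return log;
}

/* Hypothetical userspace mirror of the sleep-time selection. */
static uint64_t poll_sleep_ns(uint64_t mean, uint64_t lmd,
			      uint32_t nr_samples, uint32_t nr_misses)
{
	uint64_t half_mean = (mean + 1) / 2;
	uint64_t indent = lmd * 4;
	uint64_t sleep_ns;
	int ratio = INT_MAX;

	if (!nr_samples)
		return 0;
	if (nr_misses > nr_samples)
		nr_misses = nr_samples;

	/* No deviation data, or the indent would eat the whole mean. */
	if (!lmd || mean <= indent)
		return half_mean;

	sleep_ns = mean - indent;

	if (nr_misses)
		ratio = ilog2_u32(nr_samples / nr_misses);
	if (ratio <= FALLBACK_THRESHOLD)
		sleep_ns = half_mean;		/* too many misses */
	else if (ratio <= THROTTLE_THRESHOLD)
		sleep_ns -= sleep_ns / 4;	/* scale down by a quarter */

	/* Never sleep less than the old (mean / 2) estimate. */
	return sleep_ns > half_mean ? sleep_ns : half_mean;
}
```

With mean = 100000 ns, lmd = 5000 ns and 1000 samples, 0 misses give
80000 ns, 5 misses (0.5%) give 60000 ns, and 40 misses (4%) fall back
to 50000 ns.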