For high latency devices,we need to take into account the inflight IOs for calculating the disk utilization. For IO accounting in block layer currently, the io ticks are updated (added) with '1' at the start of the IO and delta (time difference between start and end of the IO) is added at the end of the IO. This causes a small issue for high latency devices. Multiple IOs can come in before the end of the previous IOs. Suppose, IO 'A' came in and before its completion a couple of IOs 'B', 'C', 'D' and 'E' came in. As per the current implementation, we add only 1 to the ioticks at IO arrival and the disk time 'stamp' is updated with current time ('now'). Now, just after 'E' IO 'A' completion arrived. For this completion of 'A', we update the ioticks with delta which is nothing but "A(end_time) - E(start_time('stamp'))". Eventhough the disk was busy in processing the IOs B,C,D also, we were missing the processing time of B,C,D IOs to add to ioticks. This was causing the %util to show the disk utilization incorrectly. This incorrect behavior was causing it impossible to drive the disk utilization to 100% even for heavy load. To fix this, we need to take into account the inflight IOs also for calculating the disk utilization. While updating the ioticks when IO arrives, check if there are any inflight IOs. If there are any inflight IOs, then add delta to the ioticks instead of 1. This may not be an issue for low latency devices as there will be very less difference between the start and end of an IO. Also, if we add this inflight IO check for IO accounting of low latency devices, then it will be an overhead and impact IO performance. So, this inflight IO check will be added only to high latency devices like HDDs and not to low latency devices like NVME. In the fix, this distinction is made based upon the number of hardware 'queues' supported by the disk. Usually HDDs support only 1 HW queue and NVME devices support multiple HW queues. To find if there are any inflight IOs, the fix will iterate through both the IO tags and scheduler tags of the hardware context (hctx) to see if they are occupied or not. If the tags taken are greater than 1 (first one, we take it for the current IO), then we say that there are in inflight IOs The following is the fio script used to test the fix with correct results: [global] bs=64 iodepth=120 direct=1 ioengine=libaio group_reporting time_based runtime=100 numjobs=1 name=raw-read rw=randread ignore_error=EIO:EIO [job1] filename=/dev/sdb Results without fix ------------------- Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz %util sda 4001.00 256064.00 0.00 0.00 26.20 64.00 69.00 lat (usec): min=1639, max=295085, avg=26327.48, stdev=60948.84 Result with fix --------------- Device r/s rkB/s %rrqm r_await rareq-sz %util sdb 3970.00 254080.00 0.00 26.32 64.00 100.00 lat (usec): min=1553, max=281217, avg=26319.47, stdev=60563.99 Changes V1->V2: 1. Created a new function blk_mq_hctx_has_tags() instead of using the existing function part_in_flight() 2. This new function will check if there are any inflight IOs and returns true if any inflight. Else returns false Signed-off-by: Gulam Mohamed <gulam.mohamed@xxxxxxxxxx> --- block/blk-core.c | 11 ++++++++++- block/blk-mq.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++- block/blk-mq.h | 1 + 3 files changed, 60 insertions(+), 2 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 9d51e9894ece..ea59f4f3cbdf 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -953,8 +953,17 @@ void update_io_ticks(struct block_device *part, unsigned long now, bool end) unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op, unsigned long start_time) { + bool inflight = false; + struct blk_mq_hw_ctx *hctx; + part_stat_lock(); - update_io_ticks(bdev, start_time, false); + + if (bdev->bd_queue->nr_hw_queues == 1) { + hctx = xa_load(&bdev->bd_queue->hctx_table, 0); + inflight = blk_mq_hctx_has_tags(hctx); + } + + update_io_ticks(bdev, start_time, inflight); part_stat_local_inc(bdev, in_flight[op_is_write(op)]); part_stat_unlock(); diff --git a/block/blk-mq.c b/block/blk-mq.c index 1fafd54dce3c..5748327de92d 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1000,6 +1000,9 @@ static inline void blk_account_io_done(struct request *req, u64 now) static inline void blk_account_io_start(struct request *req) { + bool inflight = false; + struct blk_mq_hw_ctx *hctx; + trace_block_io_start(req); if (blk_do_io_stat(req)) { @@ -1015,7 +1018,13 @@ static inline void blk_account_io_start(struct request *req) req->part = req->q->disk->part0; part_stat_lock(); - update_io_ticks(req->part, jiffies, false); + + if (req->q->nr_hw_queues == 1) { + hctx = xa_load(&req->q->hctx_table, 0); + inflight = blk_mq_hctx_has_tags(hctx); + } + + update_io_ticks(req->part, jiffies, inflight); part_stat_unlock(); } } @@ -3476,6 +3485,45 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx) return data.has_rq; } +struct tags_iter_data { + struct blk_mq_hw_ctx *hctx; + unsigned tags; + bool has_tags; +}; + +static bool blk_mq_hctx_has_tag(struct request *rq, void *data) +{ + struct tags_iter_data *iter_data = data; + + if (rq->mq_hctx != iter_data->hctx) + return true; + + iter_data->tags++; + + if (iter_data->tags > 1){ + iter_data->has_tags = true; + return false; + } + return true; +} + +bool blk_mq_hctx_has_tags(struct blk_mq_hw_ctx *hctx) +{ + struct tags_iter_data data = { + .hctx = hctx, + .has_tags = false, + .tags = 0, + }; + + if (hctx->sched_tags) + blk_mq_all_tag_iter(hctx->sched_tags, blk_mq_hctx_has_tag, &data); + + if (hctx->tags && !data.has_tags) + blk_mq_all_tag_iter(hctx->tags, blk_mq_hctx_has_tag, &data); + + return data.has_tags; +} + static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu, struct blk_mq_hw_ctx *hctx) { diff --git a/block/blk-mq.h b/block/blk-mq.h index 1743857e0b01..7707033dcaa2 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -240,6 +240,7 @@ unsigned int blk_mq_in_flight(struct request_queue *q, struct block_device *part); void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part, unsigned int inflight[2]); +bool blk_mq_hctx_has_tags(struct blk_mq_hw_ctx *hctx); static inline void blk_mq_put_dispatch_budget(struct request_queue *q, int budget_token) -- 2.39.3