Problem Desc
============
The "iostat" user-space utility was showing %util as 100% for disks
whose IO latencies are below a millisecond, i.e. latencies in the
range of microseconds and below.

Root Cause
==========
IO accounting in the block layer currently updates io_ticks in
jiffies, which has millisecond granularity. As a result, for devices
with IO latencies below a millisecond, the latency is accounted as
1 millisecond even though it is in the microsecond range. This caused
the iostat command to show %util as 100%, which is incorrect.

Recreating the issue
====================
Setup
-----
Devices: 24 NVMe devices
Model number: 4610 (Intel)

fio
---
[global]
bs=4K
iodepth=1
direct=1
ioengine=libaio
group_reporting
time_based
runtime=100
thinktime=1ms
numjobs=1
name=raw-write
rw=randrw
ignore_error=EIO:EIO
[job1]
filename=/dev/nvme0n1

iostat o/p
----------
Device   %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
nvme3n1   0.00     0.05     0.00   75.38      0.50      0.00   0.00  100.00

Device   %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
nvme3n1   0.00     0.05     0.00   74.74      0.50      0.00   0.00  100.00

Solution
========
Use ktime_get_ns() to update the disk_stats io_ticks so that io_ticks
are updated with nanosecond resolution at every IO start and end.

Issues using ktime
==================
Using ktime_get_ns() has a performance overhead, since ktime reads the
clock every time it is called. Jens Axboe ran t/io_uring in the
following test environment and observed a large performance drop:

Devices
-------
SSDs: P5800X
No of devices: 24

io_uring config
---------------
Polled IO
iostats: Enabled
Reads: Random
Block Size: 512
QDepth: 128
Batch Submit: 32
Batch Complete: 32
No of Threads: 24

With the above environment and the ktime patch, performance dropped by
~25% relative to iostats disabled, and by ~19% relative to the current
iostats-enabled mode. This performance drop is too high.
Suggestion from Jens Axboe
==========================
Jens Axboe suggested splitting the bigger patch into two:
1. In the first patch, change all the types from unsigned long to u64;
   that can be done while retaining jiffies.
2. In the second patch, add an iostats == 2 setting, which enables the
   higher resolution mode using ktime. We'd still default to 1, the
   lower granularity iostats-enabled mode.

Fix details
===========
1. Use ktime_get_ns() to get the current nanoseconds when updating
   io_ticks for start and end time stamps in block layer io accounting.
2. Create a new setting '2' in sysfs for the iostats variable, i.e.
   /sys/block/<device-name>/queue/iostats, to enable io accounting
   with nanosecond (ktime) granularity. This setting should be enabled
   only when high resolution iostats are needed, as it carries a high
   performance cost.
3. The previously available settings were 0 and 1, to disable and
   enable io accounting with millisecond (jiffies) granularity.

Testing
=======
Ran the t/io_uring command with the following setup:

Devices
-------
SSDs: P4610
No of devices: 8

io_uring config
---------------
Polled IO
iostats: Enabled
Reads: Random
Block Size: 512
QDepth: 128
Batch Submit: 32
Batch Complete: 32
No of Threads: 24

io_uring o/p
------------
iostat=0, with patch: Maximum IOPS=10.09M
iostat=1, with patch: Maximum IOPS=9.84M
iostat=2, with patch: Maximum IOPS=9.48M

Changes from V2 to V3
=====================
1. Changed all the required variables' data-type to u64 as a first patch
2. Created a new setting '2' for iostats in sysfs in this patch
3. Changed the code to get the ktime values when iostat=2, in this patch

Signed-off-by: Gulam Mohamed <gulam.mohamed@xxxxxxxxxx>
---
 block/blk-core.c                  | 26 +++++++++++++++++----
 block/blk-mq.c                    |  4 ++--
 block/blk-sysfs.c                 | 39 ++++++++++++++++++++++++++++++-
 block/genhd.c                     | 18 ++++++++++----
 drivers/block/drbd/drbd_debugfs.c | 12 ++++++++--
 drivers/block/zram/zram_drv.c     |  3 ++-
 drivers/md/dm.c                   | 13 +++++++++--
 include/linux/blkdev.h            |  4 ++++
 8 files changed, 103 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5670032fe932..0b5e4eb909a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -927,6 +927,18 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
 }
 EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
 
+/*
+ * Get the time based upon the available granularity for io accounting.
+ * If the resolution mode is set to precise (2), i.e.
+ * /sys/block/<device>/queue/iostats = 2, then this will return time
+ * in nano-seconds, else it will return time in jiffies.
+ */
+u64 blk_get_iostat_ticks(struct request_queue *q)
+{
+	return (blk_queue_precise_io_stat(q) ?
+		ktime_get_ns() : jiffies);
+}
+EXPORT_SYMBOL_GPL(blk_get_iostat_ticks);
 void update_io_ticks(struct block_device *part, u64 now, bool end)
 {
 	u64 stamp;
@@ -968,20 +980,26 @@ EXPORT_SYMBOL(bdev_start_io_acct);
 u64 bio_start_io_acct(struct bio *bio)
 {
 	return bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio),
-				  bio_op(bio), jiffies);
+				  bio_op(bio),
+				  blk_get_iostat_ticks(bio->bi_bdev->bd_queue));
 }
 EXPORT_SYMBOL_GPL(bio_start_io_acct);
 
 void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
 		      u64 start_time)
 {
+	u64 now;
+	u64 duration;
+	struct request_queue *q = bdev_get_queue(bdev);
 	const int sgrp = op_stat_group(op);
-	u64 now = READ_ONCE(jiffies);
-	u64 duration = (unsigned long)now -(unsigned long) start_time;
 
+	now = blk_get_iostat_ticks(q);
+	duration = now - start_time;
 	part_stat_lock();
 	update_io_ticks(bdev, now, true);
-	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
+	part_stat_add(bdev, nsecs[sgrp],
+		      (blk_queue_precise_io_stat(q) ? duration :
+		       jiffies_to_nsecs(duration)));
 	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
 	part_stat_unlock();
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e6b3ccd4989..5318bf87f17e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -975,7 +975,7 @@ static void __blk_account_io_done(struct request *req, u64 now)
 		const int sgrp = op_stat_group(req_op(req));
 
 		part_stat_lock();
-		update_io_ticks(req->part, jiffies, true);
+		update_io_ticks(req->part, blk_get_iostat_ticks(req->q), true);
 		part_stat_inc(req->part, ios[sgrp]);
 		part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
 		part_stat_unlock();
@@ -1007,7 +1007,7 @@ static void __blk_account_io_start(struct request *rq)
 			rq->part = rq->q->disk->part0;
 
 		part_stat_lock();
-		update_io_ticks(rq->part, jiffies, false);
+		update_io_ticks(rq->part, blk_get_iostat_ticks(rq->q), false);
 		part_stat_unlock();
 	}
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 93d9e9c9a6ea..e7959782ce59 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -307,7 +307,6 @@ queue_##name##_store(struct request_queue *q, const char *page, size_t count) \
 
 QUEUE_SYSFS_BIT_FNS(nonrot, NONROT, 1);
 QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
-QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 QUEUE_SYSFS_BIT_FNS(stable_writes, STABLE_WRITES, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
@@ -363,6 +362,44 @@ static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_iostats_show(struct request_queue *q, char *page)
+{
+	bool iostat = blk_queue_io_stat(q);
+	bool precise_iostat = blk_queue_precise_io_stat(q);
+
+	return queue_var_show(iostat << precise_iostat, page);
+}
+
+static ssize_t queue_iostats_store(struct request_queue *q, const char *page,
+				   size_t count)
+{
+	unsigned long val;
+	ssize_t ret = -EINVAL;
+
+	ret = queue_var_store(&val, page, count);
+
+	if (ret < 0)
+		return ret;
+
+	if (val == 2) {
+		blk_queue_flag_set(QUEUE_FLAG_IO_STAT, q);
+		blk_queue_flag_set(QUEUE_FLAG_PRECISE_IO_STAT, q);
+		q->disk->part0->bd_stamp = 0;
+	}
+	else if (val == 1) {
+		blk_queue_flag_set(QUEUE_FLAG_IO_STAT, q);
+		blk_queue_flag_clear(QUEUE_FLAG_PRECISE_IO_STAT, q);
+		q->disk->part0->bd_stamp = 0;
+	}
+	else if (val == 0) {
+		blk_queue_flag_clear(QUEUE_FLAG_IO_STAT, q);
+		blk_queue_flag_clear(QUEUE_FLAG_PRECISE_IO_STAT, q);
+		q->disk->part0->bd_stamp = 0;
+	}
+
+	return ret;
+}
+
 static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 {
 	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
diff --git a/block/genhd.c b/block/genhd.c
index 03a96d6473e1..d034219a4683 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -951,6 +951,7 @@ ssize_t part_stat_show(struct device *dev,
 	struct request_queue *q = bdev_get_queue(bdev);
 	struct disk_stats stat;
 	unsigned int inflight;
+	u64 stat_ioticks;
 
 	if (queue_is_mq(q))
 		inflight = blk_mq_in_flight(q, bdev);
@@ -959,10 +960,13 @@ ssize_t part_stat_show(struct device *dev,
 
 	if (inflight) {
 		part_stat_lock();
-		update_io_ticks(bdev, jiffies, true);
+		update_io_ticks(bdev, blk_get_iostat_ticks(q), true);
 		part_stat_unlock();
 	}
 	part_stat_read_all(bdev, &stat);
+	stat_ioticks = (blk_queue_precise_io_stat(q) ?
+			div_u64(stat.io_ticks, NSEC_PER_MSEC) :
+			jiffies_to_msecs(stat.io_ticks));
 	return sprintf(buf,
 		"%8lu %8lu %8llu %8u "
 		"%8lu %8lu %8llu %8u "
@@ -979,7 +983,7 @@ ssize_t part_stat_show(struct device *dev,
 		(unsigned long long)stat.sectors[STAT_WRITE],
 		(unsigned int)div_u64(stat.nsecs[STAT_WRITE], NSEC_PER_MSEC),
 		inflight,
-		jiffies_to_msecs(stat.io_ticks),
+		(unsigned int)stat_ioticks,
 		(unsigned int)div_u64(stat.nsecs[STAT_READ] +
				stat.nsecs[STAT_WRITE] +
				stat.nsecs[STAT_DISCARD] +
@@ -1217,6 +1221,8 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 	unsigned int inflight;
 	struct disk_stats stat;
 	unsigned long idx;
+	struct request_queue *q;
+	u64 stat_ioticks;
 
 	/*
 	if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
@@ -1235,12 +1241,16 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 		else
 			inflight = part_in_flight(hd);
 
+		q = bdev_get_queue(hd);
 		if (inflight) {
 			part_stat_lock();
-			update_io_ticks(hd, jiffies, true);
+			update_io_ticks(hd, blk_get_iostat_ticks(q), true);
 			part_stat_unlock();
 		}
 		part_stat_read_all(hd, &stat);
+		stat_ioticks = (blk_queue_precise_io_stat(q) ?
+				div_u64(stat.io_ticks, NSEC_PER_MSEC) :
+				jiffies_to_msecs(stat.io_ticks));
 		seq_printf(seqf, "%4d %7d %pg "
 			   "%lu %lu %lu %u "
 			   "%lu %lu %lu %u "
@@ -1260,7 +1270,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 			   (unsigned int)div_u64(stat.nsecs[STAT_WRITE],
						 NSEC_PER_MSEC),
 			   inflight,
-			   jiffies_to_msecs(stat.io_ticks),
+			   (unsigned int)stat_ioticks,
 			   (unsigned int)div_u64(stat.nsecs[STAT_READ] +
						 stat.nsecs[STAT_WRITE] +
						 stat.nsecs[STAT_DISCARD] +
diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c
index a72c096aa5b1..af193bcc4f3a 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -97,6 +97,12 @@ static void seq_print_one_request(struct seq_file *m, struct drbd_request *req,
 {
 	/* change anything here, fixup header below! */
 	unsigned int s = req->rq_state;
+	unsigned long start_time;
+	struct request_queue *q = req->device->rq_queue;
+
+	start_time = (blk_queue_precise_io_stat(q) ?
+		      nsecs_to_jiffies(req->start_jif) :
+		      req->start_jif);
 
 #define RQ_HDR_1 "epoch\tsector\tsize\trw"
 	seq_printf(m, "0x%x\t%llu\t%u\t%s",
@@ -105,7 +111,7 @@ static void seq_print_one_request(struct seq_file *m, struct drbd_request *req,
 		   (s & RQ_WRITE) ? "W" : "R");
 
 #define RQ_HDR_2 "\tstart\tin AL\tsubmit"
-	seq_printf(m, "\t%d", jiffies_to_msecs(now - req->start_jif));
+	seq_printf(m, "\t%d", jiffies_to_msecs(now - start_time));
 	seq_print_age_or_dash(m, s & RQ_IN_ACT_LOG, now - req->in_actlog_jif);
 	seq_print_age_or_dash(m, s & RQ_LOCAL_PENDING, now - req->pre_submit_jif);
 
@@ -171,7 +177,9 @@ static void seq_print_waiting_for_AL(struct seq_file *m, struct drbd_resource *r
 		/* if the oldest request does not wait for the activity log
 		 * it is not interesting for us here */
 		if (req && !(req->rq_state & RQ_IN_ACT_LOG))
-			jif = req->start_jif;
+			jif = (blk_queue_precise_io_stat(device->rq_queue) ?
+			       nsecs_to_jiffies(req->start_jif) :
+			       req->start_jif);
 		else
 			req = NULL;
 		spin_unlock_irq(&device->resource->req_lock);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index da28eb83e6ed..8e22487de7de 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1663,6 +1663,7 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	struct zram *zram;
 	struct bio_vec bv;
 	u64 start_time;
+	struct request_queue *q = bdev_get_queue(bdev);
 
 	if (PageTransHuge(page))
 		return -ENOTSUPP;
@@ -1682,7 +1683,7 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	bv.bv_offset = 0;
 
 	start_time = bdev_start_io_acct(bdev->bd_disk->part0,
-			SECTORS_PER_PAGE, op, jiffies);
+			SECTORS_PER_PAGE, op, blk_get_iostat_ticks(q));
 	ret = zram_bvec_rw(zram, &bv, index, offset, op, NULL);
 	bdev_end_io_acct(bdev->bd_disk->part0, op, start_time);
 out:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 011a85ea40da..1bb58a0b8cd1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -482,7 +482,11 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
 
 u64 dm_start_time_ns_from_clone(struct bio *bio)
 {
-	return jiffies_to_nsecs(clone_to_tio(bio)->io->start_time);
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+	u64 start_time_ns = (blk_queue_precise_io_stat(q) ?
+			     clone_to_tio(bio)->io->start_time :
+			     jiffies_to_nsecs(clone_to_tio(bio)->io->start_time));
+	return start_time_ns;
 }
 EXPORT_SYMBOL_GPL(dm_start_time_ns_from_clone);
 
@@ -498,6 +502,11 @@ static void dm_io_acct(struct dm_io *io, bool end)
 	struct mapped_device *md = io->md;
 	struct bio *bio = io->orig_bio;
 	unsigned int sectors;
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	u64 start_time = (blk_queue_precise_io_stat(q) ?
+			  nsecs_to_jiffies(io->start_time) :
+			  io->start_time);
 
 	/*
 	 * If REQ_PREFLUSH set, don't account payload, it will be
@@ -589,7 +598,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	io->orig_bio = bio;
 	io->md = md;
 	spin_lock_init(&io->lock);
-	io->start_time = jiffies;
+	io->start_time = blk_get_iostat_ticks(bio->bi_bdev->bd_queue);
 	io->flags = 0;
 
 	if (static_branch_unlikely(&stats_enabled))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ca94d690d292..0543a536c6e5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -570,6 +570,7 @@ struct request_queue {
 #define QUEUE_FLAG_NOWAIT	29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED	30	/* single queue style io dispatch */
 #define QUEUE_FLAG_SKIP_TAGSET_QUIESCE	31 /* quiesce_tagset skip the queue*/
+#define QUEUE_FLAG_PRECISE_IO_STAT	32 /* To enable precise io accounting */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1UL << QUEUE_FLAG_IO_STAT) |		\
				 (1UL << QUEUE_FLAG_SAME_COMP) |	\
@@ -578,6 +579,7 @@ struct request_queue {
 void blk_queue_flag_set(unsigned int flag, struct request_queue *q);
 void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
 bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
+u64 blk_get_iostat_ticks(struct request_queue *q);
 
 #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
 #define blk_queue_dying(q)	test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)
@@ -589,6 +591,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_stable_writes(q) \
	test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags)
 #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
+#define blk_queue_precise_io_stat(q) \
+	test_bit(QUEUE_FLAG_PRECISE_IO_STAT, &(q)->queue_flags)
 #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
 #define blk_queue_zone_resetall(q)	\
	test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
-- 
2.31.1
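
For reviewers trying the patch, a usage sketch of the proposed sysfs
knob follows (the device name is an example, and the '2' setting only
exists with this series applied):

```shell
# read the current accounting mode (0=disabled, 1=jiffies, 2=ktime)
cat /sys/block/nvme0n1/queue/iostats

# enable high-resolution (ktime) io accounting; note the performance cost
echo 2 > /sys/block/nvme0n1/queue/iostats

# revert to the default millisecond-granularity (jiffies) accounting
echo 1 > /sys/block/nvme0n1/queue/iostats
```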