________________________________________ From: Jayaramappa, Srilakshmi Sent: Friday, April 22, 2022 5:02 PM To: axboe@xxxxxxxxx; snitzer@xxxxxxxxxx; linux-block@xxxxxxxxxxxxxxx Cc: Hunt, Joshua Subject: Precise disk statistics Hi, We install LTS kernel on our machines. While moving from 4.19.x to 5.4.x we noticed a performance drop in one of our applications. We tracked down the root cause to the commit series for removing the pending IO accounting (80a787ba3809 to 6f75723190d8) which includes 5b18b5a73760 block: delete part_round_stats and switch to less precise counting. The application (which runs on non-dm machines) tracks disk utilization to estimate the load it can further take on. After the commits in question, we see an over reporting of disk utilization [1] compared to the older method of reporting based on inflight counter [2] for the same load. The over-reporting is observed in v5.4.190 and in v5.15.35 as well. I've attached the config file used to build the kernel. We understand that the disk util% does not provide a true picture of how much more work the device is capable of doing in flash based devices and we are planning to use a different model to observe the performance potential. In the interim we are having to revert the above commit series to bring back the original reporting method. In the hopes of getting back our application's performance with a new change on top of the 5.4.x reporting (as opposed to reverting commits), I tried checking if the request queue is busy before updating io_ticks [3]. With this change the applications's throughput is closer to what we observe with the commits reverted, but still behind by ~ 6 %. Though, I am not sure that this change is safe overall. I'd appreciate your expert opinion on this matter. Could you please let us know if there is some other idea we could explore to report precise disk stats that we can build on top of existing reporting in the kernel and submit a patch, or if going back to using the inflight counters is indeed our best bet. Thank you -Sri [1] root@xxx:~# DISK=nvme3n1; dd if=/dev/$DISK of=/dev/null bs=1048576 iflag=direct count=2048 & iostat -yxm /dev/$DISK 1 1 ; wait ... 2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.721532 s, 3.0 GB/s avg-cpu: %user %nice %system %iowait %steal %idle 0.13 0.00 0.28 1.53 0.00 98.07 Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz aqu-sz %util nvme3n1 16383.00 2047.88 0.00 0.00 0.21 128.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 72.20 [2] root@xxx:~# DISK=nvme3n1; dd if=/dev/$DISK of=/dev/null bs=1048576 iflag=direct count=2048 & iostat -yxm /dev/$DISK 1 1 ; wait ... 2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.702101 s, 3.1 GB/s avg-cpu: %user %nice %system %iowait %steal %idle 0.03 0.00 0.18 1.57 0.00 98.22 Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz aqu-sz %util nvme3n1 16380.00 2047.50 0.00 0.00 0.20 128.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 64.20 [3] diff --git a/block/bio.c b/block/bio.c index cb38d6f3acce..8275b10a1c9a 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1754,14 +1754,17 @@ void bio_check_pages_dirty(struct bio *bio) schedule_work(&bio_dirty_work); } -void update_io_ticks(struct hd_struct *part, unsigned long now, bool end) +void update_io_ticks(struct request_queue *q, struct hd_struct *part, unsigned long now, bool end) { unsigned long stamp; again: stamp = READ_ONCE(part->stamp); if (unlikely(stamp != now)) { if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) { + if (blk_mq_queue_inflight(q)) { __part_stat_add(part, io_ticks, end ? now - stamp : 1); + } } } if (part->partno) { Sorry, resending without the config attachment since my original email bounced from linux-block. Thanks -Sri