Adaptive IRQ moderation (also called adaptive IRQ coalescing) has been widely used in the networking stack for over 20 years and has become a standard default setting. IRQ moderation is a feature supported by the device to delay an interrupt for either a period of time or a number of completions (packets for networking devices) in order to optimize the effective throughput by mitigating the cost of handling interrupts, which can be expensive at the rates of modern networking and/or storage devices.

The basic concept of adaptive moderation is to provide time- and packet-based control of when the device generates an interrupt in an adaptive fashion, attempting to identify the current workload based on statistics gathered online instead of a predefined configuration. When done correctly, adaptive moderation can win the best of both worlds: low latency under light load and high throughput under heavy load.

This RFC patchset introduces a generic library that provides the mechanics of online statistics monitoring and decision making for the consumer, which simply needs to program the actual device, while still keeping room for device specific tuning.

In the networking stack, each device driver implements adaptive IRQ moderation on its own. The approach here is a bit different: it tries to take the common denominator, which is per-queue statistics gathering and workload change identification (basically deciding whether the moderation scheme needs to change). The library is targeted at multi-queue devices, but should work on single-queue devices as well; however, I'm not sure that these devices will need something like interrupt moderation.

The model used in the proposed implementation requires the consumer (a.k.a. the device driver) to initialize an irq adaptive moderation context (a.k.a. irq_am) per its desired context, which will most likely be a completion queue. The moderator is initialized with a set of am (adaptive moderation) levels, which are essentially the abstraction of the device specific moderation parameters.
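To make the model concrete, here is a rough userspace sketch of what a consumer-side level table and context init could look like. All names here are mine for illustration; the actual structures and functions live in the patches (include/linux/irq-am.h) and may differ.

```c
/* Hypothetical sketch of the consumer-facing model described above;
 * the real irq-am API in the patches may look different. */
#include <assert.h>
#include <stddef.h>

struct am_level {
	unsigned short usecs;	/* max time to delay the interrupt */
	unsigned short comps;	/* max completions to aggregate */
};

/* Sorted in increasing order: index 0 is the latency optimum,
 * the last index is the throughput optimum. */
static const struct am_level example_levels[] = {
	{  1,   1 },	/* latency optimum: fire almost immediately */
	{  8,  16 },
	{ 16,  32 },
	{ 32,  64 },
	{ 64, 128 },	/* throughput optimum: batch aggressively */
};

struct irq_am_ctx {
	const struct am_level *levels;
	unsigned int nr_levels;
	unsigned int cur_level;
	unsigned int nr_events;	/* stats window before a tuning decision */
};

static void irq_am_ctx_init(struct irq_am_ctx *am,
			    const struct am_level *levels,
			    unsigned int nr_levels,
			    unsigned int start_level,
			    unsigned int nr_events)
{
	am->levels = levels;
	am->nr_levels = nr_levels;
	/* clamp an out-of-range start level back to the latency optimum */
	am->cur_level = start_level < nr_levels ? start_level : 0;
	am->nr_events = nr_events;
}
```

The (usecs, comps) values above are made up; real consumers would pick pairs matching their device's coalescing granularity.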
Usually the different levels will map to different pairs of (time <usecs>, completion-count), which are left specific to the consumer. The moderator assumes that the am levels are sorted in increasing order, where the lowest level corresponds to the optimum latency tuning (short time and low completion-count), gradually increasing towards the optimum throughput tuning (longer time and higher completion-count). So there is a trend and tuning direction tracked by the moderator.

When the moderator collects sufficient statistics (also controlled by the consumer via nr_events), it compares the current stats with the previous stats, and if a significant change in the load was observed, the moderator attempts to increment/decrement its current level (called a step) and schedules a program dispatch work.

The main reason why this implementation is different from the common networking device implementations (and kept separate) is that, in my mind at least, network devices are different animals than other I/O devices in the sense that:
(a) network devices rely heavily on the byte count of raw ethernet frames for adaptive moderation, while in storage or I/O the byte count is often a result of a submission/completion transaction and sometimes even known only to the application on top of the infrastructure (as in the rdma case).
(b) performance characteristics and expectations differ in representative workloads.
(c) network devices collect all sorts of stats for different functionalities (where adaptive moderation is only one use-case), and I'm not at all sure that a subset of those stats could easily migrate to a different context.
Having said that, if sufficient reasoning comes along, I could be convinced otherwise, should unification of the networking and I/O implementations be desired.

Additionally, I hooked two consumers into this framework:
1. irq-poll - interrupt polling library (used by various rdma consumers and can be extended to others)
2. rdma cq - generic rdma cq polling abstraction library

With this, both RDMA initiator mode consumers and RDMA target mode consumers are able to utilize the framework (nvme, iser, srp, nfs). Moreover, I currently do not see any reason why other HBAs (or devices in general) that support interrupt moderation wouldn't be able to hook into this framework as well.

Note that the code is at *RFC* level, attempting to convey the concept. If the direction is acceptable, the setup can be easily modified and cleanups can be performed.

Initial benchmarking shows promising results: a 50% improvement in throughput when testing a high load of small I/O. The experiment used an nvme-rdma host vs. an nvmet-rdma target exposing a null_blk device. The workload was a multithreaded fio run with high queue-depth and a block size of 512B read I/O (a 4K block size would exceed 100 GbE wire speed). Without adaptive moderation the results reach ~8M IOPs, bottlenecking the host and target cpu. With adaptive moderation enabled, IOPs quickly converge to ~12M IOPs (at the expense of slightly higher latencies, obviously), and debugfs stats show that the moderation reached the throughput optimum level. There is a currently known issue I've observed where, under some conditions, the moderator converges back to the latency optimum (after reaching throughput optimum am levels); I'll work to fix the tuning algorithm.

Thanks to Idan Burstein for running some benchmarks on his performance setup. I've also tested this locally with my single core VMs and saw a similar improvement of ~50% in throughput in a similar workload (355 KIOPs vs. 235 KIOPs). More testing would help a lot to confirm and improve the implementation.

QD=1 latency tests showed a marginal regression of up to 2% in latency (lightly tested though). The reason, at this point, is that the moderator still bounces constantly between the low latency am levels (I would like to improve that).
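To illustrate the level-stepping trend described earlier, here is a deliberately simplified model of the decision taken once a stats window completes: step toward the throughput levels on a significant rise in completion rate, and back toward the latency levels on a significant drop. The 10% threshold and all names are mine for illustration; this is not the tuning algorithm from the patches.

```c
/* Simplified model of the level-stepping decision described above.
 * Threshold and names are illustrative, not taken from the patches. */
#include <assert.h>

enum am_step { AM_STAY, AM_STEP_UP, AM_STEP_DOWN };

/* Treat a >10% change in completions per stats window as significant. */
static enum am_step am_decide(unsigned long prev_comps,
			      unsigned long cur_comps)
{
	if (cur_comps * 10 > prev_comps * 11)
		return AM_STEP_UP;	/* load rose: favor throughput */
	if (cur_comps * 10 < prev_comps * 9)
		return AM_STEP_DOWN;	/* load fell: favor latency */
	return AM_STAY;
}

/* Apply a single step, clamped to the valid level range. */
static unsigned int am_apply(unsigned int cur_level,
			     unsigned int nr_levels,
			     enum am_step step)
{
	if (step == AM_STEP_UP && cur_level + 1 < nr_levels)
		return cur_level + 1;
	if (step == AM_STEP_DOWN && cur_level > 0)
		return cur_level - 1;
	return cur_level;
}
```

A real implementation would additionally track the trend direction across windows (as the cover letter describes), which is what a naive threshold like this lacks, and which is presumably where the bounce-back issue mentioned above comes from.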
Another observed issue is the presence of user context polling (IOCB_HIPRI), which does not update the irq_am stats (mainly because it is not interrupt driven). This can cause the moderator to do the wrong thing, as it is based on a partial view of the load (optimizing for latency instead of getting out of the poller's way). However, recent discussions raised the possibility that polling requests will be executed on a different set of queues with interrupts disabled altogether, which would make this a non-issue. Nonetheless, I would like to get some initial feedback on the approach.

Also, I'm not an expert in tuning the algorithm. The basic approach was inspired by the mlx5 driver implementation, which seemed the closest fit to the abstraction level I was aiming for. So I'd also love to get some ideas on how to tune the algorithm better for various workloads (hence the RFC).

Lastly, I have also attempted to hook this into nvme (pcie), but that wasn't successful, mainly because the coalescing set_feature is global to the controller and not per-queue. I'll look into bringing per-queue coalescing to the NVMe TWG (in case the community is interested in supporting this).
Feedback would be highly appreciated, as well as a test drive with the code in case anyone is interested :)

Sagi Grimberg (5):
  irq-am: Introduce helper library for adaptive moderation implementation
  irq-am: add some debugfs exposure on tuning state
  irq_poll: wire up irq_am and allow to initialize it
  IB/cq: add adaptive moderation support
  IB/cq: wire up adaptive moderation to workqueue based completion queues

 drivers/infiniband/core/cq.c |  73 ++++++++++-
 include/linux/irq-am.h       | 118 ++++++++++++++++++
 include/linux/irq_poll.h     |   9 ++
 include/rdma/ib_verbs.h      |   9 +-
 lib/Kconfig                  |   6 +
 lib/Makefile                 |   1 +
 lib/irq-am.c                 | 291 +++++++++++++++++++++++++++++++++++++++++++
 lib/irq_poll.c               |  30 ++++-
 8 files changed, 529 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/irq-am.h
 create mode 100644 lib/irq-am.c

-- 
2.14.1