Hi Jens,

On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:

> Hi Jacob,
> 
> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> reads, using io_uring.
> 
> For reference, using polled IO:
> 
> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> 
> which is about 5.1M/drive, which is what they can deliver.
> 
> Before your patches, I see:
> 
> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> 
> at 2.82M ints/sec. With the patches, I see:
> 
> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> 
> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> quite at the extent I expected. Booted with 'posted_msi' and I do see
> posted interrupts increasing in the PMN in /proc/interrupts,
> 
The ints/sec reduction is not as high as I expected either, especially
at this high rate, which means there is not enough coalescing going on
to get the performance benefits.

The opportunity for IRQ coalescing also depends on how long the driver's
hardirq handler executes. The posted MSI demux loop does not wait for
more MSIs to arrive before exiting the pending IRQ polling loop. So if
the hardirq handler finishes very quickly, it may not coalesce as much.
Perhaps we need to find more "useful" work to do to maximize the window
for coalescing.

I am not familiar with the Optane driver and need to look into how its
hardirq handler works. I have only tested NVMe gen5 in terms of storage
IO, where I saw a 30-50% ints/sec reduction at an even lower IRQ rate
(200k/sec).

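To illustrate the point about the coalescing window, here is a minimal
userspace sketch of the demux behaviour described above. It is not the
code from the patch set: post_msi(), handle_pending(),
demux_notification(), the two handlers and the MAX_LOOPS cap are
hypothetical names made up for this illustration, and a plain store
stands in for the atomic xchg that real code would need. It only models
the idea that MSIs arriving while a handler is still running get picked
up by the same notification, whereas a fast handler leaves the loop
after a single pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PIR_WORDS 4   /* 4 x 64 bits = 256 vectors, like pir_copy[4] */
#define MAX_LOOPS 3   /* hypothetical cap on polling passes */

typedef void (*irq_handler_t)(uint64_t *pir, int vec, void *ctx);

/* Hypothetical helper: mark a vector pending, as a posted MSI would. */
static void post_msi(uint64_t *pir, int vec)
{
    assert(vec >= 0 && vec < PIR_WORDS * 64);
    pir[vec / 64] |= UINT64_C(1) << (vec % 64);
}

/*
 * One polling pass: snapshot and clear the pending bits, then dispatch
 * every vector set in the snapshot. Returns false if nothing was pending.
 * Single-threaded model; real code would need an atomic exchange here.
 */
static bool handle_pending(uint64_t *pir, irq_handler_t fn, void *ctx,
                           int *handled)
{
    uint64_t copy[PIR_WORDS];
    bool any = false;

    for (int i = 0; i < PIR_WORDS; i++) {
        copy[i] = pir[i];
        if (copy[i]) {
            pir[i] = 0;
            any = true;
        }
    }

    for (int vec = 0; any && vec < PIR_WORDS * 64; vec++) {
        if (copy[vec / 64] & (UINT64_C(1) << (vec % 64))) {
            fn(pir, vec, ctx);
            (*handled)++;
        }
    }
    return any;
}

/*
 * One notification interrupt: poll until a pass finds nothing pending
 * or the cap is hit. There is no waiting for further MSIs to arrive.
 * Returns how many MSIs this single notification ended up servicing.
 */
static int demux_notification(uint64_t *pir, irq_handler_t fn, void *ctx)
{
    int handled = 0;

    for (int pass = 0; pass < MAX_LOOPS; pass++) {
        if (!handle_pending(pir, fn, ctx, &handled))
            break;
    }
    return handled;
}

/* Fast handler: returns before any further MSI has been posted. */
static void fast_handler(uint64_t *pir, int vec, void *ctx)
{
    (void)pir; (void)vec; (void)ctx;
}

/*
 * Slow handler: models one more MSI being posted while it runs, for as
 * long as the caller-supplied budget (an int pointed to by ctx) lasts.
 */
static void slow_handler(uint64_t *pir, int vec, void *ctx)
{
    int *budget = ctx;

    if (*budget > 0) {
        (*budget)--;
        post_msi(pir, vec + 1);
    }
}
```

In this model, posting one MSI and calling demux_notification() with
fast_handler services exactly one interrupt per notification, while
slow_handler with a budget of 2 services three in a single notification.
That is the effect I would expect a longer-running hardirq handler to
have on the coalescing window.
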
> Probably want to fold this one in:
> 
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 8e09d40ea928..a289282f1cf9 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -393,7 +393,7 @@ void intel_posted_msi_init(void)
>   * instead of:
>   *		read, xchg, read, xchg, read, xchg, read, xchg
>   */
> -static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
> +static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
>  {
>  	int i, vec = FIRST_EXTERNAL_VECTOR;
>  	unsigned long pir_copy[4];
> 
Good catch! will do.

Thanks,

Jacob