Hi Jens,

On Mon, 12 Feb 2024 11:36:42 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:

> On 2/12/24 11:27 AM, Jacob Pan wrote:
> > Hi Jens,
> >
> > On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:
> >
> >> On 2/9/24 10:43 AM, Jacob Pan wrote:
> >>> Hi Jens,
> >>>
> >>> On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:
> >>>
> >>>> Hi Jacob,
> >>>>
> >>>> I gave this a quick spin, using 4 gen2 optane drives. Basic test,
> >>>> just IOPS bound on the drive, and using 1 thread per drive for IO.
> >>>> Random reads, using io_uring.
> >>>>
> >>>> For reference, using polled IO:
> >>>>
> >>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >>>> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>>>
> >>>> which is about 5.1M/drive, which is what they can deliver.
> >>>>
> >>>> Before your patches, I see:
> >>>>
> >>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>>>
> >>>> at 2.82M ints/sec. With the patches, I see:
> >>>>
> >>>> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>>>
> >>>> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >>>> quite to the extent I expected. Booted with 'posted_msi' and I do see
> >>>> posted interrupts increasing in the PMN in /proc/interrupts.
> >>>>
> >>> The ints/sec reduction is not as high as I expected either, especially
> >>> at this high rate. Which means not enough coalescing is going on to
> >>> get the performance benefits.
> >>
> >> Right, it means that we're getting pretty decent commands-per-int
> >> coalescing already. I added another drive and repeated, here's that
> >> one:
> >>
> >> IOPS w/polled: 25.7M IOPS
> >>
> >> Stock kernel:
> >>
> >> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> >> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> >> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> >>
> >> at ~3.7M ints/sec, or about 5.8 IOPS/int on average.
> >>
> >> Patched kernel:
> >>
> >> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> >> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> >> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
> >>
> >> at the same interrupt rate. So not a reduction, but slightly higher
> >> perf. Maybe we're reaping more commands on average per interrupt.
> >>
> >> Anyway, not a lot of interesting data there, just figured I'd re-run it
> >> with the added drive.
> >>
> >>> The opportunity for IRQ coalescing is also dependent on how long the
> >>> driver's hardirq handler executes. In the posted MSI demux loop, it
> >>> does not wait for more MSIs to come before exiting the pending IRQ
> >>> polling loop. So if the hardirq handler finishes very quickly, it may
> >>> not coalesce as much. Perhaps we need to find more "useful" work to
> >>> do to maximize the window for coalescing.
> >>>
> >>> I am not familiar with the optane driver; I need to look into how its
> >>> hardirq handler works. I have only tested NVMe gen5 for storage IO,
> >>> where I saw a 30-50% ints/sec reduction at an even lower IRQ rate
> >>> (200k/sec).
> >>
> >> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> >> cheap - for as long as there are CQEs in the completion ring, it'll
> >> reap them and complete them. That does mean that if we get an IRQ and
> >> there's more than one entry to complete, we will do all of them. No IRQ
> >> coalescing is configured (nvme kind of sucks for that...), but optane
> >> media is much faster than flash, so that may be a difference.
> >>
> > Yeah, I also checked the driver code; it seems to just wake up the
> > threaded handler.
>
> That only happens if you're using threaded interrupts, which is not the
> default as it's much slower. What happens for the normal case is that we
> init a batch, and then poll the CQ ring for completions, and then add
> them to the completion batch. Once no more are found, we complete the
> batch.
>
Thanks for the explanation.
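To check my understanding of that non-threaded path: the handler keeps
consuming CQEs for as long as the phase tag shows newly posted entries,
adds them to a batch, and completes the batch once the ring is drained.
A simplified, self-contained sketch of that pattern (not the actual nvme
driver code; the names and batch hooks here are made up):

#include <stdint.h>

/* One completion queue entry; bit 0 of status is the phase tag. */
struct cqe {
	uint16_t status;
	/* command id, result, etc. elided */
};

struct cq {
	struct cqe *ring;	/* written by the device */
	uint16_t head;		/* next entry the host will consume */
	uint16_t depth;
	uint8_t phase;		/* expected phase tag, starts at 1 */
};

/* Stand-ins for the real per-command and batch completion hooks. */
static void batch_add(struct cqe *cqe) { (void)cqe; }
static void batch_complete(void) { }

/*
 * Hard IRQ (or poll) path: drain every CQE that has already been
 * posted, then complete them as one batch. A single interrupt can
 * therefore retire many commands.
 */
static int reap_completions(struct cq *cq)
{
	int reaped = 0;

	while ((cq->ring[cq->head].status & 1) == cq->phase) {
		batch_add(&cq->ring[cq->head]);
		if (++cq->head == cq->depth) {
			cq->head = 0;
			cq->phase ^= 1;	/* phase flips on ring wrap */
		}
		reaped++;
	}
	if (reaped) {
		/* a real driver would write cq->head to the CQ doorbell */
		batch_complete();
	}
	return reaped;
}

If that is right, the coalescing per interrupt is simply however many
CQEs the device has posted by the time the handler runs, which matches
the ~5.8 IOPS/int you measured above.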
> You're not using threaded interrupts, are you?

No, I didn't add the module parameter "use_threaded_interrupts".

> > For the record, here is my setup and performance data for 4 Samsung
> > disks. IOPS increased from 1.6M per disk to 2.1M. One difference I
> > noticed is that IRQ throughput is improved rather than reduced with
> > this patch on my setup. e.g.
> > BEFORE: 185545/sec/vector
> > AFTER:  220128
>
> I'm surprised at the rates being that low, and if so, why the posted MSI
> makes a difference? Usually what I've seen for IRQ being slower than
> poll is if interrupt delivery is unreasonably slow on that architecture
> of machine. But ~200k/sec isn't that high at all.
>
> > [global]
> > bs=4k
> > direct=1
> > norandommap
> > ioengine=libaio
> > randrepeat=0
> > readwrite=randread
> > group_reporting
> > time_based
> > iodepth=64
> > exitall
> > random_generator=tausworthe64
> > runtime=30
> > ramp_time=3
> > numjobs=8
> > group_reporting=1
> >
> > #cpus_allowed_policy=shared
> > cpus_allowed_policy=split
> > [disk_nvme6n1_thread_1]
> > filename=/dev/nvme6n1
> > cpus_allowed=0-7
> > [disk_nvme6n1_thread_1]
> > filename=/dev/nvme5n1
> > cpus_allowed=8-15
> > [disk_nvme5n1_thread_2]
> > filename=/dev/nvme4n1
> > cpus_allowed=16-23
> > [disk_nvme5n1_thread_3]
> > filename=/dev/nvme3n1
> > cpus_allowed=24-31
>
> For better performance, I'd change that ioengine=libaio to:
>
> ioengine=io_uring
> fixedbufs=1
> registerfiles=1
>
> Particularly fixedbufs makes a big difference, as a big cycle consumer
> is mapping/unmapping pages from the application space into the kernel
> for O_DIRECT. With fixedbufs=1, this is done once and we just reuse the
> buffers. At least for my runs, this is ~15% of the systime for doing IO.
> It also removes the page referencing, which isn't as big a consumer, but
> still noticeable.
>
Indeed, the CPU utilization (system time) goes down significantly. I got
the following with the posted MSI patch applied:

Before (aio):
  read: IOPS=8925k, BW=34.0GiB/s (36.6GB/s)(1021GiB/30001msec)
  user    3m25.156s
  sys     11m16.785s

After (fixedbufs, io_uring engine):
  read: IOPS=8811k, BW=33.6GiB/s (36.1GB/s)(1008GiB/30002msec)
  user    2m56.255s
  sys     8m56.378s

It seems there is no gain in IOPS, just a reduction in CPU utilization.
Both are an improvement over libaio without the posted MSI patch.

> Anyway, side quest, but I think you'll find this considerably reduces
> overhead / improves performance. Also makes it so that you can compare
> with polled IO on nvme, which aio can't do. You'd just add hipri=1 as an
> option for that (with a side note that you need to configure nvme poll
> queues, see the poll_queues parameter).
>
> On my box, all the NVMe devices seem to be on node1, not node0 which
> looks like it's the CPUs you are using. Might be worth checking and
> adjusting your CPU domains for each drive? I also tend to get better
> performance by removing the CPU scheduler, eg just pin each job to a
> single CPU rather than many. It's just one process/thread anyway, so
> really no point in giving it options here. It'll help reduce variability
> too, which can be a pain in the butt to deal with.
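Side note on the fixedbufs/registerfiles suggestion: if I read it right,
in liburing terms that amounts to registering the buffer and the file
once up front and then reusing them for every submission, roughly like
the sketch below (illustrative only, minimal error handling, not what
fio does verbatim):

/* build: gcc -O2 fixed_read.c -o fixed_read -luring */
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>

#define BUF_SZ	4096

int main(int argc, char **argv)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	void *buf;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, BUF_SZ))
		return 1;

	io_uring_queue_init(64, &ring, 0);

	/* Done once: the buffer pages are pinned and the file reference
	 * is held for the lifetime of the ring. */
	iov.iov_base = buf;
	iov.iov_len  = BUF_SZ;
	io_uring_register_buffers(&ring, &iov, 1);
	io_uring_register_files(&ring, &fd, 1);

	/* Per IO: refer to registered buffer 0 and file index 0, so no
	 * per-IO page pinning or file lookup is needed. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, 0 /* file index */, buf, BUF_SZ,
				 0 /* offset */, 0 /* buf index */);
	sqe->flags |= IOSQE_FIXED_FILE;
	io_uring_submit(&ring);

	io_uring_wait_cqe(&ring, &cqe);
	printf("read returned %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	return 0;
}

That would explain the systime drop I see: the page pinning and file
lookups move out of the per-IO path.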
Much faster with poll_queues=32 (32 jobs):

  read: IOPS=13.0M, BW=49.6GiB/s (53.3GB/s)(1489GiB/30001msec)
  user    2m29.177s
  sys     15m7.022s

Observed no IRQ counts from NVMe.

Thanks,

Jacob