Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Jens,

On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:

> On 2/9/24 10:43 AM, Jacob Pan wrote:
> > Hi Jens,
> > 
> > On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:
> >   
> >> Hi Jacob,
> >>
> >> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> >> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> >> reads, using io_uring.
> >>
> >> For reference, using polled IO:
> >>
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>
> >> which is abount 5.1M/drive, which is what they can deliver.
> >>
> >> Before your patches, I see:
> >>
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>
> >> at 2.82M ints/sec. With the patches, I see:
> >>
> >> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>
> >> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >> quite at the extent I expected. Booted with 'posted_msi' and I do see
> >> posted interrupts increasing in the PMN in /proc/interrupts, 
> >>  
> > The ints/sec reduction is not as high as I expected either, especially
> > at this high rate. Which means not enough coalescing going on to get the
> > performance benefits.  
> 
> Right, it means that we're getting pretty decent commands-per-int
> coalescing already. I added another drive and repeated, here's that one:
> 
> IOPS w/polled: 25.7M IOPS
> 
> Stock kernel:
> 
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> 
> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
> 
> Patched kernel:
> 
> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
> 
> at the same interrupt rate. So not a reduction, but slighter higher
> perf. Maybe we're reaping more commands on average per interrupt.
> 
> Anyway, not a lot of interesting data there, just figured I'd re-run it
> with the added drive.
> 
> > The opportunity of IRQ coalescing is also dependent on how long the
> > driver's hardirq handler executes. In the posted MSI demux loop, it does
> > not wait for more MSIs to come before existing the pending IRQ polling
> > loop. So if the hardirq handler finishes very quickly, it may not
> > coalesce as much. Perhaps, we need to find more "useful" work to do to
> > maximize the window for coalescing.
> > 
> > I am not familiar with optane driver, need to look into how its hardirq
> > handler work. I have only tested NVMe gen5 in terms of storage IO, i saw
> > 30-50% ints/sec reduction at even lower IRQ rate (200k/sec).  
> 
> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> cheap - for as long as there are CQEs in the completion ring, it'll reap
> them and complete them. That does mean that if we get an IRQ and there's
> more than one entry to complete, we will do all of them. No IRQ
> coalescing is configured (nvme kind of sucks for that...), but optane
> media is much faster than flash, so that may be a difference.
> 
Yeah, I also check the the driver code it seems just wake up the threaded
handler.

For the record, here is my set up and performance data for 4 Samsung disks.
IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is that
IRQ throughput is improved instead of reduction with this patch on my setup.
e.g. BEFORE: 185545/sec/vector 
     AFTER:  220128

CPU: (highest non-turbo freq, maybe different on yours).
echo "Set CPU frequency P1 2.7GHz"                                                                      
for i in `seq 0 1 127`; do  echo 2700000 >  /sys/devices/system/cpu/cpu$i/cpufreq/scaling_max_freq ;done
for i in `seq 0 1 127`; do  echo 2700000 >  /sys/devices/system/cpu/cpu$i/cpufreq/scaling_min_freq ;done

PCI:
[root@emr-bkc posted_msi_tests]# lspci -vv -nn -s 0000:64:00.0|grep -e Lnk -e Sam -e nvme                                                   
64:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM174X [144d:a826] (prog-if 02 [NVM Express]) 
        Subsystem: Samsung Electronics Co Ltd Device [144d:aa0a]                                                                            
                LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM notsupported                                                                 
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled-CommClk+                                                                     
                LnkSta: Speed 32GT/s (ok), Width x4(ok)                                                                                    
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS
                LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis

NVME setup:                                            
nvme5n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme6n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme3n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme4n1       SAMSUNG MZWLO1T9HCJR-00A07                    

FIO:
[global]                      
bs=4k                         
direct=1                      
norandommap                   
ioengine=libaio               
randrepeat=0                  
readwrite=randread            
group_reporting               
time_based                    
iodepth=64                    
exitall                       
random_generator=tausworthe64 
runtime=30                    
ramp_time=3                   
numjobs=8                     
group_reporting=1             
                              
#cpus_allowed_policy=shared   
cpus_allowed_policy=split     
[disk_nvme6n1_thread_1]       
filename=/dev/nvme6n1         
cpus_allowed=0-7       
[disk_nvme6n1_thread_1]
filename=/dev/nvme5n1  
cpus_allowed=8-15      
[disk_nvme5n1_thread_2]
filename=/dev/nvme4n1  
cpus_allowed=16-23     
[disk_nvme5n1_thread_3]
filename=/dev/nvme3n1  
cpus_allowed=24-31     

iostat w/o posted MSI patch, v6.8-rc1:						
nvme3c3n1     1615525.00   6462100.00         0.00         0.00    6462100						
nvme4c4n1     1615471.00   6461884.00         0.00         0.00    6461884						
nvme5c5n1     1615602.00   6462408.00         0.00         0.00    6462408						
nvme6c6n1     1614637.00   6458544.00         0.00         0.00    6458544	

irqtop (delta 1 sec.)					
           IRQ           TOTAL          DELTA NAME                                      							
           800         6290026         185545 IR-PCI-MSIX-0000:65:00.0 76-edge nvme5q76							
           797         6279554         185295 IR-PCI-MSIX-0000:65:00.0 73-edge nvme5q73							
           799         6281627         185200 IR-PCI-MSIX-0000:65:00.0 75-edge nvme5q75							
           802         6285742         185185 IR-PCI-MSIX-0000:65:00.0 78-edge nvme5q78							
	... ... similar irq rate for all 32 vectors

iostat w/ posted MSI patch:
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd						
nvme3c3n1     2184313.00   8737256.00         0.00         0.00    8737256          0          0						
nvme4c4n1     2184241.00   8736972.00         0.00         0.00    8736972          0          0						
nvme5c5n1     2184269.00   8737080.00         0.00         0.00    8737080          0          0						
nvme6c6n1     2184003.00   8736012.00         0.00         0.00    8736012          0          0						
						
irqtop w/ posted MSI patch:
           IRQ           TOTAL           DELTA NAME                                     							
           PMN      5230078416         5502657 Posted MSI notification event            							
           423       138068935          220128 IR-PCI-MSIX-0000:64:00.0 80-edge nvme4q80							
           425       138057654          219963 IR-PCI-MSIX-0000:64:00.0 82-edge nvme4q82							
           426       138101745          219890 IR-PCI-MSIX-0000:64:00.0 83-edge nvme4q83							
	... ... similar irq rate for all 32 vectors
IRQ coalescing ratio: posted interrupt notification (PMN)/total MSIs = 78%
550/(22*32.)=.78125         


Thanks,

Jacob




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux