Hi Jens,

On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:
> On 2/9/24 10:43 AM, Jacob Pan wrote:
> > Hi Jens,
> >
> > On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:
> >
> >> Hi Jacob,
> >>
> >> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> >> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> >> reads, using io_uring.
> >>
> >> For reference, using polled IO:
> >>
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>
> >> which is about 5.1M/drive, which is what they can deliver.
> >>
> >> Before your patches, I see:
> >>
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>
> >> at 2.82M ints/sec. With the patches, I see:
> >>
> >> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>
> >> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >> quite to the extent I expected. Booted with 'posted_msi' and I do see
> >> posted interrupts increasing in the PMN in /proc/interrupts,
> >>
> > The ints/sec reduction is not as high as I expected either, especially
> > at this high rate. That means not enough coalescing is going on to get
> > the performance benefits.
>
> Right, it means that we're getting pretty decent commands-per-int
> coalescing already. I added another drive and repeated, here's that one:
>
> IOPS w/polled: 25.7M IOPS
>
> Stock kernel:
>
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
>
> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
>
> Patched kernel:
>
> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
>
> at the same interrupt rate. So not a reduction, but slightly higher
> perf. Maybe we're reaping more commands on average per interrupt.
>
> Anyway, not a lot of interesting data there, just figured I'd re-run it
> with the added drive.
>
> > The opportunity for IRQ coalescing is also dependent on how long the
> > driver's hardirq handler executes. In the posted MSI demux loop, it does
> > not wait for more MSIs to come before exiting the pending IRQ polling
> > loop. So if the hardirq handler finishes very quickly, it may not
> > coalesce as much. Perhaps we need to find more "useful" work to do to
> > maximize the window for coalescing.
> >
> > I am not familiar with the optane driver, need to look into how its
> > hardirq handler works. I have only tested NVMe gen5 in terms of storage
> > IO; I saw a 30-50% ints/sec reduction at an even lower IRQ rate
> > (200k/sec).
>
> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> cheap - for as long as there are CQEs in the completion ring, it'll reap
> them and complete them. That does mean that if we get an IRQ and there's
> more than one entry to complete, we will do all of them. No IRQ
> coalescing is configured (nvme kind of sucks for that...), but optane
> media is much faster than flash, so that may be a difference.
>
Yeah, I also checked the driver code; with threaded interrupts it seems to
just wake up the threaded handler.
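Roughly, the hardirq path I am looking at is the following (paraphrased
from memory of drivers/nvme/host/pci.c when the nvme module is loaded with
use_threaded_interrupts, so the details may differ by kernel version):

/* Paraphrased from drivers/nvme/host/pci.c, threaded-interrupt path;
 * exact code may differ between kernel versions. The hardirq handler
 * only checks whether a new CQE is pending and, if so, wakes the irq
 * thread, which then runs nvme_irq() in thread context to reap and
 * complete the CQEs.
 */
static irqreturn_t nvme_irq_check(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;

	if (nvme_cqe_pending(nvmeq))	/* phase bit flipped at cq_head? */
		return IRQ_WAKE_THREAD;	/* defer reaping to the irq thread */
	return IRQ_NONE;
}

So almost no time is spent in hardirq context itself, which also limits how
many MSIs can pile up and be coalesced while the posted MSI demux loop is
running.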
For the record, here is my setup and performance data for 4 Samsung disks.
IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is that
on my setup the per-vector IRQ rate goes up rather than down with this
patch, e.g. BEFORE: 185545 ints/sec/vector, AFTER: 220128 ints/sec/vector.

CPU: pinned at P1 (highest non-turbo freq, 2.7GHz here, maybe different on
yours):

echo "Set CPU frequency P1 2.7GHz"
for i in `seq 0 1 127`; do echo 2700000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_max_freq; done
for i in `seq 0 1 127`; do echo 2700000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_min_freq; done

PCI:

[root@emr-bkc posted_msi_tests]# lspci -vv -nn -s 0000:64:00.0 | grep -e Lnk -e Sam -e nvme
64:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd
        NVMe SSD Controller PM174X [144d:a826] (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device [144d:aa0a]
        LnkCap:  Port #0, Speed 32GT/s, Width x4, ASPM not supported
        LnkCtl:  ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
        LnkSta:  Speed 32GT/s (ok), Width x4 (ok)
        LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS
        LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis

NVMe setup:

nvme5n1  SAMSUNG MZWLO1T9HCJR-00A07
nvme6n1  SAMSUNG MZWLO1T9HCJR-00A07
nvme3n1  SAMSUNG MZWLO1T9HCJR-00A07
nvme4n1  SAMSUNG MZWLO1T9HCJR-00A07

FIO config:

[global]
bs=4k
direct=1
norandommap
ioengine=libaio
randrepeat=0
readwrite=randread
group_reporting
time_based
iodepth=64
exitall
random_generator=tausworthe64
runtime=30
ramp_time=3
numjobs=8
group_reporting=1
#cpus_allowed_policy=shared
cpus_allowed_policy=split

[disk_nvme6n1_thread_1]
filename=/dev/nvme6n1
cpus_allowed=0-7

[disk_nvme6n1_thread_1]
filename=/dev/nvme5n1
cpus_allowed=8-15

[disk_nvme5n1_thread_2]
filename=/dev/nvme4n1
cpus_allowed=16-23

[disk_nvme5n1_thread_3]
filename=/dev/nvme3n1
cpus_allowed=24-31

iostat w/o posted MSI patch, v6.8-rc1:

nvme3c3n1   1615525.00   6462100.00      0.00      0.00   6462100
nvme4c4n1   1615471.00   6461884.00      0.00      0.00   6461884
nvme5c5n1   1615602.00   6462408.00      0.00      0.00   6462408
nvme6c6n1   1614637.00   6458544.00      0.00      0.00   6458544

irqtop (delta 1 sec.):

 IRQ        TOTAL      DELTA  NAME
 800      6290026     185545  IR-PCI-MSIX-0000:65:00.0 76-edge nvme5q76
 797      6279554     185295  IR-PCI-MSIX-0000:65:00.0 73-edge nvme5q73
 799      6281627     185200  IR-PCI-MSIX-0000:65:00.0 75-edge nvme5q75
 802      6285742     185185  IR-PCI-MSIX-0000:65:00.0 78-edge nvme5q78
... ...
similar irq rate for all 32 vectors

iostat w/ posted MSI patch:

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme3c3n1   2184313.00   8737256.00         0.00         0.00    8737256          0          0
nvme4c4n1   2184241.00   8736972.00         0.00         0.00    8736972          0          0
nvme5c5n1   2184269.00   8737080.00         0.00         0.00    8737080          0          0
nvme6c6n1   2184003.00   8736012.00         0.00         0.00    8736012          0          0

irqtop w/ posted MSI patch:

 IRQ        TOTAL      DELTA  NAME
 PMN   5230078416    5502657  Posted MSI notification event
 423    138068935     220128  IR-PCI-MSIX-0000:64:00.0 80-edge nvme4q80
 425    138057654     219963  IR-PCI-MSIX-0000:64:00.0 82-edge nvme4q82
 426    138101745     219890  IR-PCI-MSIX-0000:64:00.0 83-edge nvme4q83
... ...
similar irq rate for all 32 vectors

IRQ coalescing ratio: posted MSI notifications (PMN) / total MSIs handled
	= 5502657 / (220128 * 32) ~= 0.78
i.e. on average each posted interrupt notification covers about 1.28 MSIs.

Thanks,

Jacob