Hi Jens,

On Mon, 12 Feb 2024 11:36:42 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:

> On 2/12/24 11:27 AM, Jacob Pan wrote:
> > Hi Jens,
> >
> > On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:
> >
> >> On 2/9/24 10:43 AM, Jacob Pan wrote:
> >>> Hi Jens,
> >>>
> >>> On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:
> >>>
> >>>> Hi Jacob,
> >>>>
> >>>> I gave this a quick spin, using 4 gen2 optane drives. Basic test,
> >>>> just IOPS bound on the drive, and using 1 thread per drive for IO.
> >>>> Random reads, using io_uring.
> >>>>
> >>>> For reference, using polled IO:
> >>>>
> >>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >>>> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>>>
> >>>> which is about 5.1M/drive, which is what they can deliver.
> >>>>
> >>>> Before your patches, I see:
> >>>>
> >>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>>>
> >>>> at 2.82M ints/sec. With the patches, I see:
> >>>>
> >>>> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>>>
> >>>> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >>>> quite to the extent I expected. Booted with 'posted_msi' and I do see
> >>>> posted interrupts increasing in the PMN in /proc/interrupts.
> >>>>
> >>> The ints/sec reduction is not as high as I expected either, especially
> >>> at this high rate. Which means not enough coalescing is going on to
> >>> get the performance benefits.
> >>
> >> Right, it means that we're getting pretty decent commands-per-int
> >> coalescing already. I added another drive and repeated, here's that
> >> one:
> >>
> >> IOPS w/polled: 25.7M IOPS
> >>
> >> Stock kernel:
> >>
> >> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> >> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> >> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> >>
> >> at ~3.7M ints/sec, or about 5.8 IOPS/int on average.
> >>
> >> Patched kernel:
> >>
> >> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> >> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> >> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
> >>
> >> at the same interrupt rate. So not a reduction, but slightly higher
> >> perf. Maybe we're reaping more commands on average per interrupt.
> >>
> >> Anyway, not a lot of interesting data there, just figured I'd re-run it
> >> with the added drive.
> >>
> >>> The opportunity for IRQ coalescing is also dependent on how long the
> >>> driver's hardirq handler executes. In the posted MSI demux loop, it
> >>> does not wait for more MSIs to come before exiting the pending IRQ
> >>> polling loop. So if the hardirq handler finishes very quickly, it may
> >>> not coalesce as much. Perhaps we need to find more "useful" work to
> >>> do to maximize the window for coalescing.
> >>>
> >>> I am not familiar with the optane driver; I need to look into how its
> >>> hardirq handler works. I have only tested NVMe gen5 for storage IO,
> >>> where I saw a 30-50% ints/sec reduction at an even lower IRQ rate
> >>> (200k/sec).
> >>
> >> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> >> cheap - for as long as there are CQEs in the completion ring, it'll
> >> reap them and complete them. That does mean that if we get an IRQ and
> >> there's more than one entry to complete, we will do all of them. No IRQ
> >> coalescing is configured (nvme kind of sucks for that...), but optane
> >> media is much faster than flash, so that may be a difference.
> >>
> > Yeah, I also checked the driver code; it seems to just wake up the
> > threaded handler.
>
> That only happens if you're using threaded interrupts, which is not the
> default as it's much slower. What happens for the normal case is that we
> init a batch, and then poll the CQ ring for completions, and then add
> them to the completion batch. Once no more are found, we complete the
> batch.
>
Thanks for the explanation.
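To check my understanding of that non-threaded path: the handler keeps
consuming CQEs for as long as the phase tag shows newly posted entries,
adds them to a batch, and completes the batch once the ring is drained.
A simplified, self-contained sketch of that pattern (not the actual nvme
driver code; the names and batch hooks here are made up):

#include <stdint.h>

/* One completion queue entry; bit 0 of status is the phase tag. */
struct cqe {
	uint16_t status;
	/* command id, result, etc. elided */
};

struct cq {
	struct cqe *ring;	/* written by the device */
	uint16_t head;		/* next entry the host will consume */
	uint16_t depth;
	uint8_t phase;		/* expected phase tag, starts at 1 */
};

/* Stand-ins for the real per-command and batch completion hooks. */
static void batch_add(struct cqe *cqe) { (void)cqe; }
static void batch_complete(void) { }

/*
 * Hard IRQ (or poll) path: drain every CQE that has already been
 * posted, then complete them as one batch. A single interrupt can
 * therefore retire many commands.
 */
static int reap_completions(struct cq *cq)
{
	int reaped = 0;

	while ((cq->ring[cq->head].status & 1) == cq->phase) {
		batch_add(&cq->ring[cq->head]);
		if (++cq->head == cq->depth) {
			cq->head = 0;
			cq->phase ^= 1;	/* phase flips on ring wrap */
		}
		reaped++;
	}
	if (reaped) {
		/* a real driver would write cq->head to the CQ doorbell */
		batch_complete();
	}
	return reaped;
}

If that is right, the coalescing per interrupt is simply however many
CQEs the device has posted by the time the handler runs, which matches
the ~5.8 IOPS/int you measured above.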
> You're not using threaded interrupts, are you?

No, I didn't add the module parameter "use_threaded_interrupts".

> > For the record, here is my setup and performance data for 4 Samsung
> > disks. IOPS increased from 1.6M per disk to 2.1M. One difference I
> > noticed is that IRQ throughput is improved rather than reduced with
> > this patch on my setup. e.g.
> > BEFORE: 185545/sec/vector
> > AFTER:  220128
>
> I'm surprised at the rates being that low, and if so, why the posted MSI
> makes a difference? Usually what I've seen for IRQ being slower than
> poll is if interrupt delivery is unreasonably slow on that architecture
> of machine. But ~200k/sec isn't that high at all.
>
> > [global]
> > bs=4k
> > direct=1
> > norandommap
> > ioengine=libaio
> > randrepeat=0
> > readwrite=randread
> > group_reporting
> > time_based
> > iodepth=64
> > exitall
> > random_generator=tausworthe64
> > runtime=30
> > ramp_time=3
> > numjobs=8
> > group_reporting=1
> >
> > #cpus_allowed_policy=shared
> > cpus_allowed_policy=split
> > [disk_nvme6n1_thread_1]
> > filename=/dev/nvme6n1
> > cpus_allowed=0-7
> > [disk_nvme6n1_thread_1]
> > filename=/dev/nvme5n1
> > cpus_allowed=8-15
> > [disk_nvme5n1_thread_2]
> > filename=/dev/nvme4n1
> > cpus_allowed=16-23
> > [disk_nvme5n1_thread_3]
> > filename=/dev/nvme3n1
> > cpus_allowed=24-31
>
> For better performance, I'd change that ioengine=libaio to:
>
> ioengine=io_uring
> fixedbufs=1
> registerfiles=1
>
> Particularly fixedbufs makes a big difference, as a big cycle consumer
> is mapping/unmapping pages from the application space into the kernel
> for O_DIRECT. With fixedbufs=1, this is done once and we just reuse the
> buffers. At least for my runs, this is ~15% of the systime for doing IO.
> It also removes the page referencing, which isn't as big a consumer, but
> still noticeable.
>
Indeed, the CPU utilization (system time) goes down significantly. I got
the following with the posted MSI patch applied:

Before (aio):
  read: IOPS=8925k, BW=34.0GiB/s (36.6GB/s)(1021GiB/30001msec)
  user    3m25.156s
  sys     11m16.785s

After (fixedbufs, io_uring engine):
  read: IOPS=8811k, BW=33.6GiB/s (36.1GB/s)(1008GiB/30002msec)
  user    2m56.255s
  sys     8m56.378s

It seems there is no gain in IOPS, just a reduction in CPU utilization.
Both are an improvement over libaio without the posted MSI patch.

> Anyway, side quest, but I think you'll find this considerably reduces
> overhead / improves performance. Also makes it so that you can compare
> with polled IO on nvme, which aio can't do. You'd just add hipri=1 as an
> option for that (with a side note that you need to configure nvme poll
> queues, see the poll_queues parameter).
>
> On my box, all the NVMe devices seem to be on node1, not node0 which
> looks like it's the CPUs you are using. Might be worth checking and
> adjusting your CPU domains for each drive? I also tend to get better
> performance by removing the CPU scheduler, eg just pin each job to a
> single CPU rather than many. It's just one process/thread anyway, so
> really no point in giving it options here. It'll help reduce variability
> too, which can be a pain in the butt to deal with.
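Side note on the fixedbufs/registerfiles suggestion: if I read it right,
in liburing terms that amounts to registering the buffer and the file
once up front and then reusing them for every submission, roughly like
the sketch below (illustrative only, minimal error handling, not what
fio does verbatim):

/* build: gcc -O2 fixed_read.c -o fixed_read -luring */
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>

#define BUF_SZ	4096

int main(int argc, char **argv)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	void *buf;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, BUF_SZ))
		return 1;

	io_uring_queue_init(64, &ring, 0);

	/* Done once: the buffer pages are pinned and the file reference
	 * is held for the lifetime of the ring. */
	iov.iov_base = buf;
	iov.iov_len  = BUF_SZ;
	io_uring_register_buffers(&ring, &iov, 1);
	io_uring_register_files(&ring, &fd, 1);

	/* Per IO: refer to registered buffer 0 and file index 0, so no
	 * per-IO page pinning or file lookup is needed. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, 0 /* file index */, buf, BUF_SZ,
				 0 /* offset */, 0 /* buf index */);
	sqe->flags |= IOSQE_FIXED_FILE;
	io_uring_submit(&ring);

	io_uring_wait_cqe(&ring, &cqe);
	printf("read returned %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	return 0;
}

That would explain the systime drop I see: the page pinning and file
lookups move out of the per-IO path.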
Much faster with poll_queues=32 (32 jobs):

  read: IOPS=13.0M, BW=49.6GiB/s (53.3GB/s)(1489GiB/30001msec)
  user    2m29.177s
  sys     15m7.022s

Observed no IRQ counts from NVMe.

Thanks,

Jacob