Hi Keith,

The performance results I see are very close between poll_queues and io_uring; I have posted them below. Since this topic is still pretty new to people, is there anything we need to tell the reader/user about poll_queues? What is important for usage? And can it be changed dynamically, or can poll_queues only be defined at (module) startup? My goal is to update the blog we built around testing Optane SSDs. Is there any possibility of an LWN article that goes deeper into this change to poll_queues?

What is interesting in the data below is that the clat time for io_uring is lower (better), but the performance in IOPS is not. pvsync2 is the most efficient, by a small margin, against the newer 3D XPoint device. (I have appended rough standalone sketches of the two submission paths at the bottom of this mail.)

Thanks,
Frank

Results:
kernel (elrepo) - 5.4.1-1.el8.elrepo.x86_64
cpu    - Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz - pinned to run at 3.1 GHz
fio    - fio-3.16-64-gfd988

Results of Gen2 Optane SSD with poll_queues (pvsync2) vs io_uring/hipri

pvsync2 (poll queues):

fio-3.16-64-gfd988
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=552MiB/s][r=141k IOPS][eta 00m:00s]
rand-read-4k-qd1: (groupid=0, jobs=1): err= 0: pid=10309: Tue Dec 31 10:49:33 2019
  read: IOPS=141k, BW=552MiB/s (579MB/s)(64.7GiB/120001msec)
    clat (nsec): min=6548, max=186309, avg=6809.48, stdev=497.58
     lat (nsec): min=6572, max=186333, avg=6834.24, stdev=499.28
    clat percentiles (usec):
     |  1.0000th=[    7],  5.0000th=[    7], 10.0000th=[    7],
     | 20.0000th=[    7], 30.0000th=[    7], 40.0000th=[    7],
     | 50.0000th=[    7], 60.0000th=[    7], 70.0000th=[    7],
     | 80.0000th=[    7], 90.0000th=[    7], 95.0000th=[    8],
     | 99.0000th=[    8], 99.5000th=[    8], 99.9000th=[    9],
     | 99.9500th=[   10], 99.9900th=[   18], 99.9990th=[  117],
     | 99.9999th=[  163]
   bw (  KiB/s): min=563512, max=567392, per=100.00%, avg=565635.38, stdev=846.99, samples=239
   iops        : min=140878, max=141848, avg=141408.82, stdev=211.76, samples=239
  lat (usec)   : 10=99.97%, 20=0.03%, 50=0.01%, 100=0.01%, 250=0.01%
  cpu          : usr=6.28%, sys=93.55%, ctx=408, majf=0, minf=96
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=16969949,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=552MiB/s (579MB/s), 552MiB/s-552MiB/s (579MB/s-579MB/s), io=64.7GiB (69.5GB), run=120001-120001msec

Disk stats (read/write):
  nvme3n1: ios=16955008/0, merge=0/0, ticks=101477/0, in_queue=0, util=99.95%

io_uring:

fio-3.16-64-gfd988
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=538MiB/s][r=138k IOPS][eta 00m:00s]
rand-read-4k-qd1: (groupid=0, jobs=1): err= 0: pid=10797: Tue Dec 31 10:53:29 2019
  read: IOPS=138k, BW=539MiB/s (565MB/s)(63.1GiB/120001msec)
    slat (nsec): min=1029, max=161248, avg=1204.69, stdev=219.02
    clat (nsec): min=262, max=208952, avg=5735.42, stdev=469.73
     lat (nsec): min=6691, max=210136, avg=7008.54, stdev=516.99
    clat percentiles (usec):
     |  1.0000th=[    6],  5.0000th=[    6], 10.0000th=[    6],
     | 20.0000th=[    6], 30.0000th=[    6], 40.0000th=[    6],
     | 50.0000th=[    6], 60.0000th=[    6], 70.0000th=[    6],
     | 80.0000th=[    6], 90.0000th=[    6], 95.0000th=[    6],
     | 99.0000th=[    7], 99.5000th=[    7], 99.9000th=[    8],
     | 99.9500th=[    9], 99.9900th=[   10], 99.9990th=[   52],
     | 99.9999th=[  161]
   bw (  KiB/s): min=548208, max=554504, per=100.00%, avg=551620.30, stdev=984.77, samples=239
   iops        : min=137052, max=138626, avg=137905.07, stdev=246.17, samples=239
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=99.98%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%
  cpu          : usr=7.39%, sys=92.44%, ctx=408, majf=0, minf=93
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=16548899,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=539MiB/s (565MB/s), 539MiB/s-539MiB/s (565MB/s-565MB/s), io=63.1GiB (67.8GB), run=120001-120001msec

Disk stats (read/write):
  nvme3n1: ios=16534429/0, merge=0/0, ticks=100320/0, in_queue=0, util=99.95%

Happy New Year Keith!

-----Original Message-----
From: Keith Busch <kbusch@xxxxxxxxxx>
Sent: Friday, December 20, 2019 1:21 PM
To: Ober, Frank <frank.ober@xxxxxxxxx>
Cc: linux-block@xxxxxxxxxxxxxxx; linux-nvme@xxxxxxxxxxxxxxxxxxx; Derrick, Jonathan <jonathan.derrick@xxxxxxxxx>; Rajendiran, Swetha <swetha.rajendiran@xxxxxxxxx>; Liang, Mark <mark.liang@xxxxxxxxx>
Subject: Re: Polled io for Linux kernel 5.x

On Thu, Dec 19, 2019 at 09:59:14PM +0000, Ober, Frank wrote:
> Thanks Keith, it makes sense to reserve and set it up uniquely if you
> can save hw interrupts. But why would io_uring then not need these
> queues, because a stack trace I ran shows without the special queues I
> am still entering bio_poll. With pvsync2 I can only do polled io with
> the poll_queues?

Polling can happen only if you have polled queues, so io_uring is not accomplishing anything by calling iopoll. I don't see an immediately good way to pass that information up to io_uring, though.
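
P.S. In case it helps readers of the blog, here is a minimal, untested sketch of what the pvsync2 engine with hipri=1 boils down to: a single 4KiB polled read issued with preadv2() and RWF_HIPRI. The device path is just the drive from the runs above, and as discussed, RWF_HIPRI only avoids the interrupt when the nvme driver actually has poll queues configured.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        /* Placeholder device from the runs above. */
        int fd = open("/dev/nvme3n1", O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* 4KiB aligned buffer, matching the rand-read-4k-qd1 job. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) {
                perror("posix_memalign");
                return 1;
        }
        struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

        /*
         * RWF_HIPRI asks the kernel to poll for the completion instead
         * of sleeping on an interrupt; it only saves the interrupt if
         * the driver has poll queues (nvme poll_queues > 0).
         */
        ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
        if (ret < 0)
                perror("preadv2");
        else
                printf("read %zd bytes\n", ret);

        free(buf);
        close(fd);
        return 0;
}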
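
The io_uring/hipri=1 side maps to IORING_SETUP_IOPOLL. Below is an equally rough liburing sketch (again untested, same placeholder device); per your note above, the iopoll path here only buys something when the driver has poll queues.

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        struct io_uring ring;

        /* IORING_SETUP_IOPOLL is what fio's io_uring engine uses for hipri=1. */
        int ret = io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL);
        if (ret < 0) {
                fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
                return 1;
        }

        /* IOPOLL wants O_DIRECT; same placeholder device as above. */
        int fd = open("/dev/nvme3n1", O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096))
                return 1;
        struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_readv(sqe, fd, &iov, 1, 0);
        io_uring_submit(&ring);

        /*
         * With IOPOLL this wait reaps the completion by polling the
         * completion queue; whether the device interrupt is actually
         * avoided depends on the driver's poll queues, per the
         * discussion above.
         */
        struct io_uring_cqe *cqe;
        ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret == 0) {
                printf("res=%d\n", cqe->res);
                io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
}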