Hi Keith,

The performance results I see are very close between poll_queues and io_uring; I have posted them below. Since this topic is still pretty new to people, is there anything we need to tell the reader/user about poll_queues? What is important for usage? And can it be changed dynamically, or can poll_queues only be defined at (module) startup? My goal is to update the blog we built around testing Optane SSDs. Is there any possibility of an LWN article that goes deeper into this change to poll_queues?

What is interesting in the data below is that the clat time for io_uring is lower (better), but the performance in IOPS is not. pvsync2 is the most efficient, by a small margin, against the newer 3D XPoint device. (I have appended rough standalone sketches of the two submission paths at the bottom of this mail.)

Thanks,
Frank

Results:
kernel (elrepo) - 5.4.1-1.el8.elrepo.x86_64
cpu    - Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz - pinned to run at 3.1 GHz
fio    - fio-3.16-64-gfd988

Results of Gen2 Optane SSD with poll_queues (pvsync2) vs io_uring/hipri

pvsync2 (poll queues):

fio-3.16-64-gfd988
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=552MiB/s][r=141k IOPS][eta 00m:00s]
rand-read-4k-qd1: (groupid=0, jobs=1): err= 0: pid=10309: Tue Dec 31 10:49:33 2019
  read: IOPS=141k, BW=552MiB/s (579MB/s)(64.7GiB/120001msec)
    clat (nsec): min=6548, max=186309, avg=6809.48, stdev=497.58
     lat (nsec): min=6572, max=186333, avg=6834.24, stdev=499.28
    clat percentiles (usec):
     |  1.0000th=[    7],  5.0000th=[    7], 10.0000th=[    7],
     | 20.0000th=[    7], 30.0000th=[    7], 40.0000th=[    7],
     | 50.0000th=[    7], 60.0000th=[    7], 70.0000th=[    7],
     | 80.0000th=[    7], 90.0000th=[    7], 95.0000th=[    8],
     | 99.0000th=[    8], 99.5000th=[    8], 99.9000th=[    9],
     | 99.9500th=[   10], 99.9900th=[   18], 99.9990th=[  117],
     | 99.9999th=[  163]
   bw (  KiB/s): min=563512, max=567392, per=100.00%, avg=565635.38, stdev=846.99, samples=239
   iops        : min=140878, max=141848, avg=141408.82, stdev=211.76, samples=239
  lat (usec)   : 10=99.97%, 20=0.03%, 50=0.01%, 100=0.01%, 250=0.01%
  cpu          : usr=6.28%, sys=93.55%, ctx=408, majf=0, minf=96
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=16969949,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=552MiB/s (579MB/s), 552MiB/s-552MiB/s (579MB/s-579MB/s), io=64.7GiB (69.5GB), run=120001-120001msec

Disk stats (read/write):
  nvme3n1: ios=16955008/0, merge=0/0, ticks=101477/0, in_queue=0, util=99.95%

io_uring:

fio-3.16-64-gfd988
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=538MiB/s][r=138k IOPS][eta 00m:00s]
rand-read-4k-qd1: (groupid=0, jobs=1): err= 0: pid=10797: Tue Dec 31 10:53:29 2019
  read: IOPS=138k, BW=539MiB/s (565MB/s)(63.1GiB/120001msec)
    slat (nsec): min=1029, max=161248, avg=1204.69, stdev=219.02
    clat (nsec): min=262, max=208952, avg=5735.42, stdev=469.73
     lat (nsec): min=6691, max=210136, avg=7008.54, stdev=516.99
    clat percentiles (usec):
     |  1.0000th=[    6],  5.0000th=[    6], 10.0000th=[    6],
     | 20.0000th=[    6], 30.0000th=[    6], 40.0000th=[    6],
     | 50.0000th=[    6], 60.0000th=[    6], 70.0000th=[    6],
     | 80.0000th=[    6], 90.0000th=[    6], 95.0000th=[    6],
     | 99.0000th=[    7], 99.5000th=[    7], 99.9000th=[    8],
     | 99.9500th=[    9], 99.9900th=[   10], 99.9990th=[   52],
     | 99.9999th=[  161]
   bw (  KiB/s): min=548208, max=554504, per=100.00%, avg=551620.30, stdev=984.77, samples=239
   iops        : min=137052, max=138626, avg=137905.07, stdev=246.17, samples=239
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=99.98%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%
  cpu          : usr=7.39%, sys=92.44%, ctx=408, majf=0, minf=93
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=16548899,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=539MiB/s (565MB/s), 539MiB/s-539MiB/s (565MB/s-565MB/s), io=63.1GiB (67.8GB), run=120001-120001msec

Disk stats (read/write):
  nvme3n1: ios=16534429/0, merge=0/0, ticks=100320/0, in_queue=0, util=99.95%

Happy New Year Keith!

-----Original Message-----
From: Keith Busch <kbusch@xxxxxxxxxx>
Sent: Friday, December 20, 2019 1:21 PM
To: Ober, Frank <frank.ober@xxxxxxxxx>
Cc: linux-block@xxxxxxxxxxxxxxx; linux-nvme@xxxxxxxxxxxxxxxxxxx; Derrick, Jonathan <jonathan.derrick@xxxxxxxxx>; Rajendiran, Swetha <swetha.rajendiran@xxxxxxxxx>; Liang, Mark <mark.liang@xxxxxxxxx>
Subject: Re: Polled io for Linux kernel 5.x

On Thu, Dec 19, 2019 at 09:59:14PM +0000, Ober, Frank wrote:
> Thanks Keith, it makes sense to reserve and set it up uniquely if you
> can save hw interrupts. But why would io_uring then not need these
> queues, because a stack trace I ran shows without the special queues I
> am still entering bio_poll. With pvsync2 I can only do polled io with
> the poll_queues?

Polling can happen only if you have polled queues, so io_uring is not accomplishing anything by calling iopoll. I don't see an immediately good way to pass that information up to io_uring, though.
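
P.S. In case it helps readers of the blog, here is a minimal, untested sketch of what the pvsync2 engine with hipri=1 boils down to: a single 4KiB polled read issued with preadv2() and RWF_HIPRI. The device path is just the drive from the runs above, and as discussed, RWF_HIPRI only avoids the interrupt when the nvme driver actually has poll queues configured.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        /* Placeholder device from the runs above. */
        int fd = open("/dev/nvme3n1", O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* 4KiB aligned buffer, matching the rand-read-4k-qd1 job. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) {
                perror("posix_memalign");
                return 1;
        }
        struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

        /*
         * RWF_HIPRI asks the kernel to poll for the completion instead
         * of sleeping on an interrupt; it only saves the interrupt if
         * the driver has poll queues (nvme poll_queues > 0).
         */
        ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
        if (ret < 0)
                perror("preadv2");
        else
                printf("read %zd bytes\n", ret);

        free(buf);
        close(fd);
        return 0;
}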
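
The io_uring/hipri=1 side maps to IORING_SETUP_IOPOLL. Below is an equally rough liburing sketch (again untested, same placeholder device); per your note above, the iopoll path here only buys something when the driver has poll queues.

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        struct io_uring ring;

        /* IORING_SETUP_IOPOLL is what fio's io_uring engine uses for hipri=1. */
        int ret = io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL);
        if (ret < 0) {
                fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
                return 1;
        }

        /* IOPOLL wants O_DIRECT; same placeholder device as above. */
        int fd = open("/dev/nvme3n1", O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096))
                return 1;
        struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_readv(sqe, fd, &iov, 1, 0);
        io_uring_submit(&ring);

        /*
         * With IOPOLL this wait reaps the completion by polling the
         * completion queue; whether the device interrupt is actually
         * avoided depends on the driver's poll queues, per the
         * discussion above.
         */
        struct io_uring_cqe *cqe;
        ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret == 0) {
                printf("res=%d\n", cqe->res);
                io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
}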