Re: [io_uring] Problems using io_uring engine

Thank you for your answer.

This is how I'm verifying that it is polling: each workload runs for 2 minutes, and I compare the interrupt counts registered in /proc/interrupts for the NVMe device (the Intel Optane) at the start and at the end of the run. The interrupt count barely moves, about 25 in total, while an interrupt-based engine generates about 600K interrupts.
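In case it helps, here is roughly how I script that check; just a sketch, assuming the device's IRQ lines in /proc/interrupts contain the string "nvme":

```shell
#!/bin/sh
# Sum the per-CPU interrupt counts for all IRQ lines mentioning "nvme".
# Field 1 is the IRQ number; the numeric fields that follow are per-CPU counts.
nvme_irqs() {
    grep nvme /proc/interrupts |
        awk '{ for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i }
             END { print sum + 0 }'
}

before=$(nvme_irqs)
# ... run the fio workload here ...
after=$(nvme_irqs)
echo "nvme interrupts during the run: $((after - before))"
```

With polling working, the delta stays near zero; with IRQ-based completions it tracks the number of I/Os.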

Also, the way I'm loading the nvme driver is:

modprobe nvme poll_queues=4

As you suggested, I'm using 4 polling queues because I only have 4 physical cores. To verify that they were actually created I run:

systool -vm nvme

which confirms that 4 polling queues were created.

I also checked the file /sys/block/nvme0n1/queue/io_poll and it is set to 1. Sometimes I change /sys/block/nvme0n1/queue/io_poll_delay to switch between hybrid and classic polling, and I see differences in CPU usage, latency, IOPS, and so on.
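For anyone reproducing this, the io_poll_delay values select the polling mode; a minimal sketch, assuming the device is nvme0n1 (adjust the path; the guard lets it degrade gracefully on machines without the device):

```shell
#!/bin/sh
QDIR=/sys/block/nvme0n1/queue    # assumed device name; adjust to your system

set_poll_delay() {
    # -1  = classic busy polling
    #  0  = adaptive hybrid polling (the kernel sleeps for roughly half the
    #       expected completion time, then starts polling)
    # N>0 = hybrid polling with a fixed N-microsecond sleep before polling
    if [ -w "$QDIR/io_poll_delay" ]; then
        echo "$1" > "$QDIR/io_poll_delay"
    else
        echo "io_poll_delay not writable at $QDIR (need root and an nvme device)" >&2
    fi
}

set_poll_delay -1    # pure polling: lowest latency, burns a CPU per job
set_poll_delay 0     # hybrid polling: trades a little latency for CPU time
```

This matches the CPU-usage difference described above: classic polling pins the CPU, hybrid polling frees part of it.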

Another indication is CPU usage, which shows the CPUs almost fully occupied while polling.

Also, I tried with dmesg as you suggested and this is the output:

[627676.640431] nvme nvme0: 4/0/4 default/read/poll queues

I guess that shows that I was effectively using polling in the workloads. What is weird is that when I don't use the hipri flag it runs fine, but with interrupts instead of polling. It might be worth mentioning that I always run as root.

Does this information give you more hints about the problem? Could you please tell me on which filesystems polling is known to work 100% of the time?

Thank you for your help.

Hamilton.

On 26/05/20 12:17 a. m., Jens Axboe wrote:
On 5/25/20 4:38 PM, Hamilton Tobon Mosquera wrote:
Thank you for your answer.

I'm using ext4. I guess it supports polling because I could get sub 10
microseconds latency with an Intel Optane SSDPED1D280GA 260GB and
pvsync2. If it helps here's how I'm running it:

fio global.fio --size=50G --ioengine=io_uring --hipri --direct=1
--rw=randwrite --iodepth=256 --bs=4K --numjobs=4 --offset_increment=25%

The global.fio has:

ioengine=io_uring
hipri
direct=1
thread=1
buffered=0
size=100%
randrepeat=0
time_based
ramp_time=0
norandommap
refill_buffers
log_max_value=1
log_avg_msec=1000
group_reporting
percentile_list=50:60:70:80:90:95:99

Your help is highly appreciated, thank you.
I almost guarantee you that you are NOT using polling. Check with
vmstat 1 and look at the interrupt rate. If it's close to your IOPS
rate, then you're doing IRQ based completions. If it's closer to 0,
you're doing polled.

If your fs supports it, then you likely did not allocate poll
queues for NVMe. If nvme is builtin to the kernel, use nvme.poll_queues=N
to allocate N poll queues, or use poll_queues=N as a module parameter
if nvme is modular.

Ideally you want N to be equal to the number of CPUs in the system.
NVMe will report what it used at load time; here's an example from
my laptop:

[    2.396978] nvme nvme0: 1/8/8 default/read/poll queues

You can check this right now by looking at dmesg. If you don't
have any poll queues, preadv2 with IOCB_HIPRI will be IRQ based,
not polled. io_uring just tells you this up front with -EOPNOTSUPP.
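Those checks can be combined into one quick script; a sketch, with the sysfs paths assumed:

```shell
#!/bin/sh
# Quick sanity check: is polling available on the block device(s)?
for q in /sys/block/nvme*/queue; do
    [ -r "$q/io_poll" ] && echo "$q: io_poll=$(cat "$q/io_poll")"
done

# The queue-allocation line nvme prints at load time, if still in the buffer:
dmesg 2>/dev/null | grep 'default/read/poll queues' ||
    echo "no nvme queue line found in dmesg (try journalctl -k)"
```

If the poll count in the "default/read/poll queues" line is 0, hipri I/O cannot be polled no matter what fio asks for.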

--
Jens Axboe

The information contained in this email is addressed to its recipient only and may contain confidential information, privileged material, or information protected by copyright. Any copying, use, improper retention, modification, dissemination, distribution, or total or partial reproduction is prohibited. If you received this message in error, please contact the sender and delete it. The information contained herein is the sole responsibility of the sender; Universidad EAFIT is not responsible for the message's contents.




