Hi Jens,

While I'm not absolutely certain, this looks like a kernel side bug.

I modified fio to set IOSQE_IO_DRAIN (because I'd need that in postgres, to not just lose data the way my current prototype does). But the submissions currently fail after a very short amount of time, as soon as I use a queue depth bigger than one.

I modified fio's master like this (obviously not intended as an actual fio patch):

diff --git i/engines/io_uring.c w/engines/io_uring.c
index e5edfcd2..a3fe461f 100644
--- i/engines/io_uring.c
+++ w/engines/io_uring.c
@@ -181,16 +181,20 @@ static int fio_ioring_prep(struct thread_data *td, struct io_u *io_u)
 			sqe->len = 1;
 		}
 		sqe->off = io_u->offset;
+		sqe->flags |= IOSQE_IO_DRAIN;
 	} else if (ddir_sync(io_u->ddir)) {
 		if (io_u->ddir == DDIR_SYNC_FILE_RANGE) {
 			sqe->off = f->first_write;
 			sqe->len = f->last_write - f->first_write;
 			sqe->sync_range_flags = td->o.sync_file_range;
 			sqe->opcode = IORING_OP_SYNC_FILE_RANGE;
+
+			//sqe->flags |= IOSQE_IO_DRAIN;
 		} else {
 			if (io_u->ddir == DDIR_DATASYNC)
 				sqe->fsync_flags |= IORING_FSYNC_DATASYNC;
 			sqe->opcode = IORING_OP_FSYNC;
+			//sqe->flags |= IOSQE_IO_DRAIN;
 		}
 	}

I pretty much immediately get failed requests back with a simple fio job:

[global]
name=fio-drain-bug
size=1G
nrfiles=1
iodepth=2
ioengine=io_uring

[test]
rw=write

andres@alap4:~/src/fio$ ~/build/fio/install/bin/fio ~/tmp/fio-drain-bug.fio
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=2
fio-3.15
Starting 1 process
files=0
fio: io_u error on file test.0.0: Invalid argument: write offset=794877952, buflen=4096
fio: pid=3075, err=22/file:/home/andres/src/fio/io_u.c:1787, func=io_u error, error=Invalid argument
test: (groupid=0, jobs=1): err=22 (file:/home/andres/src/fio/io_u.c:1787, func=io_u error, error=Invalid argument): pid=3075: Fri Sep 13 12:45:19 2019

I think the EINVAL might come from

	if (unlikely(s->index >= ctx->sq_entries))
		return -EINVAL;

based on the stack trace a perf probe returns (hacketyhack):

kworker/u16:0-e  6304 [006] 28733.064781: probe:__io_submit_sqe: (ffffffff81356719)
	ffffffff8135671a __io_submit_sqe+0xfa (/lib/modules/5.3.0-rc8-andres-00007-g3120b9a6a3f7-dirty/build/vmlinux)
	ffffffff81356da0 io_sq_wq_submit_work+0xe0 (/lib/modules/5.3.0-rc8-andres-00007-g3120b9a6a3f7-dirty/build/vmlinux)
	ffffffff81137392 process_one_work+0x1d2 (/lib/modules/5.3.0-rc8-andres-00007-g3120b9a6a3f7-dirty/build/vmlinux)
	ffffffff8113758d worker_thread+0x4d (/lib/modules/5.3.0-rc8-andres-00007-g3120b9a6a3f7-dirty/build/vmlinux)
	ffffffff8113d1e6 kthread+0x106 (/lib/modules/5.3.0-rc8-andres-00007-g3120b9a6a3f7-dirty/build/vmlinux)
	ffffffff8200021a ret_from_fork+0x3a (/lib/modules/5.3.0-rc8-andres-00007-g3120b9a6a3f7-dirty/build/vmlinux)

The probe is at /home/andres/src/kernel/fs/io_uring.c:1800, which is precisely that return:

	if (unlikely(s->index >= ctx->sq_entries))
		return -EINVAL;

This is with your change to limit the number of workqueue threads applied, but I don't think that should matter.

I suspect this is hit because ->index is uninitialized, rather than actually being too large. io_sq_wq_submit_work() uses the sqe_submit embedded in the io_kiocb. ISTM that e.g. when io_uring_enter()->io_ring_submit()->io_submit_sqe() allocates the new io_kiocb, and io_queue_sqe()->io_req_defer() then hands it off to io_sq_wq_submit_work(), io_kiocb->submit.index (and some other fields?) is never initialized.

Greetings,

Andres Freund
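
PS: In case it is useful, here is a rough standalone reproducer sketch using liburing, doing approximately what the fio job above does: 4k writes at queue depth 2 with IOSQE_IO_DRAIN set on every write. This is only a sketch written from memory of the liburing API; the file name, buffer size and iteration count are arbitrary, and I haven't verified that it fails in exactly the same way as the fio run:

/*
 * drain-repro.c: keep two 4k writes in flight, each flagged IOSQE_IO_DRAIN,
 * and report any completion that comes back with a negative result.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <liburing.h>

#define QD	2
#define BS	4096
#define NR_IOS	1024

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	void *buf;
	int fd, ret;
	int inflight = 0, submitted = 0, completed = 0;

	fd = open("drain-test.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, BS, BS))
		return 1;
	memset(buf, 0xaa, BS);
	iov.iov_base = buf;
	iov.iov_len = BS;

	if (io_uring_queue_init(QD, &ring, 0) < 0) {
		fprintf(stderr, "io_uring_queue_init failed\n");
		return 1;
	}

	while (completed < NR_IOS) {
		/* keep up to QD writes in flight, each marked as a drain barrier */
		while (inflight < QD && submitted < NR_IOS) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

			if (!sqe)
				break;
			io_uring_prep_writev(sqe, fd, &iov, 1,
					     (off_t)submitted * BS);
			sqe->flags |= IOSQE_IO_DRAIN;
			submitted++;
			inflight++;
		}
		io_uring_submit(&ring);

		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret < 0) {
			fprintf(stderr, "wait_cqe: %d\n", ret);
			break;
		}
		if (cqe->res < 0)
			fprintf(stderr, "write %d failed: %d\n",
				completed, cqe->res);
		io_uring_cqe_seen(&ring, cqe);
		inflight--;
		completed++;
	}

	io_uring_queue_exit(&ring);
	return 0;
}

Something along the lines of "gcc drain-repro.c -luring" should build it, assuming liburing is installed.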