Re: [PATCH 07/18] io_uring: support for IO polling

On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> Add support for a polled io_uring context. When a read or write is
> submitted to a polled context, the application must poll for completions
> on the CQ ring through io_uring_enter(2). Polled IO may not generate
> IRQ completions, hence they need to be actively found by the application
> itself.
>
> To use polling, io_uring_setup() must be used with the
> IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
> polled and non-polled IO on an io_uring.
>
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
[...]
> @@ -102,6 +102,8 @@ struct io_ring_ctx {
>
>         struct {
>                 spinlock_t              completion_lock;
> +               bool                    poll_multi_file;
> +               struct list_head        poll_list;

Please add a comment explaining what protects poll_list against
concurrent modification, and ideally also put lockdep asserts in the
functions that access the list to allow the kernel to sanity-check the
locking at runtime.

As far as I understand:
Elements are added by io_iopoll_req_issued(), which can't race with
itself because, depending on IORING_SETUP_SQPOLL, you either have to
come through sys_io_uring_enter() (which takes the uring_lock) or
from the single-threaded io_sq_thread().
io_do_iopoll() iterates over the list and removes completed items. It
is called through io_iopoll_getevents(), which can be invoked in two
ways during normal operation:
 - sys_io_uring_enter -> __io_uring_enter -> io_iopoll_check ->
   io_iopoll_getevents; this is only protected by the uring_lock
 - io_sq_thread -> io_iopoll_check -> io_iopoll_getevents; this
   doesn't hold any locks
Additionally, io_iopoll_getevents() can be reached through the
following exit paths:
 - io_sq_thread -> io_iopoll_reap_events -> io_iopoll_getevents
 - io_uring_release -> io_ring_ctx_wait_and_kill ->
   io_iopoll_reap_events -> io_iopoll_getevents
 - io_uring_release -> io_ring_ctx_wait_and_kill -> io_ring_ctx_free
   -> io_iopoll_reap_events -> io_iopoll_getevents

So as far as I can tell, these paths can run concurrently, and you
can end up with races on poll_list access - e.g. the lockless
io_sq_thread() paths racing against a task in sys_io_uring_enter(),
or against the release-time reaping.
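
If the intent is that uring_lock protects the list everywhere, one
option might be to also take it around the reap/exit paths. Rough
sketch, assuming uring_lock is a mutex that can actually be taken in
all of these contexts (I haven't verified that, especially for the
release paths), and with the io_iopoll_getevents() arguments only
illustrative:

	mutex_lock(&ctx->uring_lock);
	while (!list_empty(&ctx->poll_list)) {
		unsigned int nr_events = 0;

		io_iopoll_getevents(ctx, &nr_events, 1);
	}
	mutex_unlock(&ctx->uring_lock);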

[...]
> +static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
> +{
> +       if (*nr) {
> +               kmem_cache_free_bulk(req_cachep, *nr, reqs);
> +               io_ring_drop_ctx_refs(ctx, *nr);
> +               *nr = 0;
> +       }
> +}
[...]
> +static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
> +                              struct list_head *done)
> +{
> +       void *reqs[IO_IOPOLL_BATCH];
> +       struct io_kiocb *req;
> +       int to_free = 0;
> +
> +       while (!list_empty(done)) {
> +               req = list_first_entry(done, struct io_kiocb, list);
> +               list_del(&req->list);
> +
> +               io_cqring_fill_event(ctx, req->user_data, req->error, 0);
> +
> +               reqs[to_free++] = req;
> +               (*nr_events)++;
> +
> +               fput(req->rw.ki_filp);
> +               if (to_free == ARRAY_SIZE(reqs))
> +                       io_free_req_many(ctx, reqs, &to_free);
> +       }
> +       io_commit_cqring(ctx);
> +
> +       if (to_free)
> +               io_free_req_many(ctx, reqs, &to_free);

Nit: You check to_free here, and io_free_req_many() checks it again
via *nr. One of the two checks is redundant; I'd probably drop this
one.
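
With this check dropped, the tail of io_iopoll_complete() would just
be:

	io_commit_cqring(ctx);

	io_free_req_many(ctx, reqs, &to_free);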

> +}
[...]
