Re: [LSF/MM/BPF TOPIC] programmable IO control flow with io_uring and BPF

Jens Axboe <axboe@xxxxxxxxx> · Fri, 31 Jan 2020 14:30:06 -0700

On 1/24/20 7:18 AM, Pavel Begunkov wrote:
> Apart from concurrent IO execution, io_uring allows to issue a sequence
> of operations, a.k.a links, where requests are executed sequentially one
> after another. If an "error" happened, the rest of the link will be
> cancelled.
> 
> The problem is what to consider an "error". For example, if we
> read less bytes than have been asked for, the link will be cancelled.
> It's necessary to play safe here, but this implies a lot of overhead if
> that isn't the desired behaviour. The user would need to reap all
> cancelled requests, analyse the state, resubmit them and suffer from
> context switches and all in-kernel preparation work. And there are
> dozens of possibly desirable patterns, so it's just not viable to
> hard-code them into the kernel.
> 
> The other problem is to keep in running even when a request depends on
> a result of the previous one. It could be simple passing return code or
> something more fancy, like reading from the userspace.
> 
> And that's where BPF will be extremely useful. It will control the flow
> and do steering.
> 
> The concept is to be able run a BPF program after a request's
> completion, taking the request's state, and doing some of the following:
> 1. drop a link/request
> 2. issue new requests
> 3. link/unlink requests
> 4. do fast calculations / accumulate data
> 5. emit information to the userspace (e.g. via ring's CQ)
> 
> With that, it will be possible to have almost context-switch-less IO,
> and that's really tempting considering how fast current devices are.
> 
> What to discuss:
> 1. use cases
> 2. control flow for non-privileged users (e.g. allowing some popular
>    pre-registered patterns)
> 3. what input the program needs (e.g. last request's
>    io_uring_cqe) and how to pass it.
> 4. whether we need notification via CQ for each cancelled/requested
>    request, because sometimes they only add noise
> 5. BPF access to user data (e.g. allow to read only registered buffers)
> 6. implementation details. E.g.
>    - how to ask to run BPF (e.g. with a new opcode)
>    - having global BPF, bound to an io_uring instance or mixed
>    - program state and how to register
>    - rework notion of draining and sequencing
>    - live-lock avoidance (e.g. double check io_uring shut-down code)

I think this is a key topic that we should absolutely discuss at LSFMM.

-- 
Jens Axboe