Re: [RFC] Programming model for io_uring + eBPF

On 5/5/21 1:57 PM, Christian Dietrich wrote:
> Pavel Begunkov <asml.silence@xxxxxxxxx> [01. May 2021]:
> 
>> No need to register maps, a map is passed as an fd. Looks like I have
>> never posted any example, but you can even compile fds into bpf
>> programs at runtime.
>>
>> fd = ...;
>> bpf_prog[] = { ..., READ_MAP(fd), ...};
>> compile_bpf(bpf_prog);
>>
>> I was thinking about the opposite: optionally allowing to register
>> them to save on atomic ops. Not in the next patch iteration, though.
> 
> That was exactly my idea:  make it possible to register them
> optionally.
> 
> I also think that registering the eBPF program FD could be made
> optional, since this would be similar to other FD references that are
> submitted to io_uring.

To wrap it up: map registration was neither required nor possible from
the beginning, but it may be interesting to allow it optionally. And
the opposite for programs, which currently have to be registered.
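
To illustrate what optional could look like on the user side, a rough
sketch (all the IORING_*/IOSQE_* bpf names here are hypothetical,
nothing is set in stone):

    /* unregistered: the SQE carries the bpf program fd directly */
    sqe->opcode = IORING_OP_BPF;            /* hypothetical opcode */
    sqe->fd = bpf_prog_fd;

    /* registered: resolve through a slot set up in advance, saving
     * the per-SQE fd lookup and atomics */
    io_uring_register(ring_fd, IORING_REGISTER_BPF_PROGS, &bpf_prog_fd, 1);
    sqe->flags |= IOSQE_FIXED_BPF;          /* hypothetical flag */
    sqe->fd = 0;                            /* slot index */

Same shape as fixed files: the fd mode is more flexible, the
registered mode is cheaper per request.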
 
>> Yes, at least that's how I envision it. But it's a good question
>> what else we want to pass. E.g. some arbitrary id (u64) that's
>> forwarded to the program; it could be a pointer to somewhere or just
>> a piece of current state.
> 
> So, we will have the regular user_data attribute there. So why add
> another potential argument to the program, besides passing an eBPF map?

It may be more flexible or performant. I'm not eager to add more to
it, but we'll see whether it's good enough as is once we write some
larger bpf programs.
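
For instance, a program could look roughly like this (the context
struct and section name are made up for illustration, none of this is
settled API):

    #include <linux/types.h>
    #include <bpf/bpf_helpers.h>

    /* hypothetical context: what the kernel could hand the program */
    struct io_uring_bpf_ctx {
            __u64 user_data;    /* from the originating sqe */
            __u64 arg;          /* the extra id: a pointer or state */
    };

    SEC("iouring")
    int io_bpf(struct io_uring_bpf_ctx *ctx)
    {
            /* e.g. treat the extra id as a pointer to app state */
            void *state = (void *)(unsigned long)ctx->arg;

            return state ? 0 : 1;
    }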

>>> One thought that came to my mind: Why do we have to register the eBPF
>>> programs and maps? We could also just pass the FDs for those objects
>>> in the SQE. As long as there is no other state, it could be the
>>> userspace's choice to either attach them or pass them every time. For
>>> other FDs we already support both modes, right?
>>
>> It doesn't register maps (see above). For not registering/attaching in
>> advance -- interesting idea, I need to double check with the bpf guys.
> 
> My feeling would be the following: In any case, at submission we have to
> resolve the SQE references to a 'bpf_prog *', if this is done via a
> registered program in io_ring_ctx or via calling 'bpf_prog_get_type' is
> only a matter of the frontend.

Yes, point taken.
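
Roughly, the submission side would do something like this (a sketch:
REQ_F_BPF_REGISTERED and the prog table are made up, only
bpf_prog_get_type() is the real helper):

    static struct bpf_prog *io_get_bpf_prog(struct io_ring_ctx *ctx,
                                            const struct io_uring_sqe *sqe,
                                            unsigned int req_flags)
    {
            if (req_flags & REQ_F_BPF_REGISTERED)
                    /* registered: reference taken once, at registration */
                    return ctx->bpf_progs[READ_ONCE(sqe->off)];

            /* unregistered: grab a reference for every single SQE */
            return bpf_prog_get_type(READ_ONCE(sqe->fd),
                                     BPF_PROG_TYPE_IOURING);
    }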

> The only difference would be the reference counting. For registered
> bpf_progs, we can increment the refcounter only once. For unregistered
> ones, we would have to increment it for every SQE. Don't know if the
> added flexibility is worth the effort.
> 
>>> If we make it possible to pass an FD to a synchronization object
>>> (e.g. a semaphore), this might do the trick to support both modes at
>>> the interface.
> 
> I was thinking further about this. And really, I could not come up with
> a proper user-space visible 'object' that we could grab with an FD to
> attach this semantic to, as futexes come in the form of a piece of
> memory.

Right, not just like futexes, but exactly futexes. There is interest
in having futex requests, so I need to write it up.
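
Something along these lines, mirroring futex(2) (the opcode and field
use are hypothetical, that's exactly the part that needs writing up):

    /* block while *futex_word == expected, as FUTEX_WAIT does */
    sqe->opcode = IORING_OP_FUTEX_WAIT;     /* hypothetical opcode */
    sqe->addr = (unsigned long)futex_word;  /* plain user memory, no fd */
    sqe->len = expected;

Which is the point above: the object is just a piece of memory, so
there is no fd-grabbable handle to speak of.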

> So perhaps, we would do something like
> 
>     // alloc 3 groups
>     io_uring_register(fd, REGISTER_SYNCHRONIZATION_GROUPS, 3);
> 
>     // submit a synchronized SQE
>     sqe->flags |= IOSQE_SYNCHRONIZE;
>     sqe->synchronize_group = 1;     // could probably be restricted to uint8_t.
> 
> When looking at this, it could generally be a nice feature to have
> for SQEs, no? With it, the user could insert all of their SQEs and they
> would run sequentially. In contrast to SQE linking, the order of the
> SQEs would not be predetermined, which might be beneficial at some
> point.

It would be in the common path, hurting performance for those not
using it, and with no clear benefit that can't be had in userspace.
And io_uring is thin enough that all those extra ifs affect end
performance.

Let's revisit it if we run out of userspace options.
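
For reference, the userspace variant is straightforward: keep only one
SQE of a group in flight and submit the group's next one on completion
(given an initialised ring; queue_next() and group_of() are made-up
helpers):

    struct io_uring_cqe *cqe;

    queue_next(&ring, &groups[1]);           /* one in-flight SQE per group */
    while (io_uring_wait_cqe(&ring, &cqe) == 0) {
            int grp = group_of(cqe->user_data);

            handle_completion(cqe);
            io_uring_cqe_seen(&ring, cqe);
            queue_next(&ring, &groups[grp]); /* release the group's next SQE */
    }

No extra kernel-side branching, and the order within a group is
whatever the app's queue says it is.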
 
>>> - FD to synchronization object during the execution
>>
>> It can be in the context, in theory. We need to check the available
>> synchronisation means.
> 
> In the context you mean: there could be a pointer to user memory, and
> the bpf program calls futex_wait?
> 
>> The idea is to add several CQs to a single ring, regardless of BPF,
>> and to let any request of any type specify sqe->cq_idx, and the CQE
>> goes there. And then BPF can benefit from it, sending its requests
>> to the right CQ, not necessarily to a single one.
>> It will be the responsibility of the userspace to make use of the
>> CQs correctly, e.g. allocating a CQ per BPF program and so avoiding
>> any sync, or doing something more complex, cross-posting to several
>> CQs.
> 
> Would this be done at creation time or by registering them? I think that
> creation time would be sufficient and it would ease the housekeeping. In
> that case, we could get away with making all CQs equal in size and only
> read out the number of the additional CQs from the io_uring_params.

At creation time. I don't see a reason to move it past creation, and
there is a small optimisation weighing in favour of it.
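
On the API side it could then be as simple as this (nr_cq and cq_idx
are hypothetical names, nothing of the sort is merged):

    struct io_uring_params p = { };
    int ring_fd;

    p.nr_cq = 4;                    /* main CQ plus 3 extra, equal size */
    ring_fd = io_uring_setup(128, &p);

    /* later, per request: route the completion */
    sqe->cq_idx = 2;

With the actual number of CQs read back from io_uring_params, as you
say.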

> 
> When adding a sqe->cq_idx: Do we have to introduce an IOSQE_SELECT_OTHER
> flag to not break userspace, for backwards compatibility, or is it
> sufficient to assume that the user has zeroed the SQE beforehand?

Fine by me assuming zeros.
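
Anyone building SQEs by hand has to keep zeroing them anyway:

    memset(sqe, 0, sizeof(*sqe));   /* cq_idx (hypothetical field) stays 0 */
    sqe->opcode = IORING_OP_NOP;

and cq_idx == 0 maps to the main CQ, preserving today's behaviour.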

> However, I like the idea of making it a generic field of SQEs.
> 
>>> When taking a step back, this is nearly an io_uring_enter(minwait=N)-SQE
>>
>> Think of it as a separate request type (it won't be separate, but
>> still...) waiting on the CQ, similarly to io_uring_enter(minwait=N)
>> but without blocking.
>>
>>> with an attached eBPF callback, no? At that point, we have nearly come
>>> full circle.
>>
>> What do you mean?
> 
> I was just joking: with a minwait=N-like field in the eBPF-SQE, we can
> mimic the behavior of io_uring_enter(minwait=N), but with an SQE. On
> the result of that eBPF-SQE we can then actually wait with
> io_uring_enter(...).

-- 
Pavel Begunkov


