Re: [RFC] Programming model for io_uring + eBPF

Christian Dietrich <stettberger@xxxxxxxxxxx> · Wed, 05 May 2021 14:57:48 +0200

Pavel Begunkov <asml.silence@xxxxxxxxx> [01. May 2021]:

> No need to register maps, it's passed as an fd. Looks I have never
> posted any example, but you can even compile fds in in bpf programs
> at runtime.
>
> fd = ...;
> bpf_prog[] = { ..., READ_MAP(fd), ...};
> compile_bpf(bpf_prog);
>
> I was thinking about the opposite, to allow to optionally register
> them to save on atomic ops. Not in the next patch iteration though

That was exactly my idea:  make it possible to register them
optionally.

I also think that registering the eBPF program FD could be made
optionally, since this would be similar to other FD references that are
submitted to io_uring.

> Yes, at least that's how I envision it. But that's a good question
> what else we want to pass. E.g. some arbitrary ids (u64), that's
> forwarded to the program, it can be a pointer to somewhere or just
> a piece of current state.

So, we will have the regular user_data attribute there. So why add
another potential argument to the program, besides passing an eBPF map?

>> One thought that came to my mind: Why do we have to register the eBPF
>> programs and maps? We could also just pass the FDs for those objects in
>> the SQE. As long as there is no other state, it could be the userspaces
>> choice to either attach it or pass it every time. For other FDs we
>> already support both modes, right?
>
> It doesn't register maps (see above). For not registering/attaching in
> advance -- interesting idea, I need it double check with bpf guys.

My feeling would be the following: In any case, at submission we have to
resolve the SQE references to a 'bpf_prog *', if this is done via a
registered program in io_ring_ctx or via calling 'bpf_prog_get_type' is
only a matter of the frontend.

The only difference would be the reference couting. For registered
bpf_progs, we can increment the refcounter only once. For unregistered,
we would have to increment it for every SQE. Don't know if the added
flexibility is worth the effort.

>> If we make it possible to pass some FD to an synchronization object
>> (e.g. semaphore), this might do the trick to support both modes at the interface.

I was thinking further about this. And, really I could not come up with
a proper user-space visible 'object' that we can grab with an FD to
attach this semantic to as futexes come in form of a piece of memory.

So perhaps, we would do something like

    // alloc 3 groups
    io_uring_register(fd, REGISTER_SYNCHRONIZATION_GROUPS, 3);

    // submit a synchronized SQE
    sqe->flags |= IOSQE_SYNCHRONIZE;
    sqe->synchronize_group = 1;     // could probably be restricted to uint8_t.

When looking at this, this could generally be a nice feature to have
with SQEs, or? Hereby, the user could insert all of his SQEs and they
would run sequentially. In contrast to SQE linking, the order of SQEs
would not be determined, which might be beneficial at some point.

>> - FD to synchronization object during the execution
>
> can be in the context, in theory. We need to check available
> synchronisation means

In the context you mean: there could be a pointer to user-memory and the
bpf program calls futex_wait?

> The idea is to add several CQs to a single ring, regardless of BPF,
> and for any request of any type use specify sqe->cq_idx, and CQE
> goes there. And then BPF can benefit from it, sending its requests
> to the right CQ, not necessarily to a single one.
> It will be the responsibility of the userspace to make use of CQs
> right, e.g. allocating CQ per BPF program and so avoiding any sync,
> or do something more complex cross posting to several CQs.

Would this be done at creation time or by registering them? I think that
creation time would be sufficient and it would ease the housekeeping. In
that case, we could get away with making all CQs equal in size and only
read out the number of the additional CQs from the io_uring_params.

When adding a sqe->cq_idx: Do we have to introduce an IOSQE_SELET_OTHER
flag to not break userspace for backwards compatibility or is it
sufficient to assume that the user has zeroed the SQE beforehand?

However, I like the idea of making it a generic field of SQEs.

>> When taking a step back, this is nearly a io_uring_enter(minwait=N)-SQE
>
> Think of it as a separate request type (won't be a separate, but still...)
> waiting on CQ, similarly to io_uring_enter(minwait=N) but doesn't block.
>
>> with an attached eBPF callback, or? At that point, we are nearly full
>> circle.
>
> What do you mean?

I was just joking: With a minwait=N-like field in the eBPF-SQE, we can
mimic the behavior of io_uring_enter(minwait=N) but with an SQE. On the
result of that eBPF-SQE we can now actually wait with io_uring_enter(..)

chris
-- 
Prof. Dr.-Ing. Christian Dietrich
Operating System Group (E-EXK4)
Technische Universität Hamburg
Am Schwarzenberg-Campus 3 (E)
21073 Hamburg

eMail:  christian.dietrich@xxxxxxx
Tel:    +49 40 42878 2188
WWW:    https://osg.tuhh.de/