Re: [RFC] Programming model for io_uring + eBPF

On 5/12/21 12:20 PM, Christian Dietrich wrote:
> Pavel Begunkov <asml.silence@xxxxxxxxx> [07. May 2021]:
> 
>>> The semantics of the following SQE would become: append this SQE to
>>> the SQE-link chain with the name '1'. If the link chain has
>>> completed, start a new one. Thereby, the user could add an SQE to an
>>> existing link chain, even if other SQEs were already submitted.
>>>
>>>>     sqe->flags |= IOSQE_SYNCHRONIZE;
>>>>     sqe->synchronize_group = 1;     // could probably be restricted to uint8_t.
>>>
>>> Implementation wise, we would hold a pointer to the last element of the
>>> implicitly generated link chain.
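(A minimal sketch of that idea, with hypothetical names; io_kiocb is
the kernel's per-request struct, the rest is assumed:)

    /* hypothetical: one tail pointer per synchronize_group, so a newly
     * arriving SQE can be appended to the implicit link chain */
    struct io_sync_group {
        struct io_kiocb *tail;   /* last request in the chain, or NULL */
    };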
>>
>> It will be in the common path, hurting performance for those not
>> using it, and with no clear benefit that can't be implemented in
>> userspace.
>> And io_uring is thin enough for all those extra ifs to affect end
>> performance.
>>
>> Let's consider if we run out of userspace options.
> 
> To summarize my proposal: I want io_uring to support implicit
> synchronization by sequentialization at submit time. Doing this would
> avoid the overheads of locking (and potentially sleeping).
> 
> So the problem that I see with a userspace solution is the following:
> If I want to sequentialize an SQE with another SQE that was submitted
> waaaaaay earlier, the usual IOSQE_IO_LINK cannot be used, as I cannot
> set the link flag of that already-submitted SQE. Therefore, I would
> have to wait in userspace for the CQE and submit my second SQE later
> on.
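(For reference, the userspace round trip described above would look
roughly like this with liburing; prep_first()/prep_second() are
hypothetical stand-ins for the two operations, and the ring is assumed
to be set up already:)

    struct io_uring_cqe *cqe;

    prep_first(io_uring_get_sqe(&ring));    /* sqe1, submitted early */
    io_uring_submit(&ring);
    /* ... much later: block until sqe1's CQE arrives ... */
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);
    prep_second(io_uring_get_sqe(&ring));   /* sqe2 can only go in now */
    io_uring_submit(&ring);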
> 
> Especially if the goal is to remain in kernel space for as long as
> possible via eBPF SQEs, this is not optimal.
> 
>> Such things go really horribly with performant APIs like io_uring,
>> even if not used. Just see IOSQE_IO_DRAIN: it's maybe almost never
>> used, but it's still in the hot path.
> 
> If we extend the semantics of IOSQE_IO_LINK instead of introducing a
> new flag, we should be able to limit the problem, no?
> 
> - With synchronize_group=0, the usual link-the-next SQE semantic could
>   remain.
> - While synchronize_group!=0 could expose the described synchronization
>   semantic.
> 
> Thereby, the overhead is at least hidden behind the existing check for
> IOSQE_IO_LINK, which is there anyway. Do you consider IOSQE_IO_LINK=1
> part of the hot path?
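(Roughly the nesting being proposed, as a hypothetical submission-path
sketch; append_to_group_chain()/link_to_next_sqe() are made-up
helpers:)

    /* the group lookup only runs once the existing IOSQE_IO_LINK test
     * has fired, so requests without the link flag pay nothing extra */
    if (sqe->flags & IOSQE_IO_LINK) {
        if (sqe->synchronize_group)        /* proposed field, != 0 */
            append_to_group_chain(sqe);
        else
            link_to_next_sqe(sqe);         /* today's semantics */
    }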

Let's clarify in case I misunderstood you. In the snippet below, should
it serialise execution of sqe1 and sqe2 so that they don't run
concurrently? Once a request is submitted we don't keep an explicit
reference to it, and trying to find it is hard and unreliable, so this
would not really happen at "submission" time but would require
additional locking:

1) either, on completion, a request looks up its group, but then
submission needs an extra spinlock to maintain e.g. a per-group list
(as sketched below);
2) or we try to find a running request and append to its linked list,
but that won't work;
3) or we do some other magic, but all the options would be far from
free.
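(A hypothetical sketch of option 1's submission side, assuming a
per-group structure carrying a spinlock and a request list:)

    /* the "+1 spinlock" per grouped request, maintaining a list
     * that the completion side can later search */
    spin_lock(&group->lock);
    list_add_tail(&req->group_node, &group->reqs);
    spin_unlock(&group->lock);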

If it shouldn't serialise in this case, then I don't see much
difference from IOSQE_IO_LINK.

prep_sqe1(group=1);   /* first SQE in group 1 */
submit();
prep_sqe2(group=1);   /* same group, prepared after sqe1 was submitted */
submit();

-- 
Pavel Begunkov


