Pavel Begunkov <asml.silence@xxxxxxxxx> [21. May 2021]: >> The problem that I see is that eBPF in io_uring breaks this fine >> synchronization as eBPF SQE submission and userspace SQE submission can >> run in parallel. > > It definitely won't be a part of ABI, but they actually do serialise > at the moment. They serialize because they are executed by the same worker thread, right? Perhaps that is the solution to my synchronization problem. If/when io_uring supports more than one eBPF executioners, we should make the number of executors configurable at setup time. Thereby, the user can implicitly manage serialization of eBPF execution. >> But going back to my original wish: I wanted to ensure that I can >> serialize eBPF-SQEs such that I'm sure that they do not run in parallel. >> My idea was to use synchronization groups as a generalization of >> SQE linking in order to make it also useful for others (not only for eBPF). > > So, let's dissect it a bit more, why do you need serialising as > such? What use case you have in mind, and let's see if it's indeed > can't be implemented efficiently with what we have. What I want to do is to manipulate (read-calculate-update) user memory from eBPF without the need to synchronize between eBPF invocations. As eBPF invocations have a run-to-completion semantic, it feels bad to use lock-based synchronization. Besides waiting for user-memory to be swapped in, they will be usually short and plug together results and newly emitted CQEs. > To recap: BPFs don't share SQ with userspace at all, and may have > separate CQs to reap events from. You may post an event and it's > wait synchronised, so may act as a message-based synchronisation, > see test3 in the recently posted v2 for example. I'll also be > adding futex support (bpf + separate requests), it might > play handy for some users. I'm sure that it is possible to use those mechanisms for synchronizing, but I assume that explicit synchronization (locks, passing tokens around) is more expensive than sequenzializing requests (implicit synchronization) before starting to execute them. But probably, we need some benchmarks to see what performs better. >> My reasoning being not doing this serialization in userspace is that I >> want to use the SQPOLL mode and execute long chains of >> IO/computation-SQEs without leaving the kernelspace. > > btw, "in userspace" is now more vague as it can be done by BPF > as well. For some use cases I'd expect BPF acting as a reactor, > e.g. on completion of previous CQEs and submitting new requests > in response, and so keeping it entirely in kernel space until > it have anything to tell to the userspace, e.g. by posting > into the main CQ. Yes, exactly that is my motivation. But I also think that it is a useful pattern to have many eBPF callbacks pending (e.g. one for each connection). With one pending invocation per connection, synchronization with a fixed number of additional CQEs might be problematic: For example, for a per-connection barrier synchronization with the CQ-reap approach, one needs one CQ for each connection. >> The problem that I had when thinking about the implementation is that >> IO_LINK semantic works in the wrong direction: Link the next SQE, >> whenever it comes to this SQE. If it would be the other way around >> ("Link this SQE to the previous one") it would be much easier as the >> cost would only arise if we actually request linking. But compatibility.. > > Stack vs queue style linking? If I understand what you mean right, that's > because this is how SQ is parsed and so that's the most efficient way. No, I did not want to argue about the ordering within the link chain, but with the semantic of link flag. I though that it might have been beneficial to let the flag indicate that the SQE should be linked to the previous one. However, after thinking this through in more detail I now believe that it does not make any difference for the submission path. >> Ok, but what happens if the last SQE in an io_submit_sqes() call >> requests linking? Is it envisioned that the first SQE that comes with >> the next io_submit_sqes() is linked to that one? > > No, it doesn't leave submission boundary (e.g. a single syscall). > In theory may be left there _not_ submitted, but don't see much > profit in it. > >> If this is not supported, what happens if I use the SQPOLL mode where >> the poller thread can partition my submitted SQEs at an arbitrary >> point into multiple io_submit_sqes() calls? > > It's not arbitrary, submission is atomic in nature, first you fill > SQEs in memory but they are not visible to SQPOLL in a meanwhile, > and then you "commit" them by overwriting SQ tail pointer. > > Not a great exception for that is shared SQPOLL task, but it > just waits someone to take care of the case. > > if (cap_entries && to_submit > 8) > to_submit = 8; > >> If this is supported, link.head has to point to the last submitted SQE after >> the first io_submit_sqes()-call. Isn't then appending SQEs in the >> second io_submit_sqes()-call racy with the completion side. (With the >> same problems that I tried to solve? > > Exactly why it's not supported Thank you for this detailed explanation. I now understand the design decision with the SQE linking much better and why delayed linking of SQEs introduces linking between completion and submission side that is undesirable. chris -- Prof. Dr.-Ing. Christian Dietrich Operating System Group (E-EXK4) Technische Universität Hamburg Am Schwarzenberg-Campus 3 (E), 4.092 21073 Hamburg eMail: christian.dietrich@xxxxxxx Tel: +49 40 42878 2188 WWW: https://osg.tuhh.de/