On 5/27/21 12:12 PM, Christian Dietrich wrote:
> Pavel Begunkov <asml.silence@xxxxxxxxx> [21. May 2021]:
>
>>> The problem that I see is that eBPF in io_uring breaks this fine
>>> synchronization as eBPF SQE submission and userspace SQE submission can
>>> run in parallel.
>>
>> It definitely won't be a part of the ABI, but they actually do serialise
>> at the moment.
>
> They serialize because they are executed by the same worker thread,
> right?

No, because all submissions are currently done under the ctx->uring_lock
mutex and also executed in the submitting task's context (no io-wq). I hope
to get rid of the second restriction, i.e. prior to upstreaming, and maybe
do something about the first one in the far future if there is a need.

> Perhaps that is the solution to my synchronization problem. If/when
> io_uring supports more than one eBPF executor, we should make the
> number of executors configurable at setup time. Thereby, the user can
> implicitly manage serialization of eBPF execution.

I wouldn't make it a part of the ABI, but it may be good enough for
experiments. So it'll be UB and may break unless we figure something out.

... actually, it sounds like a sane idea to do the lock grabbing lazily,
i.e. only if the program actually submits requests. That may make sense if
you control several requests by BPF, e.g. keeping QD + batching.

>>> But going back to my original wish: I wanted to ensure that I can
>>> serialize eBPF-SQEs such that I'm sure that they do not run in parallel.
>>> My idea was to use synchronization groups as a generalization of
>>> SQE linking in order to make it also useful for others (not only for eBPF).
>>
>> So, let's dissect it a bit more: why do you need serialising as such?
>> What use case do you have in mind? Let's see if it indeed can't be
>> implemented efficiently with what we have.
>
> What I want to do is to manipulate (read-calculate-update) user memory
> from eBPF without the need to synchronize between eBPF invocations.
>
> As eBPF invocations have run-to-completion semantics, it feels bad to
> use lock-based synchronization. Besides waiting for user memory to be
> swapped in, they will usually be short and plug together results and
> newly emitted CQEs.

Swapping can't be done under spinlocks; that's the reason why userspace
reads/writes are available only to sleepable BPF.

btw, I was thinking about adding atomic ops on userspace memory. I can
code up an easy solution, but it may have too much additional overhead.
Same situation as with normal loads/stores being slow, mentioned below.

>
>> To recap: BPFs don't share the SQ with userspace at all, and may have
>> separate CQs to reap events from. You may post an event, and it's
>> wait-synchronised, so it may act as message-based synchronisation,
>> see test3 in the recently posted v2 for example. I'll also be
>> adding futex support (bpf + separate requests), it might
>> come in handy for some users.
>
> I'm sure that it is possible to use those mechanisms for synchronizing,
> but I assume that explicit synchronization (locks, passing tokens
> around) is more expensive than sequentializing requests (implicit
> synchronization) before starting to execute them.

As you know, it depends on the work/synchronisation ratio, but I'm also
concerned about scalability. Consider the same argument applied to
sequentializing normal user threads (e.g. pthreads). It's not quite the
same extent, but it shows the idea.

> But probably, we need some benchmarks to see what performs better.

Would be great to have some in any case.

Userspace read/write may be slow. It can be solved (if it's a problem), but
it would require work to be done on the BPF kernel side.
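Nothing io_uring specific, but even a trivial userspace-only skeleton like
the one below (completely untested, thread/iteration counts picked
arbitrarily) would give a first ballpark for the raw cost of a contended
lock vs. an atomic RMW on shared memory. The interesting variant for your
case would add the actual per-request work, since it's the
work/synchronisation ratio that decides it.

/* completely untested sketch: THREADS threads hammer one shared counter,
 * once under a pthread mutex and once with a C11 atomic RMW, to get a
 * ballpark for the bare synchronisation cost (no real per-request work).
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define THREADS	4
#define ITERS	(1 << 20)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t plain_counter;
static _Atomic uint64_t atomic_counter;

static void *mutex_worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		pthread_mutex_lock(&lock);
		plain_counter++;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

static void *atomic_worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++)
		atomic_fetch_add_explicit(&atomic_counter, 1,
					  memory_order_relaxed);
	return NULL;
}

static double run(void *(*fn)(void *))
{
	pthread_t t[THREADS];
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (int i = 0; i < THREADS; i++)
		pthread_create(&t[i], NULL, fn, NULL);
	for (int i = 0; i < THREADS; i++)
		pthread_join(t[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &b);

	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	printf("mutex:  %.3f s\n", run(mutex_worker));
	printf("atomic: %.3f s\n", run(atomic_worker));
	return 0;
}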
>>> My reasoning behind not doing this serialization in userspace is that I
>>> want to use the SQPOLL mode and execute long chains of
>>> IO/computation-SQEs without leaving the kernelspace.
>>
>> btw, "in userspace" is now more vague, as it can be done by BPF
>> as well. For some use cases I'd expect BPF acting as a reactor,
>> e.g. acting on completions of previous CQEs and submitting new requests
>> in response, and so keeping it entirely in kernel space until
>> it has anything to tell the userspace, e.g. by posting
>> into the main CQ.
>
> Yes, exactly that is my motivation. But I also think that it is a useful
> pattern to have many eBPF callbacks pending (e.g. one for each
> connection).

Good case. One more point is how much generality such cases need.
E.g. there are some data structures in BPF, I don't remember the names,
but something like perf buffers or perf rings.

>
> With one pending invocation per connection, synchronization with a fixed
> number of additional CQEs might be problematic: For example, for a
> per-connection barrier synchronization with the CQ-reap approach, one
> needs one CQ for each connection.
>
>>> The problem that I had when thinking about the implementation is that
>>> the IO_LINK semantic works in the wrong direction: Link the next SQE,
>>> whenever it comes, to this SQE. If it were the other way around
>>> ("Link this SQE to the previous one") it would be much easier, as the
>>> cost would only arise if we actually request linking. But compatibility..
>>
>> Stack vs queue style linking? If I understand what you mean right, that's
>> because this is how the SQ is parsed, and so that's the most efficient way.
>
> No, I did not want to argue about the ordering within the link chain,
> but about the semantics of the link flag. I thought that it might have
> been beneficial to let the flag indicate that the SQE should be linked to
> the previous one. However, after thinking this through in more detail, I
> now believe that it does not make any difference for the submission path.
>
>>> Ok, but what happens if the last SQE in an io_submit_sqes() call
>>> requests linking? Is it envisioned that the first SQE that comes with
>>> the next io_submit_sqes() is linked to that one?
>>
>> No, it doesn't leave the submission boundary (e.g. a single syscall).
>> In theory it may be left there _not_ submitted, but I don't see much
>> profit in it.
>>
>>> If this is not supported, what happens if I use the SQPOLL mode, where
>>> the poller thread can partition my submitted SQEs at an arbitrary
>>> point into multiple io_submit_sqes() calls?
>>
>> It's not arbitrary, submission is atomic in nature: first you fill
>> SQEs in memory, but they are not visible to SQPOLL in the meantime,
>> and then you "commit" them by overwriting the SQ tail pointer.
>>
>> Not a great exception to that is the shared SQPOLL task, but it
>> just waits for someone to take care of the case.
>>
>> 	if (cap_entries && to_submit > 8)
>> 		to_submit = 8;
>>
>>> If this is supported, link.head has to point to the last submitted SQE after
>>> the first io_submit_sqes() call. Isn't appending SQEs in the
>>> second io_submit_sqes() call then racy with the completion side (with the
>>> same problems that I tried to solve)?
>>
>> Exactly why it's not supported.
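To illustrate the point from the liburing side, a rough (untested) sketch:
a link chain is both built and resolved within a single submission, so it
can never be left half-built for a later io_uring_submit().

#include <liburing.h>

/* rough, untested sketch: both SQEs are filled in the SQ ring and then
 * committed by a single io_uring_submit(), i.e. one SQ tail update, so
 * the IOSQE_IO_LINK on the first SQE is resolved within that one
 * submission and can't dangle into the next call. */
static int submit_linked_pair(struct io_uring *ring, int fd,
			      void *buf1, void *buf2, unsigned len)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -1;	/* SQ ring full, submit and retry */
	io_uring_prep_read(sqe, fd, buf1, len, 0);
	/* "link the *next* SQE to this one" */
	io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -1;	/* SQ ring full */
	io_uring_prep_read(sqe, fd, buf2, len, len);

	return io_uring_submit(ring);
}

With SQPOLL it's the same picture, just that the tail store is what makes
the whole batch visible to the sq thread at once (plus a wakeup if it went
to sleep).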
>
> Thank you for this detailed explanation. I now understand the design
> decision with the SQE linking much better, and why delayed linking of
> SQEs introduces a coupling between the completion and submission side
> that is undesirable.

You're welcome, hope we'll get the API into shape in no time.

-- 
Pavel Begunkov