On 16/04/2021 13:02, Christian Dietrich wrote: > Hello, > > we [1] are operating-system researchers from Germany and noticed the LWN > article about the combination of io_uring and eBPF programs. We think > that high-performance system-call interfaces in combination with > programability are the perfect substrate for system-call clustering. > Therefore, we had some intense discussions how the combination of eBPF > and io_uring could look like, and we would like to present our ideas > here. In a nutshell, our goal is to give applications the possibility to > perform complex I/O operations, or even their whole application logic, > with less kernel-userspace transitions. Hi Christian, Horst, Franz-Bernhard, It's nice to hear from you again. Sounds good, we need to have this discussion to refine interfaces. The model makes much sense, but let me outline some details and how it matches up from the kernel and implementation perspective > > First, we want to present our idea what the execution model should look > like in order to give the user program the possibility to program its > I/O behavior, before we give a few coarse examples how this execution > model can be used to structure complex I/O operations. > > Execution model > ================ > > Think of the execution model as a list of short high-level statements > about the capabilities that the io_uring+eBPF interface should provide. > For each point, in-detail design decisions will be necessary to > integrate it well with the kernel. > > 1. Program model > 1. **Loading**: An eBPF program, which is a collection of functions, > can be attached to an uring. Each uring has at most one loaded > eBPF program An ebpf program is basically a single entry point (i.e. function), but there is some flexibility on reusing them, iirc even dynamically, not link-time only. So, from the io_uring/bpf perspective, you load a bunch of bpf programs (i.e. separate functions) ... > 2. **Invocation**: Individual functions in that eBPF program can be > called. > 3. These eBPF programs have some kind of global state that remains > valid across invocations. ... where the state is provided from the userspace in a form of bpf map/array. So, it's not necessarily restricted by a single state, but you can have many per io_uring. Hence, if userspace provides a single piece of state, that should be as you described. > > 2. Invocation model > 1. An invocation to any function in the loaded eBPF program can > be queued as an SQE. > 2. If an invocation SQE is part of an SQE link chain, it is > (as usual) executed after the preceeding I/O operation has > finished. Right, just like any other linked request > 3. All invocations on the same uring are serialized in their > execution -- even if the SQEs are part of different link > chains that are otherwise executed in parallel. Don't think it's a good idea to limit the kernel to it, but we may try to find a userspace solution, or do it optionally. > 4. Invocations have run-to-completion semantics. > > 3. Access to previous SQEs/CQEs (in a link chain) > 1. In a link chain, the invocation can access the data of > (**all** or **the** -- subject to debate and application-scenario > requirements) previous SQEs and CQEs. > 2. All SQEs get a flag that indicates that no CQE is emitted to > userspace. However, the CQE is always generated and can be > consumed by following invocation-SQEs. There is my last concept, was thinking about because feeding CQE of only last linked request seems very limited to me. - let's allow to register multiple completion queues (CQs), for simplicity extra ones are limited to BPF programs and not exported to the userspace, may be lifted in the future - BPF requests are naturally allowed to reap CQEs from them and free to choose from which exactly CQ (including from multiple). So, that solves 2) without flags because CQEs are posted, but userspace can write BPF in such a way that they are reaped only by a specific BPF request/etc. and not affecting others. As well as effectively CQEs part of 1), without burden of extra allocations for CQEs, etc. That's flexible enough to enable users to do many interesting things, e.g. to project it onto graph-like events synchronisation models. This may need some other things, e.g. like allowing BPF requests to wait for >=N entries in CQ, but not of much concern. > 4. Data and control access within an invocation: Invocations can ... > 1. Access preceeding SQEs and CQEs of the same link chain (see above). It'd be a problem to allow them to access preceeding SQEs. They're never saved in requests, and by the time of execution previous requests are already gone. And SQE is 64B, even alloc + copy would add too much of overhead. > 2. Emit SQEs and CQEs to the same uring. All SQEs that are submitted > by one invocation follow each other directly in the submission > ring to enable SQE linking. fwiw, no need to use the SQ for from BPF submissions, especially since it may be used by the userspace in parallel and synchronises races is painful. > 3. Read and write the whole userspace memory of the uring-owning > process. That's almost completely on BPF. > We think that this execution model provides a few important design > considerations that make programming I/O feasible to application > developers: > > With the capability to access previous SQEs (3.1) and userspace memory > (4.3), invocations can directly manipulate the read and written results > of previous I/O operations. There, we could also avoid to provide buffer > allocation/deallocation mechanisms for invocations. While this might be > a point of debate for the Linux kernel, it is definitely something we > want to try in our academic setting. > > With the serialization guarantees (2.3 and 2.4), the user has not to > worry about synchronization within the eBPF program itself, and we > somehow mimic the (unluckily successful) execution model of JavaScript. > > With the possibility to submit linked pairs of I/O-SQE and > invocation-SQE (4.2), an invocation can submit several parallelly > running I/O requests. With the possibility to supress emitting CQEs to > userspace (3.2), long chains of computation+I/O can be executed without > disturbing userspace. > > By having global state (1.3) and the serialization guarantees, complex > synchronization (e.g. barrier) can be implemented within such eBPF > programs. Great job writing up requirements and use cases. Let's see how we can improve it. My main concerns is 1) to keep it flexible, so use-cases and ideas may bounce around, and allow multiple models to co-exists. 2) but also keep it slim enough, because wouldn't be of much use if it's faster to go through userspace and normal submissions. Let me know what you think > > > # Application Examples > > In order to make our imagined execution model more plastic, we want to > give a few examples how our proposed interface could be used to program > I/O. For this, we have drawn a pretty picture: > > https://i.imgur.com/HADDfrv.png > Also attached with this mail as an PDF. > > > > > > > ## Postprocess > > We submit two linked SQEs into the queue. The first one performs an I/O > operation and does not emit a CQE. The second is an invocation that can > access the previous SQE, the CQE, and the resulting buffer. After some > postprocessing, it emits a CQE to the completion queue. > > ## I/O -> Computation -> I/O Loops > > This is an extension to the postprocess example. Instead of always > emitting a CQE, the invocation can also decide to queue another pair of > I/O+invocation SQEs to perform another round of postprocessed I/O. > > An example for such an I/O loop could be to parse the header of some > network protocol. The loop parses incoming data until the header is > complete or until some unhandled corner case occurs. In these cases, the > loop can **escalate** to the userspace by emitting an CQE. > > > ## Fork-Join Parallelism > > As an invocation can emit multiple SQEs to its io_uring, it can issue > multiple I/O requests at once that are executed in parallel. In order to > wait for all requests to complete, it initializes a global counter in > its program state and links an invocation to each parallel requests that > counts completions. The last invocation continues the io-program: > > > int counter; > void start_par() { > sqe_barrier->opcode = IORING_OP_EBPF; > ... > sqe_barrier->addr = &finish_par; > counter = 10; > for (int i=0; i<10 ;i++) { > sqe_io->opcode = IORING_OP_READV... > sqe_io->flags = ... | IOSQE_IO_LINK | IOSEQ_IO_NOCQE | ... > ... > uring_submit_sqe(seq_io); > uring_submit_sqe(seq_barrier) > } > } > > void finish_par () { > if (--counter > 0) return; > > // All 10 packets are completed > uring_submit_cqe(....) > } > > > The synchronization with a global variable works as all eBPF programs > within one uring are serialized, regardless of their linking state. > > # Our next steps > > With Franz (Cc), we have a Master student who wants to explore this > execution model. We will first build a prototype in userspace on basis > of Pavel's eBPF patch series. For this prototype, we will put a proxy > before the completion queue to mimic the linked invocation-SQEs and the > data access to userspace. The proxy will execute normal C functions that > adhere to the rules of eBPF as callbacks to completed I/O SQEs. With > this prototype at hand, we can already use the execution model to > restructure a userspace program to use the new interface, before we work > on the necessary kernel extensions. > > Besides that, we would like to discuss our envisioned execution model on > this mailinglist. What do you think about the presented idea? > > Horst & chris > > [1] Christian Dietrich, https://osg.tuhh.de/People/dietrich/ > Horst Schirmeier, https://ess.cs.tu-dortmund.de/~hsc/ > -- Pavel Begunkov