And this got botched too, really not my morning. 2b posted with the right
subject line, and the right patches too...

On 2/25/20 9:19 AM, Jens Axboe wrote:
> With the poll retry based async IO patchset I posted last week, the one
> big missing thing for me was the ability to have automatic buffer
> selection. Generally, applications that handle tons of sockets like to
> poll for activity on them, then issue IO when they become ready. This is
> of course at least two system calls, but it also gives the application
> a chance to manage how many IO buffers it needs. With the io_uring based
> polled IO, the application need only issue an IORING_OP_RECV (for
> example, to receive socket data); it doesn't need to poll at all.
> However, this means that the application no longer has an opportune
> moment to select how many IO buffers to keep in flight; that number has
> to equal what it currently has pending.
>
> I had originally intended to use BPF to provide some means of buffer
> selection, but I had a hard time imagining how the lifetime of a buffer
> could be managed through that. I had a false start today, but Andres
> suggested a nifty approach that also solves the lifetime issue.
>
> Basically, the application registers buffers with the kernel. Each
> buffer is registered with a given group ID and buffer ID. The buffers
> are organized by group ID, and the application selects a buffer pool
> based on this group ID. One use case might be to group by size. There's
> an opcode for this, IORING_OP_PROVIDE_BUFFERS.
>
> IORING_OP_PROVIDE_BUFFERS takes a start address, the length of a buffer,
> and the number of buffers. It also provides a group ID with which these
> buffers should be associated, and a starting buffer ID. The buffers are
> then added, and the buffer ID is incremented by 1 for each buffer.
>
> With that, when doing the same IORING_OP_RECV, no buffer is passed in
> with the request.
> Instead, it's flagged with IOSQE_BUFFER_SELECT, and
> sqe->buf_group is filled in with a valid group ID. When the kernel can
> satisfy the receive, a buffer is selected from the specified group ID
> pool. If none are available, the IO is terminated with -ENOBUFS. On
> success, the buffer ID is passed back through the completion event
> (CQE). This tells the application what specific buffer was used.
>
> A buffer can be used only once. On completion, the application may
> choose to free it, or register it again with IORING_OP_PROVIDE_BUFFERS.
>
> Patches can also be found in the repo below:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-buf-select
>
> and they are obviously layered on top of the poll retry rework.
>
> Changes since v1:
> - Cleanup address space
> - Fix locking for async offload issue
> - Add lockdep annotation for uring_lock
> - Verify sqe fields on PROVIDE_BUFFERS prep
> - Fix send/recv kbuf leak on import failure
> - Fix send/recv error handling on -ENOBUFS
> - Change IORING_OP_PROVIDE_BUFFER to PROVIDE_BUFFERS, and allow multiple
>   contiguous buffers in one call

--
Jens Axboe
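For readers new to the proposal, the buffer-selection semantics described in the quoted mail can be sketched as a small user-space C model. It only mimics the stated behavior: buffers are grouped by group ID, buffer IDs start at a given value and increase by 1 per buffer, each buffer is handed out once, and an empty group yields -ENOBUFS. The names find_group, provide_buffers, select_buffer and the model_* types are hypothetical and invented for illustration. They are not the kernel's data structures or liburing's API. The order in which buffers are handed out is also the sketch's own choice.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the buffer groups described in the mail; these
 * are NOT the kernel's data structures or the liburing API. */
#define MAX_GROUPS 8
#define MAX_BUFS   64

struct model_buf {
    char *addr;
    size_t len;
    unsigned short bid;       /* buffer ID */
};

struct model_group {
    int in_use;
    unsigned short bgid;      /* group ID */
    int count;                /* buffers currently available */
    struct model_buf bufs[MAX_BUFS];
};

static struct model_group groups[MAX_GROUPS];

/* Look up the pool for a group ID, optionally creating it. */
static struct model_group *find_group(unsigned short bgid, int create)
{
    struct model_group *free_slot = NULL;

    for (int i = 0; i < MAX_GROUPS; i++) {
        if (groups[i].in_use && groups[i].bgid == bgid)
            return &groups[i];
        if (!groups[i].in_use && !free_slot)
            free_slot = &groups[i];
    }
    if (!create || !free_slot)
        return NULL;
    free_slot->in_use = 1;
    free_slot->bgid = bgid;
    free_slot->count = 0;
    return free_slot;
}

/* Models IORING_OP_PROVIDE_BUFFERS: 'nbufs' contiguous buffers of 'len'
 * bytes starting at 'addr' join group 'bgid'. IDs start at 'bid' and are
 * incremented by 1 for each buffer. Returns the number of buffers added. */
static int provide_buffers(char *addr, size_t len, int nbufs,
                           unsigned short bgid, unsigned short bid)
{
    struct model_group *g;

    if (nbufs <= 0 || len == 0)
        return -EINVAL;
    g = find_group(bgid, 1);
    if (!g || g->count + nbufs > MAX_BUFS)
        return -ENOMEM;
    for (int i = 0; i < nbufs; i++) {
        struct model_buf *b = &g->bufs[g->count++];
        b->addr = addr + (size_t)i * len;
        b->len = len;
        b->bid = (unsigned short)(bid + i);
    }
    return nbufs;
}

/* Models a receive flagged with IOSQE_BUFFER_SELECT and sqe->buf_group
 * set to 'bgid': one buffer is removed from the pool, so it is used only
 * once, and its ID is returned (the mail says the CQE reports it).
 * Returns -ENOBUFS if the group has no buffers left. This sketch hands
 * out the most recently provided buffer first; the order is arbitrary. */
static int select_buffer(unsigned short bgid, struct model_buf *out)
{
    struct model_group *g = find_group(bgid, 0);

    if (!g || g->count == 0)
        return -ENOBUFS;
    *out = g->bufs[--g->count];
    return out->bid;
}
```

For example, providing four 128-byte buffers to group 1 with a starting ID of 100 makes IDs 100 through 103 available. Four selections drain the group, a fifth returns -ENOBUFS, and providing one buffer again makes it selectable once more, which mirrors the "use once, then free or re-provide" rule in the mail.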