On Tue, Nov 7, 2023 at 12:44 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote: > > On 11/06, Willem de Bruijn wrote: > > > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that > > > > > > it somehow implies that I have an option of passing or not passing it > > > > > > for an individual system call. > > > > > > If we know that we're going to use dmabuf with the socket, maybe we > > > > > > should move this flag to the socket() syscall? > > > > > > > > > > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM); > > > > > > > > > > > > ? > > > > > > > > > > I think it should then be a setsockopt called before any data is > > > > > exchanged, with no change of modifying mode later. We generally use > > > > > setsockopts for the mode of a socket. This use of the protocol field > > > > > in socket() for setting a mode would be novel. Also, it might miss > > > > > passively opened connections, or be overly restrictive: one approach > > > > > for all accepted child sockets. > > > > > > > > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There > > > > are plenty of bits we can grab. But setsockopt works as well! > > > > > > To follow up: if we have this flag on a socket, not on a per-message > > > basis, can we also use recvmsg for the recycling part maybe? > > > > > > while (true) { > > > memset(msg, 0, ...); > > > > > > /* receive the tokens */ > > > ret = recvmsg(fd, &msg, 0); > > > > > > /* recycle the tokens from the above recvmsg() */ > > > ret = recvmsg(fd, &msg, MSG_RECYCLE); > > > } > > > > > > recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg > > > exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option > > > to recycle a range. > > > > > > Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)? > > > Or is it more confusing? > > > > It would have to be sendmsg, as recvmsg is a copy_to_user operation. > > > > > > I am not aware of any precedent in multiplexing the data stream and a > > control operation stream in this manner. It would also require adding > > a branch in the sendmsg hot path. > > Is it too much plumbing to copy_from_user msg_control deep in recvmsg > stack where we need it? Mixing in sendmsg is indeed ugly :-( I tried exactly the inverse of that when originally adding MSG_ZEROCOPY: to allow piggy-backing zerocopy completion notifications on sendmsg calls by writing to sendmsg msg_control on return to user. It required significant code churn, which the performance gains did not warrant. Doing so also breaks the simple rule that recv is for reading and send is for writing. > Regarding hot patch: aren't we already doing copy_to_user for the tokens in > this hot path, so having one extra condition shouldn't hurt too much? We're doing that in the optional cmsg handling of recvmsg, which is already a slow path (compared to the data read() itself). > > The memory is associated with the socket, freed when the socket is > > closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket > > state operation, for which setsockopt is the socket interface. > > > > Is your request purely a dislike, or is there some technical concern > > with BPF and setsockopt? > > It's mostly because I've been bitten too much by custom socket options that > are not really on/off/update-value operations: > > 29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen > 00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL > 9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE > d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE > > I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but > things tend to evolve and change. I see. I'm a bit concerned if we start limiting what we can do in sockets because of dependencies that BPF processing places on them. The use case for BPF [gs]etsockopt is limited to specific control mode calls. Would it make sense to just exclude calls like SO_DEVMEM_DONTNEED from this interpositioning? At a high level what we really want is a high rate metadata path from user to kernel. And there are no perfect solutions. From kernel to user we use the socket error queue for this. That was never intended for high event rate itself, dealing with ICMP errors and the like before timestamps and zerocopy notifications were added. If I squint hard enough I can see some prior art in mixing data and high rate state changes within the same channel in NIC descriptor queues, where some devices do this, e.g., { "insert encryption key", "send packet" }. But fundamentally I think we should keep the socket queues for data only.