On 18.11.20 at 20:50, Pavel Begunkov wrote:
> On 18/11/2020 16:57, Stefan Metzmacher wrote:
>> On 18.11.20 at 17:27, Stefan Metzmacher wrote:
>>> On 07.11.20 at 17:07, Pavel Begunkov wrote:
>>>> On 07/11/2020 16:02, Pavel Begunkov wrote:
>>>>> On 07/11/2020 13:46, Stefan Metzmacher wrote:
>>>>>> Hi Pavel,
>>>>>>
>>>>>>> We don't even allow not plain data msg_control, which is disallowed in __sys_{send,recv}msg_sock().
>>>>>>
>>>>>> Can't we better remove these checks and allow msg_control?
>>>>>> For me it's a limitation that I would like to be removed.
>>>>>
>>>>> We can grab fs only in specific situations as you mentioned, by e.g.
>>>>> adding a switch(opcode) in io_prep_async_work(), but that's the easy
>>>>> part. All msg_control should be dealt with one by one as they do different
>>>>> things. And it's not the fact that they ever require fs.
>>>>
>>>> BTW, Jens mentioned that there is a queued patch that allows plain
>>>> data msg_control. Are those not enough?
>>>
>>> You mean the PROTO_CMSG_DATA_ONLY check?
>>>
>>> It's not perfect, but better than nothing for a start.
>>
>> What I actually have in mind for my smbdirect socket driver [1]:
>>
>> - I have a pipe that got filled by IORING_OP_SPLICE
>> - The data in the pipe needs to be "spliced" into remote RDMA buffers,
>> but I can't use IORING_OP_SPLICE again, because the RDMA buffer descriptor [2]
>> array needs to be passed too.
>> - I'd like to use IORING_OP_SENDMSG with MSG_OOB and msg_control.
>> msg_control would get the RDMA buffer descriptor array and the pipe fd.
>
> If I get you right, you can't splice again because there is an RDMA header
> that should go before payload data. Is that correct?

No.

> So you would need to do like in the pseudo-code below
>
> payload = pipe.get_buffers();
> iov[] = {&header, payload};
> sendmsg(iov);

This would be for the TCP case. There I use IORING_OP_SENDMSG with MSG_MORE
followed by an IORING_OP_SPLICE in order to put the SMB2 headers before the
payload buffer, so that both end up directly after each other in the byte
stream.

With SMB-Direct (a transport for SMB over RDMA) there's basically a
bi-directional byte stream similar to TCP, but it uses RDMA_SEND PDUs, sent
via ib_post_send(IB_WR_SEND) on the sender and received via ib_post_recv()
on the receiver.

But there are also out of band commands to do direct data placement with
RDMA_READ and RDMA_WRITE, using ib_post_send(IB_WR_RDMA_READ) and
ib_post_send(IB_WR_RDMA_WRITE). These verbs require a descriptor for the
remote memory:

1. a steering tag (which is some kind of temporary cookie/identifier)
   for a memory registration on the remote peer,
2. offset,
3. length.

These are completely independent of the byte stream, but they use the same
RDMA connection.

This presentation contains illustrations on pages 19, 20 and 22:
https://www.snia.org/sites/default/files/files2/files2/SDC2011/presentations/tuesday/TomTalpey_GregKramer_SMB%202-2_Over_RDMA.pdf

Typically the client registers one or more memory regions and transfers the
descriptor(s) within the native SMB2 protocol using the "stream" of the
SMB-Direct transport. The server then reads from or writes to that client's
memory. In order to do that the server creates a temporary local memory
registration, then it needs to pass the local memory descriptor and also
the remote memory descriptor to the raw RDMA_READ/WRITE verbs and tell the
hardware to transfer the memory.

What I need is a way to trigger these out of band transfers. The simple
approach would be that userspace passes a buffer (iov) together with the
remote memory descriptor to the kernel. For now I use an ioctl() for that
case.

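To make the descriptor triple a bit more concrete, here is a minimal sketch
of how userspace might decode the descriptors it received from the client
before handing them to the kernel. It assumes the 16-byte little-endian
layout (offset, token, length) that the in-kernel cifs client uses for
struct smbd_buffer_descriptor_v1; the names remote_desc and
remote_desc_unpack() are made up for illustration and are not part of any
existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace view of one remote memory descriptor.
 * "token" is the steering tag of the remote memory registration. */
struct remote_desc {
    uint64_t offset;
    uint32_t token;
    uint32_t length;
};

/* Assumed wire size of one descriptor: 8 + 4 + 4 bytes. */
#define REMOTE_DESC_WIRE_SIZE 16

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t get_le64(const uint8_t *p)
{
    return (uint64_t)get_le32(p) | ((uint64_t)get_le32(p + 4) << 32);
}

/* Decode an array of little-endian descriptors from "in".
 * Returns the number of descriptors decoded, or 0 if the input length
 * is not a multiple of the descriptor size or "out" is too small. */
size_t remote_desc_unpack(const uint8_t *in, size_t in_len,
                          struct remote_desc *out, size_t max)
{
    size_t n = in_len / REMOTE_DESC_WIRE_SIZE;
    size_t i;

    if (in_len % REMOTE_DESC_WIRE_SIZE != 0 || n > max)
        return 0;

    for (i = 0; i < n; i++) {
        const uint8_t *p = in + i * REMOTE_DESC_WIRE_SIZE;

        out[i].offset = get_le64(p);
        out[i].token  = get_le32(p + 8);
        out[i].length = get_le32(p + 12);
    }
    return n;
}
```

The decoded array is what would be passed to the kernel together with the
local buffer, today via the ioctl().
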
But as io_uring doesn't support generic ioctls, my idea was to use
sendmsg(MSG_OOB) instead and pass the remote memory descriptors via
msg_control and the buffer via msg_iov.

In order to avoid memory copies I'd like to use a pipe instead of a buffer
(iovs), so my idea would be to pass the pipe fd via an additional
msg_control element and use msg_iovlen=0, in order to simulate splice with
additional metadata (that's only needed at the local socket layer).

Do you understand this now, or is it still unclear?

metze
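
For illustration, a rough sketch of how the msghdr for such a request might
be built in userspace. Everything prefixed with HYP_ or hyp_ (the cmsg
level, the two cmsg types, the descriptor struct and the helper itself) is
hypothetical and only stands in for whatever the smbdirect socket driver
would really define; only the CMSG_* macros and struct msghdr are the
standard socket API. The helper just fills in the structures, it does not
call sendmsg().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical values, not defined by any kernel header. */
#define HYP_SOL_SMBDIRECT         0x534d /* made-up cmsg level */
#define HYP_SMBD_CMSG_REMOTE_DESC 1      /* array of remote descriptors */
#define HYP_SMBD_CMSG_PIPE_FD     2      /* pipe fd holding the payload */

/* Hypothetical descriptor: steering tag (token), offset, length. */
struct hyp_remote_desc {
    uint64_t offset;
    uint32_t token;
    uint32_t length;
};

/* Fill "msg" so that msg_control carries the remote descriptors plus the
 * pipe fd, with no iovecs at all (msg_iovlen = 0).
 * "ctrl" must be suitably aligned for struct cmsghdr.
 * Returns 0 on success, -1 if "ctrl" is too small. */
int hyp_build_rdma_msg(struct msghdr *msg, void *ctrl, size_t ctrl_len,
                       const struct hyp_remote_desc *descs, size_t ndescs,
                       int pipe_fd)
{
    size_t desc_bytes = ndescs * sizeof(*descs);
    size_t need = CMSG_SPACE(desc_bytes) + CMSG_SPACE(sizeof(int));
    struct cmsghdr *c;

    if (ctrl_len < need)
        return -1;

    memset(msg, 0, sizeof(*msg));   /* msg_iov = NULL, msg_iovlen = 0 */
    memset(ctrl, 0, need);          /* CMSG_NXTHDR wants zeroed space */
    msg->msg_control = ctrl;
    msg->msg_controllen = need;

    /* First element: the remote memory descriptor array. */
    c = CMSG_FIRSTHDR(msg);
    c->cmsg_level = HYP_SOL_SMBDIRECT;
    c->cmsg_type = HYP_SMBD_CMSG_REMOTE_DESC;
    c->cmsg_len = CMSG_LEN(desc_bytes);
    memcpy(CMSG_DATA(c), descs, desc_bytes);

    /* Second element: the pipe fd that replaces the iovec payload. */
    c = CMSG_NXTHDR(msg, c);
    if (c == NULL)
        return -1;
    c->cmsg_level = HYP_SOL_SMBDIRECT;
    c->cmsg_type = HYP_SMBD_CMSG_PIPE_FD;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &pipe_fd, sizeof(int));

    return 0;
}
```

The resulting msghdr would then be submitted with MSG_OOB, either via
sendmsg() or via IORING_OP_SENDMSG, provided the msg_control checks allow
it.
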