Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg


 



On 18.11.20 at 20:50, Pavel Begunkov wrote:
> On 18/11/2020 16:57, Stefan Metzmacher wrote:
>> On 18.11.20 at 17:27, Stefan Metzmacher wrote:
>>> On 07.11.20 at 17:07, Pavel Begunkov wrote:
>>>> On 07/11/2020 16:02, Pavel Begunkov wrote:
>>>>> On 07/11/2020 13:46, Stefan Metzmacher wrote:
>>>>>> Hi Pavel,
>>>>>>
>>>>>>> We don't even allow msg_control that isn't plain data; that is rejected in __sys_{send,recv}msg_sock().
>>>>>>
>>>>>> Wouldn't it be better to remove these checks and allow msg_control?
>>>>>> For me it's a limitation that I would like to see removed.
>>>>>
>>>>> We can grab fs only in specific situations as you mentioned, by e.g.
>>>>> adding a switch(opcode) in io_prep_async_work(), but that's the easy
>>>>> part. Each msg_control type has to be dealt with one by one, as they do
>>>>> different things. And it's not a given that they ever require fs.
>>>>
>>>> BTW, Jens mentioned that there is a queued patch that allows plain
>>>> data msg_control. Are those not enough?
>>>
>>> You mean the PROTO_CMSG_DATA_ONLY check?
>>>
>>> It's not perfect, but better than nothing for a start.
>>
>> What I actually have in mind for my smbdirect socket driver [1]:
>>
>> - I have a pipe that got filled by IORING_OP_SPLICE
>> - The data in the pipe needs to be "spliced" into remote RDMA buffers,
>>   but I can't use IORING_OP_SPLICE again, because the RDMA buffer descriptor [2]
>>   array needs to be passed too.
>> - I'd like to use IORING_OP_SENDMSG with MSG_OOB and msg_control.
>>   msg_control would get the RDMA buffer descriptor array and the pipe fd.
> 
> If I get you right, you can't splice again because there is an RDMA header
> that should go before payload data. Is that correct?

No.

> So you would need to do like in the pseudo-code below
> 
> payload = pipe.get_buffers();
> iov[] = {&header, payload};
> sendmsg(iov);

That would be the TCP case. There I use IORING_OP_SENDMSG with MSG_MORE
followed by an IORING_OP_SPLICE in order to put the SMB2 headers before the
payload buffer, so that both end up back to back in the byte stream.
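
A minimal sketch of that TCP-style pattern, for illustration only. It swaps in
the synchronous send(MSG_MORE) and splice() syscalls for the io_uring opcodes,
and uses an AF_UNIX socketpair instead of a TCP connection so that it is
self-contained. The function names and the "HDR:"/"payload" data are
hypothetical stand-ins, not part of the smbdirect driver.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send the header with MSG_MORE, then splice payload_len bytes from the
 * pipe into the same socket so they follow the header in the stream. */
static ssize_t send_header_then_splice(int sock, const void *hdr,
                                       size_t hdr_len, int pipe_rd,
                                       size_t payload_len)
{
    size_t left = payload_len;

    assert(hdr != NULL);
    if (send(sock, hdr, hdr_len, MSG_MORE) != (ssize_t)hdr_len)
        return -1;

    while (left > 0) {
        ssize_t n = splice(pipe_rd, NULL, sock, NULL, left, 0);
        if (n <= 0)
            return -1;
        left -= (size_t)n;
    }
    return (ssize_t)(hdr_len + payload_len);
}

/* Hypothetical demo: fill a pipe, send header + spliced payload over a
 * socketpair, and read the combined stream back into out.
 * Returns the number of bytes received, or -1 on any failure. */
static ssize_t demo_roundtrip(char *out, size_t cap)
{
    static const char hdr[] = "HDR:", payload[] = "payload";
    const size_t hlen = strlen(hdr), plen = strlen(payload);
    const size_t want = hlen + plen;
    size_t got = 0;
    ssize_t total = -1;
    int sv[2], p[2];

    if (want > cap)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (pipe(p) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (write(p[1], payload, plen) == (ssize_t)plen &&
        send_header_then_splice(sv[0], hdr, hlen, p[0], plen) == (ssize_t)want) {
        /* The header and the spliced data may arrive in separate reads. */
        while (got < want) {
            ssize_t n = read(sv[1], out + got, want - got);
            if (n <= 0)
                break;
            got += (size_t)n;
        }
        if (got == want)
            total = (ssize_t)got;
    }

    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return total;
}
```

The receiver sees header and payload as one contiguous byte stream, which is
all the TCP case needs.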

With SMB-Direct (a transport for SMB over RDMA) there's basically a bi-directional
byte stream similar to TCP, but it uses RDMA_SEND PDUs, sent via ib_post_send(IB_WR_SEND)
on the sender and received via ib_post_recv() on the receiver.

But there are also out-of-band commands for direct data placement, RDMA_READ and RDMA_WRITE,
issued via ib_post_send(IB_WR_RDMA_READ) and ib_post_send(IB_WR_RDMA_WRITE).
These verbs require a descriptor for the remote memory:
1. a steering tag (a kind of temporary cookie/identifier for a memory registration on the remote peer),
2. an offset,
3. a length.
These are completely independent of the byte stream, but they use the same RDMA connection.
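
As a rough illustration, such a descriptor could be modelled as below. The
struct and function names are hypothetical, and the 16-byte little-endian
layout (offset, token, length) is an assumption about the wire format that
should be checked against the SMB2/SMB-Direct specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one remote memory descriptor. */
struct remote_mem_desc {
    uint64_t offset; /* offset within the remote memory registration */
    uint32_t token;  /* steering tag identifying the registration */
    uint32_t length; /* number of bytes to read or write */
};

/* Assumed wire size: 8 (offset) + 4 (token) + 4 (length) bytes. */
enum { REMOTE_MEM_DESC_WIRE_SIZE = 16 };

static void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

static void put_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Encode one descriptor in the assumed little-endian layout.
 * Returns the number of bytes written, or 0 if out is too small. */
static size_t remote_mem_desc_encode(const struct remote_mem_desc *d,
                                     uint8_t *out, size_t cap)
{
    assert(d != NULL && out != NULL);
    if (cap < REMOTE_MEM_DESC_WIRE_SIZE)
        return 0;
    put_le64(out, d->offset);
    put_le32(out + 8, d->token);
    put_le32(out + 12, d->length);
    return REMOTE_MEM_DESC_WIRE_SIZE;
}
```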

This presentation contains illustrations on pages 19, 20 and 22:
https://www.snia.org/sites/default/files/files2/files2/SDC2011/presentations/tuesday/TomTalpey_GregKramer_SMB%202-2_Over_RDMA.pdf

Typically the client registers one or more memory regions and transfers the descriptor(s) within
the native SMB2 protocol using the "stream" of the SMB-Direct transport.
The server then reads from or writes to the client's memory. In order to do that, the server
creates a temporary local memory registration, then passes both the local memory descriptor
and the remote memory descriptor to the raw RDMA_READ/WRITE verbs and tells the hardware to
transfer the memory.

What I need is a way to trigger these out-of-band transfers. The simple approach
would be for userspace to pass a buffer (iov) together with the remote memory descriptor
to the kernel. For now I use an ioctl() for that case.

But as io_uring doesn't support generic ioctls, my idea was to use sendmsg(MSG_OOB) instead
and pass the remote memory descriptors via msg_control and the buffer via msg_iov.

In order to avoid memory copies I'd like to use a pipe instead of a buffer (iovs),
so my idea would be to pass the pipe fd via an additional msg_control element
and use msg_iovlen=0, in order to simulate splice with additional metadata (that's only
needed at the local socket layer).
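
From userspace, building such a message might look roughly like the sketch
below. The cmsg level and type constants, the struct and the function name are
all hypothetical placeholders (no such ABI exists yet); only the CMSG_* macros
and struct msghdr are the standard socket API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical cmsg level and types, made up for this sketch. */
#define SOL_SMBDIRECT_HYP         0x7001
#define SMBD_CMSG_REMOTE_DESC_HYP 1 /* array of remote memory descriptors */
#define SMBD_CMSG_PIPE_FD_HYP     2 /* pipe fd acting as the data source */

/* Hypothetical descriptor: steering tag, offset and length. */
struct remote_mem_desc {
    uint64_t offset;
    uint32_t token;
    uint32_t length;
};

/* Fill msg with two control elements and no iovecs (msg_iovlen = 0).
 * cbuf must be suitably aligned for struct cmsghdr.
 * Returns 0 on success, -1 if cbuf is too small. */
static int build_oob_msg(struct msghdr *msg, void *cbuf, size_t cbuf_len,
                         const struct remote_mem_desc *descs, size_t ndescs,
                         int pipe_fd)
{
    const size_t desc_bytes = ndescs * sizeof(*descs);
    const size_t need = CMSG_SPACE(desc_bytes) + CMSG_SPACE(sizeof(int));
    struct cmsghdr *c;

    assert(msg != NULL && cbuf != NULL && descs != NULL);
    if (cbuf_len < need)
        return -1;

    memset(msg, 0, sizeof(*msg));
    memset(cbuf, 0, need); /* CMSG_NXTHDR expects zeroed space */
    msg->msg_control = cbuf;
    msg->msg_controllen = need;
    msg->msg_iov = NULL;   /* the data comes from the pipe instead */
    msg->msg_iovlen = 0;

    /* First element: the remote memory descriptor array. */
    c = CMSG_FIRSTHDR(msg);
    c->cmsg_level = SOL_SMBDIRECT_HYP;
    c->cmsg_type = SMBD_CMSG_REMOTE_DESC_HYP;
    c->cmsg_len = CMSG_LEN(desc_bytes);
    memcpy(CMSG_DATA(c), descs, desc_bytes);

    /* Second element: the pipe fd that holds the payload. */
    c = CMSG_NXTHDR(msg, c);
    if (c == NULL)
        return -1;
    c->cmsg_level = SOL_SMBDIRECT_HYP;
    c->cmsg_type = SMBD_CMSG_PIPE_FD_HYP;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &pipe_fd, sizeof(int));
    return 0;
}
```

The resulting msghdr would then be submitted with MSG_OOB, either through
sendmsg() or through an IORING_OP_SENDMSG request, assuming the socket layer
accepted these control messages.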

Do you understand this now, or is it still unclear?

metze


