Re: [PATCH 7/8] io_uring: support readv/writev with fixed buffers

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 17/11/2020 22:59, Bijan Mottahedeh wrote:
>>> Support readv/writev with fixed buffers, and introduce IOSQE_FIXED_BUFFER,
>>> consistent with fixed files.
>>
>> I don't like it at all, see issues below. The actual implementation would
>> be much uglier.
>>
>> I propose you to split the series and push separately. Your first 6 patches
>> first, I don't have conceptual objections to them. Then registration sharing
>> (I still need to look it up). And then we can return to this, if you're not
>> yet convinced.
> 
> Ok.  The sharing patch is actually the highest priority for us so it'd be great to know if you think it's in the right direction.
> 
> Should I submit them as they are or address your fixed_file_ref comments in Patch 4/8 as well?  Would I need your prep patch beforehand?

I remembered one more thing that I need to do for your patches to work.
I'll ping you afterwards

> 
>>> +static ssize_t io_import_iovec_fixed(int rw, struct io_kiocb *req, void *buf,
>>> +                     unsigned segs, unsigned fast_segs,
>>> +                     struct iovec **iovec,
>>> +                     struct iov_iter *iter)
>>> +{
>>> +    struct io_ring_ctx *ctx = req->ctx;
>>> +    struct io_mapped_ubuf *imu;
>>> +    struct iovec *iov;
>>> +    u16 index, buf_index;
>>> +    ssize_t ret;
>>> +    unsigned long seg;
>>> +
>>> +    if (unlikely(!ctx->buf_data))
>>> +        return -EFAULT;
>>> +
>>> +    ret = import_iovec(rw, buf, segs, fast_segs, iovec, iter);
>>
>> Did you test it? import_iovec() does access_ok() against each iov_base,
>> which in your case are an index.
> 
> I used liburing:test/file-{register,update} as models for the equivalent buffer tests and they seem to work.  I can send out the tests and the liburing changes if you want.

Hmm, seems access_ok() is no-op for many architectures.
Still IIRC it's not portable.

> 
> The fixed io test registers an empty iov table first:
> 
>     ret = io_uring_register_buffers(ring, iovs, UIO_MAXIOV);
> 
> It next updates the table with two actual buffers at offset 768:
> 
>         ret = io_uring_register_buffers_update(ring, 768, ups, 2);
> 
> It later uses the buffer at index 768 for writev similar to the file-register test (IOSQE_FIXED_BUFFER instead of IOSQE_FIXED_FILE):
> 
>         iovs[768].iov_base = (void *)768;
>         iovs[768].iov_len = pagesize;
> 
> 
>         io_uring_prep_writev(sqe, fd, iovs, 1, 0);
>         sqe->flags |= IOSQE_FIXED_BUFFER;
> 
>         ret = io_uring_submit(ring);
> 
> Below is the iovec returned from
> 
> io_import_iovec_fixed()
> -> io_import_vec()
> 
> {iov_base = 0x300 <dm_early_create+412>, iov_len = 4096}
> 
>>> +    if (ret < 0)
>>> +        return ret;
>>> +
>>> +    iov = (struct iovec *)iter->iov;
>>> +
>>> +    for (seg = 0; seg < iter->nr_segs; seg++) {
>>> +        buf_index = *(u16 *)(&iov[seg].iov_base);
>>
>> That's ugly, and also not consistent with rw_fixed, because iov_base is
>> used to calculate offset.
> 
> Would offset be applicable when using readv/writev?

Not sure what you mean. You can register a huge chunk of memory,
but with buf_index alone you can't select a subchunk. E.g.

void *buf = alloc(4G);
idx = io_uring_register_buf(buf);
uring_read_fixed(reg_buf=idx, 
		data=buf+100, // offset 100
		size=sz);

This writes [buf+100, buf+100+sz]. Without passing @buf you
wouldn't be able to specify an offset. Alternatively, the
API could have been accepting an offset directly, but this
option was chosen.

There is no such a problem with non-registered versions, it
just writes/reads all iov.

> 
> My thinkig was that for those cases, each iovec should be used exactly as registered.

It may be confusing, but read/write_fixed use iov_base to calculate offset.
i.e.

imu = ctx->bufs[req->buffer_index];
if (iov_base not in range(imu->original_addr, imu->len))
	fail;
offset = iov_base - imu->original_addr;

> 
>>
>>> +        if (unlikely(buf_index < 0 || buf_index >= ctx->nr_user_bufs))
>>> +            return -EFAULT;
>>> +
>>> +        index = array_index_nospec(buf_index, ctx->nr_user_bufs);
>>> +        imu = io_buf_from_index(ctx, index);
>>> +        if (!imu->ubuf || !imu->len)
>>> +            return -EFAULT;
>>> +        if (iov[seg].iov_len > imu->len)
>>> +            return -EFAULT;
>>> +
>>> +        iov[seg].iov_base = (void *)imu->ubuf;
>>
>> Nope, that's not different from non registered version.
>> What import_fixed actually do is setting up the iter argument to point
>> to a bvec (a vector of struct page *).
> 
> So in fact, the buffers end up being pinned again because they are not being as bvecs?

right, it just passes iov with userspace virtual addresses in your case
and layer below don't know that they're pinned. And as they're virtual
in most cases they have to be translated to physical (that's solved by
having a vector of pages).

> 
>>
>> So it either would need to keep a vector of bvecs, that's a vector of vectors,
>> that's not supported by iter, etc., so you'll also need to iterate over them
>> in io_read/write and so on. Or flat 2D structure into 1D, but that's still ugly.
> 
> So you're saying there's no clean way to create a readv/writev + fixed buffers API?  It would've been nice to have a consistent API between files and buffers.

I guess you can register an iov as a single bvec by flatting out the structure
to a single vector (bvec). Though, we'd need to drop an assumption that all
but first and last entries are page sized.

Or do that online with extra overhead + allocations.

> 
> 
>>> @@ -5692,7 +5743,7 @@ static int io_timeout_remove_prep(struct io_kiocb *req,
>>>   {
>>>       if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>>>           return -EINVAL;
>>> -    if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
>>> +    if (unlikely(req->flags & (REQ_F_FIXED_RSRC | REQ_F_BUFFER_SELECT)))
>>
>> Why it's here?
>>
>> #define REQ_F_FIXED_RSRC    (REQ_F_FIXED_FILE | REQ_F_FIXED_BUFFER)
>> So, why do you | with REQ_F_BUFFER_SELECT again here?
> 
> I thought to group fixed files/buffers but distinguish them from selected buffers.

Ah, I've got this one wrong.

> 
>>> @@ -87,6 +88,8 @@ enum {
>>>   #define IOSQE_ASYNC        (1U << IOSQE_ASYNC_BIT)
>>>   /* select buffer from sqe->buf_group */
>>>   #define IOSQE_BUFFER_SELECT    (1U << IOSQE_BUFFER_SELECT_BIT)
>>> +/* use fixed buffer set */
>>> +#define IOSQE_FIXED_BUFFER    (1U << IOSQE_FIXED_BUFFER_BIT)
>>
>> Unfortenatuly, we're almost out of flags bits -- it's a 1 byte
>> field and 6 bits are already taken. Let's not use it.

-- 
Pavel Begunkov



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux