Re: [PATCH 0/4] implement vectored registered buffers for sendzc

Jens Axboe <axboe@xxxxxxxxx> · Thu, 24 Oct 2024 16:22:18 -0600

On 10/24/24 4:14 PM, Pavel Begunkov wrote:
> On 10/24/24 20:56, Jens Axboe wrote:
>> On 10/24/24 12:13 PM, Pavel Begunkov wrote:
>>> On 10/24/24 19:00, Jens Axboe wrote:
>>>> On 10/24/24 11:56 AM, Pavel Begunkov wrote:
>>>>> On 10/24/24 18:19, Jens Axboe wrote:
>>>>>> On 10/24/24 10:06 AM, Pavel Begunkov wrote:
>>>>>>> On 10/24/24 16:45, Jens Axboe wrote:
>>> ...
>>>>>>> Seems like you're agreeing but then stating the opposite, there
>>>>>>> is some confusion. I'm saying that IMHO the right API wise way
>>>>>>> is resolving an imu at issue time, just like it's done for fixed
>>>>>>> files, and what your recent series did for send zc.
>>>>>>
>>>>>> Yeah early morning confusion I guess. And I do agree in principle,
>>>>>> though for registered buffers, those have to be registered upfront
>>>>>> anyway, so no confusion possible with prep vs issue there. For provided
>>>>>> buffers, it only matters for the legacy ones, which generally should not
>>>>>> be used. Doesn't change the fact that you're technically correct, the
>>>>>> right time to resolve them would be at issue time.
>>>>>
>>>>> I'm talking about sendmsg with iovec. Registered buffers should
>>>>> be registered upfront, that's right, but iovec should be copied
>>>>> at prep, and finally resolved into bvecs incl the imu/buffer lookup
>>>>> at the issue time. And those are two different points in time,
>>>>> maybe because of links, draining or anything else. And if they
>>>>> should be at different moments, there is no way to do it while
>>>>> copying iovec.
>>>>
>>>> Oh I totally follow, the incremental approach would only work if it can
>>>> be done at prep time. If at issue time, then it has to turn an existing
>>>> iovec array into the appropriate bvec array. And that's where you'd have
>>>> to do some clever bits to avoid holding both a full bvec and iovec array
>>>> in memory, which would be pretty wasteful/inefficient. If done at issue
>>>
>>> Why would it be wasteful and inefficient? No more than jumping
>>> through that incremental infra for each chunk, doubling the size
>>> of the array / reallocating / memcpy'ing it, instead of a tight
>>> loop doing the entire conversion.
>>
>> Because it would prevent importing one iovec at a time, then turning
>> it into the desired bvec. That's one loop instead of two. You would have
>> the space upfront, there should be no need to realloc+memcpy. And then
>> there's the space concern, where the initial import is an iovec, and
>> then you need a bvec. For 64-bit that's fine as they take up the same
>> amount of space,
> 
> That's not true, each iov can produce multiple bvec entries so
> iovs might get overwritten if you do it the simplest way.

What part isn't true? Yeah one iovec can turn into multiple bvec
segments, the provided send zc stuff I sent does deal with that. So yeah
it's not necessarily a 1:1 mapping, and even if they have the same size,
you may need more elements on the bvec side.

Doesn't change the fact that you can loop once and do it. If you need to
expand the bvec size, that would be a realloc+copy. But that part is
true even if you first import all iovecs, and then iterate them to map
the bvecs. Unless you do some upfront tracking to know how many elements
you need, but that would seem overly convoluted. With caching, the
expansion should be a rare occurrence outside of the initial import into
a new region.
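
To make the "loop once" point concrete, here is a rough userspace sketch,
not the kernel code. The names struct seg, struct seg_array, seg_push() and
iov_to_segs() are made up for illustration, SEG_SIZE is an assumed 4k page
size, and struct seg only stands in for a bio_vec. Each iovec is split at
page boundaries as it is consumed, and the output array only grows when it
runs out of room:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/uio.h>

#define SEG_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical stand-in for a bio_vec: one page-bounded segment. */
struct seg {
	uintptr_t page;		/* page-aligned base address */
	unsigned int len;
	unsigned int off;
};

struct seg_array {
	struct seg *segs;
	size_t nr;
	size_t cap;
};

/* Grow only when full; with a cached array this should be rare. */
static int seg_push(struct seg_array *a, struct seg s)
{
	if (a->nr == a->cap) {
		size_t ncap = a->cap ? a->cap * 2 : 8;
		struct seg *n = realloc(a->segs, ncap * sizeof(*n));

		if (!n)
			return -1;
		a->segs = n;
		a->cap = ncap;
	}
	a->segs[a->nr++] = s;
	return 0;
}

/* Single pass: one iovec may expand into several segments. */
static int iov_to_segs(const struct iovec *iov, size_t nr_iov,
		       struct seg_array *out)
{
	assert(out && (iov || !nr_iov));

	for (size_t i = 0; i < nr_iov; i++) {
		uintptr_t addr = (uintptr_t)iov[i].iov_base;
		size_t left = iov[i].iov_len;

		while (left) {
			unsigned int off = addr & (SEG_SIZE - 1);
			size_t len = SEG_SIZE - off;

			if (len > left)
				len = left;
			struct seg s = { addr - off, (unsigned int)len, off };
			if (seg_push(out, s))
				return -1;
			addr += len;
			left -= len;
		}
	}
	return 0;
}
```

The realloc in seg_push() is the realloc+copy case mentioned above. It is
needed whether the segments are produced while importing or in a second pass
over a fully imported iovec array, unless the final count is known upfront.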

>> but for 32-bit it'd make incremental importing from a
>> stable iovec to a bvec array a bit more tricky (and would need realloc,
>> unless you over-alloc'ed for the iovec array upfront).
> 
> And that's not true, you can still well do it in place if
> iovec is placed right in the memory, which I explicitly
> noted there are simple enough ways to do it in place
> without extra reallocs.

I don't think anything stated there is untrue, just saying it's a bit
more tricky. Which is certainly true, if it's the same memory region and
there are overlaps. But let's just see the code for it, much easier to
discuss over those parts rather than pontificate about hypotheticals :-)
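
As a talking point only, here is one way an in-place conversion could be laid
out in userspace C. This is an assumption on my part, not necessarily what
Pavel has in mind and not kernel code. union slot, import_tail() and
convert_in_place() are hypothetical names, and the union sidesteps the 32-bit
size mismatch by making every slot big enough for either type. It also does
the upfront segment count, so it trades the realloc for an extra pass. The
iovecs sit at the tail of the array and segments are written from the front:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/uio.h>

#define SEG_SIZE 4096u	/* assumed page size for this sketch */

struct seg { uintptr_t page; unsigned int len; unsigned int off; };

/* One slot holds either representation (sketch-only simplification). */
union slot { struct iovec iov; struct seg seg; };

/* Number of page-bounded segments one iovec expands into. */
static size_t segs_for(const struct iovec *v)
{
	uintptr_t off = (uintptr_t)v->iov_base & (SEG_SIZE - 1);

	if (!v->iov_len)
		return 0;
	return (off + v->iov_len + SEG_SIZE - 1) / SEG_SIZE;
}

/* Size the array for the final segment count, iovecs copied to the tail. */
static union slot *import_tail(const struct iovec *src, size_t nr_iov,
			       size_t *cap)
{
	size_t n = 0;

	for (size_t i = 0; i < nr_iov; i++) {
		size_t s = segs_for(&src[i]);

		n += s ? s : 1;	/* reserve at least one slot per iovec */
	}
	union slot *slots = calloc(n ? n : 1, sizeof(*slots));
	if (!slots)
		return NULL;
	for (size_t i = 0; i < nr_iov; i++)
		slots[n - nr_iov + i].iov = src[i];
	*cap = n;
	return slots;
}

/* Convert front to back; the write cursor never passes the read cursor. */
static size_t convert_in_place(union slot *slots, size_t cap, size_t nr_iov)
{
	size_t w = 0;

	for (size_t r = cap - nr_iov; r < cap; r++) {
		struct iovec cur = slots[r].iov; /* copy before overwriting */
		uintptr_t addr = (uintptr_t)cur.iov_base;
		size_t left = cur.iov_len;

		while (left) {
			unsigned int off = addr & (SEG_SIZE - 1);
			size_t len = SEG_SIZE - off;

			if (len > left)
				len = left;
			assert(w <= r);	/* unread iovecs stay intact */
			slots[w++].seg = (struct seg){ addr - off,
						       (unsigned int)len, off };
			addr += len;
			left -= len;
		}
	}
	return w;
}
```

The reason this holds together is that every unread iovec has at least one
slot reserved for it, so the segments written so far can never reach past the
iovec currently being converted. With differently sized elements, as on
32-bit, the same idea needs byte offsets rather than slot indexes, which is
the "a bit more tricky" part.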

-- 
Jens Axboe