Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring

Bernd Schubert <bernd.schubert@xxxxxxxxxxx> · Sat, 3 Aug 2024 01:03:15 +0200

On 6/12/24 16:56, Bernd Schubert wrote:
> On 6/12/24 16:07, Miklos Szeredi wrote:
>> On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@xxxxxxx> wrote:
>>
>>> I didn't do that yet, as we are going to use the ring buffer for requests,
>>> i.e. the ring buffer immediately gets all the data from network, there is
>>> no copy. Even if the ring buffer would get data from local disk - there
>>> is no need to use a separate application buffer anymore. And with that
>>> there is just no extra copy
>>
>> Let's just tackle this shared request buffer, as it seems to be a
>> central part of your design.
>>
>> You say the shared buffer is used to immediately get the data from the
>> network (or various other sources), which is completely viable.
>>
>> And then the kernel will do the copy from the shared buffer.  Single copy, fine.
>>
>> But if the buffer wasn't shared?  What would be the difference?
>> Single copy also.
>>
>> Why is the shared buffer better?  I mean it may even be worse due to
>> cache aliasing issues on certain architectures.  copy_to_user() /
>> copy_from_user() are pretty darn efficient.
> 
> Right now we have:
> 
> - Application thread writes into the buffer, then calls io_uring_cmd_done
> 
> I can try to do without mmap and set a pointer to the user buffer in the 
> 80B section of the SQE. I'm not sure if the application is allowed to 
> write into that buffer, possibly/probably we will be forced to use 
> io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that 
> anyway). My greatest fear here is that the extra task has performance 
> implications for sync requests.
> 
> 
>>
>> Why is it better to have that buffer managed by kernel?  Being locked
>> in memory (being unswappable) is probably a disadvantage as well.  And
>> if locking is required, it can be done on the user buffer.
> 
> Well, let me try to give the buffer in the 80B section.
> 
>>
>> And there are all the setup and teardown complexities...
> 
> If the buffer in the 80B section works setup becomes easier, mmap and 
> ioctls go away. Teardown, well, we still need the workaround as we need 
> to handle io_uring_cmd_done, but if you could live with that for the 
> instance, I would ask Jens or Pavel or Ming for help if we could solve 
> that in io-uring itself.
> Is the ring workaround in fuse_dev_release() acceptable for you? Or do 
> you have any another idea about it?
> 
>>

Short update, I have this working for some time now with a hack patch
that just adds in a user buffer (without removing mmap, it is just
unused). Initially I thought that is a lot slower, but after removing
all the kernel debug options perf loss is just around 5% and I think I
can get back the remaining by having iov_iter_get_pages2() of the user
buffer in the initialization (with additional code overhead).

I hope to have new patches by mid of next week. I also want to get rid
of the difference of buffer layout between uring and /dev/fuse as that
can be troublesome for other changes like alignment. That might require
an io-uring CQE128, though.

Thanks,
Bernd