On 6/12/24 16:56, Bernd Schubert wrote: > On 6/12/24 16:07, Miklos Szeredi wrote: >> On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@xxxxxxx> wrote: >> >>> I didn't do that yet, as we are going to use the ring buffer for requests, >>> i.e. the ring buffer immediately gets all the data from network, there is >>> no copy. Even if the ring buffer would get data from local disk - there >>> is no need to use a separate application buffer anymore. And with that >>> there is just no extra copy >> >> Let's just tackle this shared request buffer, as it seems to be a >> central part of your design. >> >> You say the shared buffer is used to immediately get the data from the >> network (or various other sources), which is completely viable. >> >> And then the kernel will do the copy from the shared buffer. Single copy, fine. >> >> But if the buffer wasn't shared? What would be the difference? >> Single copy also. >> >> Why is the shared buffer better? I mean it may even be worse due to >> cache aliasing issues on certain architectures. copy_to_user() / >> copy_from_user() are pretty darn efficient. > > Right now we have: > > - Application thread writes into the buffer, then calls io_uring_cmd_done > > I can try to do without mmap and set a pointer to the user buffer in the > 80B section of the SQE. I'm not sure if the application is allowed to > write into that buffer, possibly/probably we will be forced to use > io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that > anyway). My greatest fear here is that the extra task has performance > implications for sync requests. > > >> >> Why is it better to have that buffer managed by kernel? Being locked >> in memory (being unswappable) is probably a disadvantage as well. And >> if locking is required, it can be done on the user buffer. > > Well, let me try to give the buffer in the 80B section. > >> >> And there are all the setup and teardown complexities... > > If the buffer in the 80B section works setup becomes easier, mmap and > ioctls go away. Teardown, well, we still need the workaround as we need > to handle io_uring_cmd_done, but if you could live with that for the > instance, I would ask Jens or Pavel or Ming for help if we could solve > that in io-uring itself. > Is the ring workaround in fuse_dev_release() acceptable for you? Or do > you have any another idea about it? > >> Short update, I have this working for some time now with a hack patch that just adds in a user buffer (without removing mmap, it is just unused). Initially I thought that is a lot slower, but after removing all the kernel debug options perf loss is just around 5% and I think I can get back the remaining by having iov_iter_get_pages2() of the user buffer in the initialization (with additional code overhead). I hope to have new patches by mid of next week. I also want to get rid of the difference of buffer layout between uring and /dev/fuse as that can be troublesome for other changes like alignment. That might require an io-uring CQE128, though. Thanks, Bernd