Re: [RFC PATCH 00/13] fuse uring communication

On 3/21/23 10:35, Amir Goldstein wrote:
On Tue, Mar 21, 2023 at 3:11 AM Bernd Schubert <bschubert@xxxxxxx> wrote:

This adds support for uring communication between the kernel and
the userspace daemon using the IORING_OP_URING_CMD opcode. The basic
approach was taken from ublk. The patches are in RFC state -
I'm not sure about all decisions and some open questions are marked
with XXX.

The userspace side has to send ioctl(s) to configure the ring queue(s)
and has the choice to configure either exactly one ring or one
ring per core. If there are use cases, we can also consider
allowing a different number of rings - the ioctl configuration
option is rather generic (number of queues).
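
For illustration, the userspace side could end up doing something
roughly like this (struct name, field layout and ioctl number below
are placeholders, not the actual interface from the patches):

/* Hypothetical sketch of the queue configuration ioctl - the real
 * struct/ioctl names and layout are defined by the patch series. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

struct fuse_ring_config {            /* placeholder name */
	uint32_t nr_queues;          /* 1, or one per core */
	uint32_t queue_depth;        /* ring entries per queue */
};

/* placeholder ioctl number, 229 is the existing fuse-dev ioctl magic */
#define FUSE_DEV_IOC_RING_CFG _IOW(229, 64, struct fuse_ring_config)

static int fuse_ring_setup(int fuse_fd, uint32_t nr_queues, uint32_t depth)
{
	struct fuse_ring_config cfg = {
		.nr_queues   = nr_queues,
		.queue_depth = depth,
	};

	return ioctl(fuse_fd, FUSE_DEV_IOC_RING_CFG, &cfg);
}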

Right now a queue lock is taken for any ring entry state change,
mostly to correctly handle unmount/daemon-stop. In fact,
correctly stopping the ring took most of the development
time - new corner cases kept coming up.
I have run dozens of xfstests cycles; in earlier versions
I had once seen a warning about the ring start_stop
mutex being in the wrong state - probably another stop issue,
but I have not been able to track it down yet.
Regarding the queue lock - I still need to do profiling, but
my assumption is that it should not matter for the
one-ring-per-core configuration. For the single-ring config
option lock contention might come up, but I see that
configuration mostly for development only.
Adding more complexity and protecting ring entries with
their own locks can be done later.
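
To illustrate the locking idea (this is a sketch only, all names below
are made up and not the actual patch code):

/* Sketch: every ring-entry state transition happens under the
 * per-queue spinlock, so unmount/daemon-stop sees a consistent state. */
#include <linux/spinlock.h>
#include <linux/list.h>

struct fuse_ring_ent {
	struct list_head list;
	int state;               /* e.g. available / in userspace / stopped */
};

struct fuse_ring_queue {
	spinlock_t lock;         /* protects all entry state changes */
	struct list_head avail;
};

static void fuse_ring_ent_set_state(struct fuse_ring_queue *queue,
				    struct fuse_ring_ent *ent, int state)
{
	spin_lock(&queue->lock);
	ent->state = state;
	spin_unlock(&queue->lock);
}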

The current code also keeps the fuse request allocation; initially
I only had that for background requests when the ring queue
didn't have free entries anymore. The allocation is kept
to reduce initial complexity, especially for ring stop.
The allocation-free mode can be added back later.

Right now the ring queue of the submitting core is always
used; especially for page-cached background requests
we might later consider also enqueueing on the queues of
other cores (when these are not busy, of course).
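
A rough sketch of that queue-selection policy (struct and function
names are placeholders):

/* Sketch: pick the ring queue of the submitting core; with a single
 * queue everything falls back to queue 0. */
#include <linux/sched.h>

struct fuse_ring {
	unsigned int nr_queues;      /* 1, or one per core */
};

static unsigned int fuse_ring_select_qid(struct fuse_ring *ring)
{
	if (ring->nr_queues > 1)
		return task_cpu(current) % ring->nr_queues;

	return 0;
}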

Splice/zero-copy is not supported yet; all requests go
through the shared-memory queue entry buffer. I am also
following the splice and ublk/zero-copy discussions and will
look into these options in the next days/weeks.
To have that buffer allocated on the right NUMA node,
a vmalloc is done per ring queue, on the NUMA node the
userspace daemon side asks for.
My assumption is that the mmap offset parameter will be
part of a debate and I'm curious what others think about
that approach.
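
A minimal sketch of that per-queue allocation, assuming a hypothetical
per-queue buffer struct (not the actual patch code):

/* Sketch: allocate the shared request buffer of one ring queue on the
 * NUMA node the daemon asked for; the buffer is later handed to
 * userspace via mmap. Names are placeholders. */
#include <linux/vmalloc.h>
#include <linux/types.h>
#include <linux/errno.h>

struct fuse_ring_queue_buf {
	void *buf;
	size_t buf_size;
};

static int fuse_ring_queue_alloc_buf(struct fuse_ring_queue_buf *q,
				     size_t size, int numa_node)
{
	q->buf = vmalloc_node(size, numa_node);
	if (!q->buf)
		return -ENOMEM;

	q->buf_size = size;
	return 0;
}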

Benchmarking and tuning are on my agenda for the next
days. For now I only have xfstests results - most longer-running
tests were running at about 2x, but somehow I lost that
when I cleaned up the patches for submission.
My development VM/kernel has all sanitizers enabled -
hard to profile what happened. Performance
results with profiling will be submitted in a few days.

When posting those benchmarks and with future RFC postings,
it would be useful for people reading this introduction for the
first time to explicitly state the motivation of your work, which
can currently only be inferred from the mention of "benchmarks".

I think it would also be useful to link to prior work (ZUFS, fuse2)
and mention the current FUSE performance issues related to
context switches and cache line bouncing that were discussed in
those threads.

Oh yes, sure - I entirely forgot to mention the motivation and will do so in the
next patch round. You don't happen to have these links? I know that there were
several ZUFS threads, but I don't remember discussions about cache lines - maybe
I missed them. I can try to read through the old threads in case you don't have
them.

Our own motivation for the ring basically comes from atomic-open benchmarks,
which gave totally confusing results in multi-threaded libfuse mode - fewer
requests caused lower IOPs - while switching to single-threaded mode then gave
the expected IOPs increase. Part of it was due to a libfuse issue - constant
thread destruction/creation due to min_idle_threads - but another part can be
explained by thread switching alone.

When I added (slight) spinning in fuse_dev_do_read(), the hard/impossible part
was to avoid letting multiple threads spin - even with a single-threaded
application creating/deleting files (like bonnie++), multiple libfuse threads
started to spin for no good reason. So spinning resulted in much improved
latency, but high CPU usage, because multiple threads were spinning. I will add
those explanations to the next patch set.
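
Just to illustrate what I mean by spinning (a rough sketch only, not the
code I actually tested; everything except fuse_dev_do_read() is made up):

/* Sketch: before sleeping in the request-wait path (fuse_dev_do_read()),
 * poll for pending work for a bounded number of iterations, and only
 * allow a single spinner so that idle threads don't burn CPUs. */
#include <linux/atomic.h>
#include <linux/processor.h>

static bool fuse_spin_for_request(atomic_t *spinner,
				  bool (*have_work)(void *), void *arg,
				  unsigned int max_spins)
{
	unsigned int i;
	bool found = false;

	/* only one spinner per queue, everyone else sleeps as before */
	if (atomic_cmpxchg(spinner, 0, 1) != 0)
		return false;

	for (i = 0; i < max_spins; i++) {
		if (have_work(arg)) {
			found = true;
			break;
		}
		cpu_relax();
	}

	atomic_set(spinner, 0);
	return found;
}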

Thanks,
Bernd



