On Thu, Mar 23, 2023 at 1:18 PM Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:
>
> On 3/21/23 10:35, Amir Goldstein wrote:
> > On Tue, Mar 21, 2023 at 3:11 AM Bernd Schubert <bschubert@xxxxxxx> wrote:
> >>
> >> This adds support for uring communication between kernel and
> >> userspace daemon using the IORING_OP_URING_CMD opcode. The basic
> >> approach was taken from ublk. The patches are in RFC state -
> >> I'm not sure about all decisions and some questions are marked
> >> with XXX.
> >>
> >> The userspace side has to send ioctl(s) to configure ring queue(s),
> >> and it has the choice to configure exactly one ring or one
> >> ring per core. If there are use cases, we can also consider
> >> allowing a different number of rings - the ioctl configuration
> >> option is rather generic (number of queues).
> >>
> >> Right now a queue lock is taken for any ring entry state change,
> >> mostly to correctly handle unmount/daemon-stop. In fact,
> >> correctly stopping the ring took most of the development
> >> time - new corner cases kept coming up.
> >> I have run dozens of xfstest cycles; in earlier versions
> >> I once saw a warning about the ring start_stop mutex being
> >> in the wrong state - probably another stop issue,
> >> but I have not been able to track it down yet.
> >> Regarding the queue lock - I still need to do profiling, but
> >> my assumption is that it should not matter for the
> >> one-ring-per-core configuration. For the single-ring config
> >> option lock contention might come up, but I see this
> >> configuration mostly for development only.
> >> Adding more complexity and protecting ring entries with
> >> their own locks can be done later.
> >>
> >> Current code also keeps the fuse request allocation; initially
> >> I only had that for background requests when the ring queue
> >> didn't have free entries anymore. The allocation is done
> >> to reduce initial complexity, especially also for ring stop.
> >> The allocation-free mode can be added back later.
> >>
> >> Right now the ring queue of the submitting core is always used;
> >> especially for page-cached background requests
> >> we might consider later to also enqueue on other core queues
> >> (when these are not busy, of course).
> >>
> >> Splice/zero-copy is not supported yet, all requests go
> >> through the shared memory queue entry buffer. I'm also
> >> following the splice and ublk/zc copy discussions and will
> >> look into these options in the next days/weeks.
> >> To have that buffer allocated on the right numa node,
> >> a vmalloc is done per ring queue, on the numa node the
> >> userspace daemon side asks for.
> >> My assumption is that the mmap offset parameter will be
> >> part of a debate and I'm curious what others think about
> >> that approach.
> >>
> >> Benchmarking and tuning are on my agenda for the next
> >> days. For now I only have xfstest results - most longer
> >> running tests were running at about 2x, but somehow when
> >> I cleaned up the patches for submission I lost that.
> >> My development VM/kernel has all sanitizers enabled -
> >> hard to profile what happened. Performance
> >> results with profiling will be submitted in a few days.
> >
> > When posting those benchmarks and with future RFC postings,
> > it would be useful for people reading this introduction for the
> > first time to explicitly state the motivation of your work, which
> > can only be inferred from the mention of "benchmarks".
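
For readers coming to this thread cold, the ublk-style flow the cover
letter describes looks roughly like the sketch below from the daemon's
side. Everything FUSE-specific in it - FUSE_URING_REQ_FETCH,
struct fuse_ring_entry_cmd, the queue fd - is a made-up placeholder
rather than the RFC's actual ABI; only the liburing calls and the
generic SQE fields are real.

	/*
	 * Hypothetical daemon-side sketch of the ublk-style uring_cmd flow.
	 * FUSE_URING_REQ_FETCH, struct fuse_ring_entry_cmd and the queue fd
	 * semantics are illustrative assumptions, not the RFC's real ABI.
	 */
	#include <liburing.h>
	#include <string.h>

	#define FUSE_URING_REQ_FETCH 1          /* assumed command opcode */

	struct fuse_ring_entry_cmd {            /* assumed per-entry payload */
		unsigned int qid;               /* ring queue index */
		unsigned int tag;               /* entry index within the queue */
	};

	/* Hand one ring entry to the kernel; the SQE completes when a
	 * FUSE request has been placed into that entry's shared buffer. */
	static int fetch_request(struct io_uring *ring, int queue_fd,
				 struct fuse_ring_entry_cmd *ent)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			return -1;

		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_URING_CMD;
		sqe->fd     = queue_fd;                 /* fuse queue/device fd */
		sqe->cmd_op = FUSE_URING_REQ_FETCH;
		/* inline command area; payloads larger than 16 bytes would
		 * additionally need IORING_SETUP_SQE128 */
		memcpy(sqe->cmd, ent, sizeof(*ent));
		io_uring_sqe_set_data(sqe, ent);

		return 0;
	}

In the real series the daemon would additionally mmap() the per-queue
shared buffer and issue the queue-configuration ioctl mentioned above
before handing entries to the kernel; the sketch only shows the
per-entry submission.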
> >
> > I think it would also be useful to link to prior work (ZUFS, fuse2)
> > and mention the current FUSE performance issues related to
> > context switches and cache line bouncing that were discussed in
> > those threads.
>
> Oh yes sure, entirely forgot to mention the motivation. Will do in the
> next patch round. You don't have these links by any chance?

I know that I have this thread which you are on:
https://lore.kernel.org/linux-fsdevel/CAJfpegtjEoE7H8tayLaQHG9fRSBiVuaspnmPr2oQiOZXVB1+7g@xxxxxxxxxxxxxx/

> there were several zufs threads, but I don't remember discussions about
> cache line bouncing - maybe I had missed it. I can try to read through
> the old threads, in case you don't have it.

Miklos talked about it somewhere...

> Our own motivation for the ring basically comes from atomic-open
> benchmarks, which gave totally confusing benchmark results in
> multi-threaded libfuse mode - fewer requests caused lower IOPS -
> while switching to single-threaded mode then gave the expected IOPS
> increase. Part of it was due to a libfuse issue - persistent thread
> destruction/creation due to min_idle_threads - but another part can
> be explained by thread switching alone. When I added (slight)
> spinning in fuse_dev_do_read(), the hard/impossible part was to avoid
> letting multiple threads spin - even with a single-threaded
> application creating/deleting files (like bonnie++), multiple libfuse
> threads started to spin for no good reason. So spinning resulted in
> much improved latency, but high cpu usage, because multiple threads
> were spinning. I will add those explanations to the next patch set.

That would be great.

Thanks,
Amir.
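
As a side note on the spinning experiment above: the "only one thread
may spin" constraint can be pictured with a small generic sketch. This
is not the actual fuse_dev_do_read() change - the atomic flag, the spin
bound and the request_ready() callback are all illustrative
assumptions - but it shows the shape of the problem: only the thread
that wins the flag burns CPU polling, everyone else should fall back to
a blocking wait immediately.

	/*
	 * Generic illustration of bounded spinning with at most one spinner.
	 * Not the FUSE kernel code; names, the spin bound and request_ready()
	 * are illustrative assumptions.
	 */
	#include <stdatomic.h>
	#include <stdbool.h>

	static atomic_flag spinner_active = ATOMIC_FLAG_INIT;

	/* Returns true if a request showed up while spinning. */
	static bool try_spin_for_request(bool (*request_ready)(void))
	{
		bool found = false;

		/* Allow only one thread at a time to burn CPU here. */
		if (atomic_flag_test_and_set(&spinner_active))
			return false;   /* someone else is already spinning */

		for (int i = 0; i < 10000 && !found; i++)
			found = request_ready();

		atomic_flag_clear(&spinner_active);
		return found;
	}

	/*
	 * Callers that get false back should fall back to a blocking wait
	 * (in the kernel, a wait_event-style primitive).
	 */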