[I removed the failing netapp/zufs CCs]
On 4/22/22 14:25, Miklos Szeredi wrote:
On Mon, 28 Mar 2022 at 15:21, Bernd Schubert <bschubert@xxxxxxx> wrote:
I would like to discuss the user thread wake up latency in
fuse_dev_do_read(). Profiling fuse shows there is room for improvement
regarding memory copies and splice. The basic profiling with flame graphs
didn't reveal, though, why fuse is so much
slower (with an overlay file system) than just accessing the underlying
file system directly and also didn't reveal why a single threaded fuse
uses less than 100% cpu, with the application on top of fuse also using
less than 100% cpu (simple bonnie++ runs with 1B files).
So I started to suspect the wait queues and indeed, keeping the thread
that reads the fuse device for work running for some time gives quite
some improvements.
Might be related: I experimented with wake_up_sync(), but that didn't meet
my expectations. See this thread:
https://lore.kernel.org/all/1638780405-38026-1-git-send-email-quic_pragalla@xxxxxxxxxxx/#r
Possibly fuse needs some wake up tweaks due to its special scheduling
requirements.
Thanks, I will look at that as well. I have a patch with spinning and
avoidance of thread wake-ups that is almost complete. In my (still
limited) testing it takes almost no additional CPU and improves
metadata / bonnie performance by a factor between ~1.9 and 3, depending
on which performance mode the CPU is in.
https://github.com/aakefbs/linux/commits/v5.17-fuse-scheduling3
Missing is just another option for wake-queue-size trigger and handling
of signals. Should be ready once I'm done with my other work.
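
To illustrate the spin-before-sleep idea in isolation, here is a rough
user-space analogue. It is not the code from the branch above: the names
work_signal, work_signal_post(), work_signal_wait() and the
SPIN_ITERATIONS value are made up for this sketch, and pthread
primitives stand in for the kernel wait queue.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spin budget; a real value would need tuning. */
#define SPIN_ITERATIONS 1000

/* Hypothetical stand-in for "there is a request to read". */
struct work_signal {
	atomic_bool pending;
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

enum wait_result { WAIT_SPUN, WAIT_SLEPT };

static void work_signal_init(struct work_signal *s)
{
	atomic_init(&s->pending, false);
	pthread_mutex_init(&s->lock, NULL);
	pthread_cond_init(&s->cond, NULL);
}

/* Producer: mark work as pending and wake a sleeping waiter, if any. */
static void work_signal_post(struct work_signal *s)
{
	pthread_mutex_lock(&s->lock);
	atomic_store(&s->pending, true);
	pthread_cond_signal(&s->cond);
	pthread_mutex_unlock(&s->lock);
}

/*
 * Consumer: poll for a bounded number of iterations first, so that work
 * arriving shortly after the previous request is picked up without a
 * sleep/wake-up cycle. Only if nothing arrives, block on the condvar.
 */
static enum wait_result work_signal_wait(struct work_signal *s)
{
	for (int i = 0; i < SPIN_ITERATIONS; i++) {
		bool expected = true;

		/* Claim the pending work without taking the lock. */
		if (atomic_compare_exchange_strong(&s->pending, &expected, false))
			return WAIT_SPUN;
	}

	/* Slow path: sleep until the producer posts. */
	pthread_mutex_lock(&s->lock);
	while (!atomic_load(&s->pending))
		pthread_cond_wait(&s->cond, &s->lock);
	atomic_store(&s->pending, false);
	pthread_mutex_unlock(&s->lock);
	return WAIT_SLEPT;
}
```

The trade-off is the usual one: a larger spin budget avoids more
wake-ups but burns CPU when no request follows.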
That being said, in the meantime I do believe a better approach would
be SQ/CQ like, similar to NVME or io_uring. In principle exactly as
io_uring, just the other way around - kernel fills in SQ, user space
consumes it and fills CQ. We also looked into zufs and your fuse2 branch
and were almost ready to start to port it to a recent kernel, but it is
still all system call based and has waitq's - probably much slower than
what could be achieved through queue pairs. Assuming userspace would not
want a polling thread, but would want a notification similar to
io_uring_enter(), there would still be a thread that needs to be woken
up; maybe that is where wake_up_sync() would help.
Btw, the optional kernel polling thread in io_uring also has spinning...
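
To make the "io_uring the other way around" idea a bit more concrete,
here is a minimal single-process sketch of such a queue pair. It is not
an existing fuse or io_uring interface: struct ring, struct queue_pair,
daemon_poll(), RING_ENTRIES and REPLY_DONE are hypothetical names, and
plain 64-bit values stand in for request and reply descriptors. Each
ring is single-producer/single-consumer, which is all that is assumed
here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u            /* hypothetical size, power of two */
#define REPLY_DONE   (1ULL << 63)  /* hypothetical "completed" flag */

struct ring {
	_Atomic uint32_t head;         /* advanced by the consumer */
	_Atomic uint32_t tail;         /* advanced by the producer */
	uint64_t slots[RING_ENTRIES];  /* stand-in for descriptors */
};

static void ring_init(struct ring *r)
{
	assert((RING_ENTRIES & (RING_ENTRIES - 1)) == 0);
	atomic_init(&r->head, 0);
	atomic_init(&r->tail, 0);
}

/* Number of free slots as seen by the producer. */
static uint32_t ring_space(struct ring *r)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	return RING_ENTRIES - (tail - head);
}

/* Producer side; returns false if the ring is full. */
static bool ring_push(struct ring *r, uint64_t v)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

	if (ring_space(r) == 0)
		return false;
	r->slots[tail & (RING_ENTRIES - 1)] = v;
	/* Publish the slot contents before the new tail becomes visible. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}

/* Consumer side; returns false if the ring is empty. */
static bool ring_pop(struct ring *r, uint64_t *v)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head == tail)
		return false;
	*v = r->slots[head & (RING_ENTRIES - 1)];
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

struct queue_pair {
	struct ring sq;  /* "kernel" fills, daemon consumes */
	struct ring cq;  /* daemon fills, "kernel" consumes */
};

/*
 * Daemon side: consume requests from the SQ and post one reply per
 * request to the CQ, without any system call in between. A request is
 * only taken while the CQ has room for its reply. Returns the number
 * of requests handled.
 */
static unsigned daemon_poll(struct queue_pair *qp)
{
	unsigned handled = 0;
	uint64_t req;

	while (ring_space(&qp->cq) > 0 && ring_pop(&qp->sq, &req)) {
		ring_push(&qp->cq, req | REPLY_DONE);
		handled++;
	}
	return handled;
}
```

A notification path (the io_uring_enter()-like case) would sit on top
of this and is left out of the sketch.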
Bernd