Sorry for my late reply, I'm on vacation and visiting family this week.
On 4/25/22 10:37, Miklos Szeredi wrote:
On Fri, 22 Apr 2022 at 17:46, Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:
[I removed the failing netapp/zufs CCs]
On 4/22/22 14:25, Miklos Szeredi wrote:
On Mon, 28 Mar 2022 at 15:21, Bernd Schubert <bschubert@xxxxxxx> wrote:
I would like to discuss the user thread wake-up latency in
fuse_dev_do_read(). Profiling fuse shows there is room for improvement
regarding memory copies and splice. Basic profiling with flame graphs
didn't reveal, though, why fuse (with an overlay file system) is so much
slower than accessing the underlying file system directly, nor why a
single-threaded fuse uses less than 100% cpu, with the application on
top of fuse also using less than 100% cpu (simple bonnie++ runs with 1B
files).
So I started to suspect the wait queues, and indeed, keeping the thread
that reads the fuse device for work running for some time gives quite
some improvement.
Might be related: I experimented with wake_up_sync(), which didn't meet
my expectations. See this thread:
https://lore.kernel.org/all/1638780405-38026-1-git-send-email-quic_pragalla@xxxxxxxxxxx/#r
Possibly fuse needs some wake up tweaks due to its special scheduling
requirements.
Thanks, I will look at that as well. I have a patch that spins and
avoids the thread wake-up; it is almost complete, and in my (still
limited) testing it barely takes more CPU while improving metadata /
bonnie performance by a factor between ~1.9 and 3, depending on which
performance mode the cpu is in.
https://github.com/aakefbs/linux/commits/v5.17-fuse-scheduling3
Missing is just another option for the wake-queue-size trigger and the
handling of signals. It should be ready once I'm done with my other work.
Trying to understand what is being optimized here... does the
following correctly describe your use case?
- an I/O thread is submitting synchronous requests (direct I/O?)
- the fuse thread always goes to sleep, because the request queue is
empty (there's never more than a single request on the queue)
- with this change the fuse thread spins for a jiffy before going to
sleep, and by that time the I/O thread will submit a new sync request.
- the I/O thread does not spin while the fuse thread is processing
the request, so it still goes to sleep.
Yes, this describes it well. We basically noticed weird effects with
multiple fuse threads when you had asked for benchmarks of the atomic
create/open patches. In our HPC world the standard for such benchmarks
is mdtest, but for simplicity I personally prefer bonnie++, e.g.
"bonnie++ -s0 -n10:1:1:10 -d <dest-path>".
Initial results were rather confusing, as a reduced number of requests
could result in lower performance. So I started to investigate and found
a number of issues:
1) passthrough_ll uses a singly linked list to store inodes; we later
switched to passthrough_hp, which uses a C++ map to avoid the O(N)
inode search.
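For illustration, this is roughly the pattern that hurts - a simplified
sketch, not the actual passthrough_ll code (names and fields trimmed
down):

/* Sketch only: simplified from the passthrough_ll idea, not the exact code. */
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

struct lo_inode {
	struct lo_inode *next;	/* singly linked list of all known inodes */
	ino_t ino;
	dev_t dev;
	uint64_t refcount;
};

/* Every LOOKUP walks the whole list, so the cost grows with the number
 * of cached inodes; with bonnie++ creating many small files this quickly
 * adds up. passthrough_hp keys the same data by (ino, dev) in a C++ map
 * instead, so the search is no longer linear. */
static struct lo_inode *find_inode(struct lo_inode *head, const struct stat *st)
{
	for (struct lo_inode *p = head; p; p = p->next)
		if (p->ino == st->st_ino && p->dev == st->st_dev)
			return p;
	return NULL;
}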
2) Limiting the number of threads in libfuse via the max_idle_threads
variable caused additional high cpu usage - there was constant pthread
creation/destruction. I have submitted patches for that (an additional
difficulty is fixing the API to avoid uninitialized struct members in
libfuse3).
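The churn pattern looks roughly like this - again just a sketch to show
the effect, not the actual libfuse loop code, and the request helper is
hypothetical:

#include <pthread.h>
#include <stddef.h>

struct pool {
	pthread_mutex_t lock;
	int numavail;	/* idle workers */
	int numworkers;	/* all workers */
	int max_idle;	/* cf. max_idle_threads */
};

/* hypothetical stand-in for picking up and handling one fuse request */
void dequeue_and_process_request(struct pool *p);

static void *worker(void *data)
{
	struct pool *p = data;

	for (;;) {
		dequeue_and_process_request(p);

		pthread_mutex_lock(&p->lock);
		p->numavail++;
		if (p->numavail > p->max_idle) {
			/* This worker exits; the next burst of requests has
			 * to pthread_create() a replacement, so under a
			 * steady load just above the limit threads get
			 * created and destroyed all the time. */
			p->numavail--;
			p->numworkers--;
			pthread_mutex_unlock(&p->lock);
			return NULL;
		}
		pthread_mutex_unlock(&p->lock);
	}
}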
3) There is some overhead with splice for small requests like metadata,
even though libfuse already tries to use splice for larger requests
only: unless disabled, it still does a splice system call for the
request header - enough to introduce a perf penalty. I have some very
experimental patches for that as well, although it got much more
difficult than I had initially hoped. With these patches applied I
started to profile the system with flame graphs and noticed that
performance is much lower than could be explained by the fuse cpu
overhead alone.
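To illustrate the trade-off, a generic sketch of the reply side - not
libfuse internals, and the helper name and threshold are made up:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#define SMALL_REQ_CUTOFF 4096	/* made-up threshold */

/* Below the cutoff a single writev() of header + payload is cheaper than
 * the zero-copy path; the extra splice()/vmsplice() syscalls are exactly
 * what hurts for small metadata requests. */
static ssize_t send_reply(int fuse_fd, struct iovec *iov, int iovcnt,
			  size_t total, int pipe_fds[2])
{
	if (total < SMALL_REQ_CUTOFF)
		return writev(fuse_fd, iov, iovcnt);	/* one syscall, one copy */

	/* zero-copy path: stage the iovecs in the pipe, then splice them
	 * to /dev/fuse */
	ssize_t res = vmsplice(pipe_fds[1], iov, iovcnt, 0);
	if (res < 0)
		return res;
	return splice(pipe_fds[0], NULL, fuse_fd, NULL, res, SPLICE_F_MOVE);
}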
4) Then I figured out the waitq overhead. In the meantime I'm rather
surprised about the zufs results - had those benchmarks been done with
n-application-threads >= 2 x n-zufs-threads? Using thread spinning might
avoid the issue, but with a request queue per core, in the worst case
all system cores might go a bit into spinning mode - not ideal for
embedded systems, nor for power consumption on laptops or phones, nor
for HPC systems where the cores are supposed to be busy doing
calculations.
4.1) A sub-problem of the waitq is the sleep condition - it checks
whether there are pending requests, so threads on different cores
randomly wake up, even with the explicit thread wake-up avoided as in my
patches.
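To make 4) and 4.1) a bit more concrete, this is roughly the shape of
the spinning idea - a simplified sketch, not the code in the branch
linked above (the real code also has to take fiq->lock and re-check
fiq->connected before dequeueing):

static int fuse_wait_for_request(struct fuse_iqueue *fiq)
{
	unsigned long stop = jiffies + 1;	/* spin for at most ~one jiffy */

	while (time_before(jiffies, stop)) {
		if (!list_empty(&fiq->pending))
			return 0;		/* got work without sleeping */
		if (need_resched())
			break;			/* give the cpu up early */
		cpu_relax();
	}
	/* fall back to the normal sleep path */
	return wait_event_interruptible_exclusive(fiq->waitq,
					!list_empty(&fiq->pending));
}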
Right now I'm at a point where I see that my patches help to improve
performance, but I'm not totally happy with the solution myself. That is
basically why I believe that an SQ/CQ (submission/completion queue)
approach would give better performance and should/might avoid additional
complexity. At a minimum, the request queue (SQ) spinning could then be
controlled/handled entirely in user space.
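Something along these lines, purely as a sketch of the direction - none
of this exists yet and all names are made up:

#include <stdint.h>

struct fuse_ring_entry {
	uint64_t unique;	/* request id */
	uint32_t opcode;
	uint32_t len;
	uint64_t buf_offset;	/* into a shared request buffer */
};

/* A pair of such rings (SQ and CQ) mmap'ed between the kernel and the
 * fuse daemon, in the spirit of io_uring. */
struct fuse_ring {
	uint32_t head;		/* consumer index */
	uint32_t tail;		/* producer index */
	uint32_t mask;		/* ring size - 1 */
	uint32_t flags;		/* e.g. "daemon is polling, no wakeup needed" */
	struct fuse_ring_entry entries[];
};

The daemon would busy-poll the SQ tail for a while and only signal via
the flags word that it is going to sleep and needs a wakeup - so the
spinning policy would live entirely in user space.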
Just need to find the time to code it...
Bernd