On Wed, May 29, 2024 at 9:01 PM Bernd Schubert <bschubert@xxxxxxx> wrote:
>
> From: Bernd Schubert <bschubert@xxxxxxx>
>
> This adds support for uring communication between kernel and
> userspace daemon using the opcode IORING_OP_URING_CMD. The basic
> approach was taken from ublk. The patches are in RFC state,
> some major changes are still to be expected.
>
> Motivation for these patches is all to increase fuse performance.
> In fuse-over-io-uring requests avoid core switching (application
> on core X, processing of fuse server on random core Y) and use
> shared memory between kernel and userspace to transfer data.
> Similar approaches have been taken by ZUFS and FUSE2, though
> not over io-uring, but through ioctl IOs
>
> https://lwn.net/Articles/756625/
> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2
>
> Avoiding cache line bouncing / numa systems was discussed
> between Amir and Miklos before and Miklos had posted
> part of the private discussion here
> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@xxxxxxxxxxxxxx/
>
> This cache line bouncing should be addressed by these patches
> as well.
>
> I had also noticed waitq wake-up latencies in fuse before
> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@xxxxxxxxxxx/T/
>
> This spinning approach helped with performance (>40% improvement
> for file creates), but due to random server side thread/core utilization
> spinning cannot be well controlled in /dev/fuse mode.
> With fuse-over-io-uring requests are handled on the same core
> (sync requests) or on core+1 (large async requests) and performance
> improvements are achieved without spinning.
>
> Splice/zero-copy is not supported yet, Ming Lei is working
> on io-uring support for ublk_drv, but I think so far there
> is no final agreement on the approach to be taken yet.
> Fuse-over-io-uring runs significantly faster than reads/writes
> over /dev/fuse, even with splice enabled, so missing zc
> should not be a blocking issue.
>
> The patches have been tested with multiple xfstest runs in a VM
> (32 cores) with a kernel that has several debug options
> enabled (like KASAN and MSAN).
> For some tests xfstests reports that O_DIRECT is not supported,
> I need to investigate that. Interesting part is that exactly
> these tests fail in plain /dev/fuse posix mode. I had to disable
> generic/650, which is enabling/disabling cpu cores - given ring
> threads are bound to cores issues with that are not totally
> unexpected, but then there are (scheduler) kernel messages that
> core binding for these threads is removed - this needs
> to be further investigated.
> Nice effect in io-uring mode is that tests run faster (like
> generic/522 ~2400s /dev/fuse vs. ~1600s patched), though still
> slow as this is with ASAN/leak-detection/etc.
>
> The corresponding libfuse patches are on my uring branch,
> but need cleanup for submission - will happen during the next
> days.
> https://github.com/bsbernd/libfuse/tree/uring
>
> If it should make review easier, patches posted here are on
> this branch
> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2
>
> TODO list for next RFC versions
> - Let the ring configure ioctl return information, like mmap/queue-buf size
> - Request kernel side address and len for a request - avoid calculation in userspace?
> - multiple IO sizes per queue (avoiding a calculation in userspace is probably even
>   more important)
> - FUSE_INTERRUPT handling?
> - Logging (adds fields in the ioctl and also ring-request),
>   any mismatch between client and server is currently very hard to understand
>   through error codes
>
> Future work
> - notifications, probably on their own ring
> - zero copy
>
> I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023,
> which resulted in some tuning patches (at the end of the
> patch series).
>
> Some benchmark results
> ======================
>
> System used for the benchmark is a 32 core (HyperThreading enabled)
> Xeon E5-2650 system. I don't have local disks attached that could do
> >5GB/s IOs, for paged and dio results a patched version of passthrough-hp
> was used that bypasses final reads/writes.
>
> paged reads
> -----------
>              128K IO size                 1024K IO size
> jobs   /dev/fuse   uring   gain     /dev/fuse   uring   gain
>    1        1117    1921   1.72          1902    1942   1.02
>    2        2502    3527   1.41          3066    3260   1.06
>    4        5052    6125   1.21          5994    6097   1.02
>    8        6273   10855   1.73          7101   10491   1.48
>   16        6373   11320   1.78          7660   11419   1.49
>   24        6111    9015   1.48          7600    9029   1.19
>   32        5725    7968   1.39          6986    7961   1.14
>
> dio reads (1024K)
> -----------------
>
> jobs   /dev/fuse   uring   gain
>    1        2023    3998   2.42
>    2        3375    7950   2.83
>    4        3823   15022   3.58
>    8        7796   22591   2.77
>   16        8520   27864   3.27
>   24        8361   20617   2.55
>   32        8717   12971   1.55
>
> mmap reads (4K)
> ---------------
> (sequential, I probably should have made it random, sequential exposes
> a rather interesting/weird 'optimized' memcpy issue - sequential becomes
> reversed order 4K read)
> https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@xxxxxxxxxxx/
>
> jobs   /dev/fuse   uring   gain
>    1         130     323   2.49
>    2         219     538   2.46
>    4         503    1040   2.07
>    8        1472    2039   1.38
>   16        2191    3518   1.61
>   24        2453    4561   1.86
>   32        2178    5628   2.58
>
> (Results on request, setting MAP_HUGETLB much improves performance
> for both, io-uring mode then has a slight advantage only.)
>
> creates/s
> ----------
> threads   /dev/fuse    uring   gain
>       1        3944    10121   2.57
>       2        8580    24524   2.86
>       4       16628    44426   2.67
>       8       46746    56716   1.21
>      16       79740   102966   1.29
>      20       80284   119502   1.49
>
> (the gain drop with >=8 cores needs to be investigated)

Hi Bernd,

Those are impressive results!

When approaching the FUSE uring feature from marketing POV,
I think that putting the emphasis on metadata operations is the
best approach.

Not that the dio reads are not important (I know that is part of
your use case), but I imagine there are a lot more people out there
waiting for improvement in metadata operations overhead.

To me it helps to know what the current main pain points are
for people using FUSE filesystems wrt performance.

Although it may not be up to date, the most comprehensive
study about FUSE performance overhead is this FAST17 paper:

https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

In this paper, table 3 summarizes the different overheads observed
per workload. According to this table, the workloads that degrade
performance the most on an optimized passthrough fs over SSD are:
- many file creates
- many file deletes
- many small file reads
In all these workloads, it was millions of files over many directories.
The highest performance regression reported was -83% on many
small file creations.

The moral of this long story is that it would be nice to know
what performance improvement FUSE uring can aspire to.
This is especially relevant for people that would be interested
in combining the benefits of FUSE passthrough (for data) and
FUSE uring (for metadata).

What did passthrough_hp do in your patched version with creates?
Did it actually create the files?
In how many directories?
Maybe the directory inode lock impeded performance improvement
with >=8 threads?
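
One way to check the directory-lock question above would be to run the
same create workload twice on the mount under test: once with all threads
creating in a single directory, and once with one directory per thread.
The sketch below is only an illustration of that comparison. It is not the
benchmark used for the numbers in this thread, and the names
(create_files, creates_per_second, the shared_dir flag) are made up here.
It also assumes that Python's own per-call overhead is acceptable for a
rough comparison; a C tool would be needed for absolute numbers.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def create_files(directory, count, tag):
    """Create `count` empty files named after `tag` inside `directory`."""
    for i in range(count):
        path = os.path.join(directory, f"{tag}-{i}")
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_EXCL, 0o644)
        os.close(fd)


def creates_per_second(root, threads, files_per_thread, shared_dir):
    """Return the aggregate create rate for one run under `root`.

    shared_dir=True puts every thread in one directory, so all creates
    go through the same parent directory inode. shared_dir=False gives
    each thread its own directory.
    """
    dirs = []
    for t in range(threads):
        d = os.path.join(root, "shared" if shared_dir else f"dir-{t}")
        os.makedirs(d, exist_ok=True)
        dirs.append(d)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(create_files, dirs[t], files_per_thread, f"t{t}")
            for t in range(threads)
        ]
        for f in futures:
            f.result()  # re-raise any error from a worker
    elapsed = time.perf_counter() - start
    return threads * files_per_thread / elapsed


if __name__ == "__main__":
    # Assumption: replace the temp dir with a path on the FUSE mount under
    # test. A temp dir is used here only so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as root:
        for shared in (True, False):
            run_dir = os.path.join(root, f"run-shared-{shared}")
            os.makedirs(run_dir)
            rate = creates_per_second(run_dir, threads=8,
                                      files_per_thread=1000,
                                      shared_dir=shared)
            print(f"shared_dir={shared}: {rate:.0f} creates/s")
```

If the per-thread-directory run scales noticeably better than the
shared-directory run at 8 threads and above, that would point at parent
directory locking rather than at the ring itself.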
>
> Remaining TODO list for RFCv3:
> --------------------------------
> 1) Let the ring configure ioctl return information,
>    like mmap/queue-buf size
>
> Right now libfuse and kernel have lots of duplicated setup code
> and any kind of pointer/offset mismatch results in a non-working
> ring that is hard to debug - probably better when the kernel does
> the calculations and returns that to server side
>
> 2) In combination with 1, ring requests should retrieve their
>    userspace address and length from kernel side instead of
>    calculating it through the mmaped queue buffer on their own.
>    (Introduction of FUSE_URING_BUF_ADDR_FETCH)
>
> 3) Add log buffer into the ioctl and ring-request
>
> This is to provide better error messages (instead of just
> errno)
>
> 4) Multiple IO sizes per queue
>
> Small IOs and metadata requests do not need large buffer sizes,
> we need multiple IO sizes per queue.
>
> 5) FUSE_INTERRUPT handling
>
> These are not handled yet, kernel side is probably not difficult
> anymore as ring entries take fuse requests through lists.
>
> Long term TODO:
> --------------
> Notifications through io-uring, maybe with a separated ring,
> but I'm not sure yet.

Is that going to improve performance in any real life workload?

Thanks,
Amir.

>
> Changes since RFCv1
> -------------------
> - No need to hold the task of the server side anymore. Also no
>   ioctls/threads waiting for shutdown anymore. Shutdown now works
>   more like the traditional fuse way.
> - Each queue clones the fuse and device release makes an exception
>   for io-uring. Reason is that queued IORING_OP_URING_CMD
>   (through .uring_cmd) prevents a device release. I.e. a killed
>   server side typically triggers fuse_abort_conn(). This was the
>   reason for the async stop-monitor in v1 and reference on the daemon
>   task. However it was very racy and annotated immediately by Miklos.
> - In v1 the offset parameter to mmap was identifying the QID, in v2
>   server side is expected to send mmap from a core bound ring thread
>   in numa mode and numa node is taken through the core of that thread.
>   Kernel side of the mmap buffer is stored in an rbtree and assigned
>   to the right qid through an additional queue ioctl.
> - Release of IORING_OP_URING_CMD is done through lists now, instead
>   of iterating over the entire array of queues/entries and does not
>   depend on the entry state anymore (a bit of the state is still left
>   for sanity check).
> - Finding free ring queue entries is done through lists and not through
>   a bitmap anymore
> - Many other code changes and bug fixes
> - Performance tunings
>
> ---
> Bernd Schubert (19):
>       fuse: rename to fuse_dev_end_requests and make non-static
>       fuse: Move fuse_get_dev to header file
>       fuse: Move request bits
>       fuse: Add fuse-io-uring design documentation
>       fuse: Add a uring config ioctl
>       Add a vmalloc_node_user function
>       fuse uring: Add an mmap method
>       fuse: Add the queue configuration ioctl
>       fuse: {uring} Add a dev_release exception for fuse-over-io-uring
>       fuse: {uring} Handle SQEs - register commands
>       fuse: Add support to copy from/to the ring buffer
>       fuse: {uring} Add uring sqe commit and fetch support
>       fuse: {uring} Handle uring shutdown
>       fuse: {uring} Allow to queue to the ring
>       export __wake_on_current_cpu
>       fuse: {uring} Wake requests on the the current cpu
>       fuse: {uring} Send async requests to qid of core + 1
>       fuse: {uring} Set a min cpu offset io-size for reads/writes
>       fuse: {uring} Optimize async sends
>
>  Documentation/filesystems/fuse-io-uring.rst |  167 ++++
>  fs/fuse/Kconfig                             |   12 +
>  fs/fuse/Makefile                            |    1 +
>  fs/fuse/dev.c                               |  310 +++++--
>  fs/fuse/dev_uring.c                         | 1232 +++++++++++++++++++++++++++
>  fs/fuse/dev_uring_i.h                       |  395 +++++++++
>  fs/fuse/file.c                              |   15 +-
>  fs/fuse/fuse_dev_i.h                        |   67 ++
>  fs/fuse/fuse_i.h                            |    9 +
>  fs/fuse/inode.c                             |    3 +
>  include/linux/vmalloc.h                     |    1 +
>  include/uapi/linux/fuse.h                   |  135 +++
>  kernel/sched/wait.c                         |    1 +
>  mm/nommu.c                                  |    6 +
>  mm/vmalloc.c                                |   41 +-
>  15 files changed, 2330 insertions(+), 65 deletions(-)
> ---
> base-commit: dd5a440a31fae6e459c0d6271dddd62825505361
> change-id: 20240529-fuse-uring-for-6-9-rfc2-out-f0a009005fdf
>
> Best regards,
> --
> Bernd Schubert <bschubert@xxxxxxx>
>